DETAILED ACTION
1. 	This action is in response to amendments and arguments filed 2 September 2022 for application 16/783047 filed 5 February 2020.  Currently claims 1-4 and 6-20 are pending.  Claim 5 has been canceled. 

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Response to Arguments
Applicant's arguments filed 2 September 2022 have been fully considered but they are not persuasive. 

Specifically, Applicants argue (Docket No. 191370):
Paragraph 36 recites, in part, "when all the operation tasks (e.g., the computation performed by the PEs 110, the data transmissions performed by the DMA engine 131, etc.) configured in a one-time manner by the MCU 133 are completed, the MCU 133 configures the next round of operation tasks for the NoC structure. No matter whether the operation tasks are performed by the PEs 110 or the DMA engine 131, the MCU 133 is notified as long as each operation task is completed ... As long as the MCU 133 is notified of the fact that the current round of operation tasks performed by the PEs 110 and the DMA engine 131 is completed or learns that the registers of each PE 110 and the DMA engine 131 have completed the operation tasks, the MCU 133 then configures the next round of operation tasks." … As shown in Fig. 1A, the configuration module 130 includes the MCU. Therefore, the MCU is included in the pipeline network. As discussed above, paragraph 42 of Li discloses that the value of one PE is written to another PE through the pipeline network, which includes the MCU (see Li paragraph 31 and Fig. 1A). Therefore, the writing of the results of the computations from PE0 to the VMs of PE as disclosed in paragraph 42, is not performed independently of a host processor because the MCU is a type of host processor that configures each round of operation tasks. … Finally, as discussed in paragraph 42, the operation tasks are performed through the pipeline network, which includes the MCU. Therefore, Li fails to disclose that "routing the intermediate inference request results directly between the first inference accelerator and the second inference accelerator is performed independently of the host processor," as recited in claim 1.

Examiner’s Response:
The Examiner respectfully disagrees, noting that a claim must be given its broadest reasonable interpretation (BRI) consistent with the specification. M.P.E.P. 2173.01(I), M.P.E.P. 2111.01(II). As set forth in both the 3 June 2022 NOFA and in the current office action, Li teaches “the routing of the intermediate inference request results being performed independently of the host processor” because it teaches that the routing of the (intermediate) results from a computational node/accelerator (layer) to a subsequent computational node/accelerator (layer) is performed by directly conveying the results of the computation from one node/accelerator to the memory of a second node/accelerator through a pipeline architecture in which the conveyance of that data does not pass through the host processor (system memory/configuration module), as can be seen in Figure 7. This is consistent with the BRI of “independent” in the context of conveying the intermediate results data. Any status-related communication between the nodes/accelerators and the host processor (such as might simply indicate that the data has been conveyed) does not diminish, in the BRI sense, that the routing/conveyance of that data is independent of the host processor (viz., [0042, 0048, Figure 7, Figure 11A, Figure 11B] The PE 110 (PE2) directly writes the results of computations (e.g., the computation results of the third layer of the NN computation) on the values recorded in the VMs 116 and 117 (VM0 and VM1) into the VM 116 (VM0) of the PE 110 (PE3) (corresponding to the auxiliary memory 115 located in the left portion of FIG. 7) through the pipeline network. The PE 110 (PE3) directly writes the results of computations (e.g., the computation results of the fourth layer of the NN computation) on the values recorded in the VMs 117 and 118 (VM1 and VM2) into the system memory 120 through the aforesaid retrieval network. FIG. 11A and FIG. 11B exemplarily illustrate data flow computations implemented by the single-port VMs 116-118 and the PEs 110 connectable to the NoC structure. In this example, the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113, given that the VM 116 (VM) has already stored the weight (the DMA transmission of the weight is the same as that depicted in FIG. 4B). … Therefore, the PE 110 directly outputs the operation result to the VMs 118 (VM2) of the next PEs 110 (PE1-PE3) or the system memory 120. … At the same time, the PE 110 (PE2) directly outputs the operation result (e.g., the computation results of the third layer of the NN computation performed on the data before the previous data) to the VM 118 (VM2) of the PE 110 (PE3). At the same time, the PE 110 (PE1) directly outputs the operation result (e.g., the computation results of the fourth layer of the NN computation performed on the foremost data) to the system memory 120.)

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.



Claims 1-3, 7, 11-13, and 17-19 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Li et al. (US2019/0286974, Filed 11 June 2018), hereinafter referred to as Li. 


In regards to claim 1, Li teaches A method for accelerating machine learning on a computing device comprising a first inference accelerator, a second inference accelerator, and a host processor, the method comprising: hosting a neural network in a first inference accelerator and a second inference accelerator, the neural network split between the first inference accelerator and the second inference accelerator; ([0026, 0039, 0040, Figure 1, Figure 2, Figure 7] Please refer to FIG. 1A and FIG. 2. FIG. 2 is a schematic view of a computation node 100 in a NoC structure constituted by one PE 110 and the corresponding auxiliary memory 115. In the present embodiment, in order to better adapt the PE 110 to the NN computation, the PE 110 may be an application-specific integrated circuit (ASIC) of an artificial intelligence (AI) accelerator, e.g., a tensor processor, a neural network processor (NNP), a neural engine, and so on. In another aspect, the NN structure includes several software layers (e.g., the aforesaid convolutional layer, an activation layer, a pooling layer, a fully connected layer, and so on). Computations of data are performed in each software layer, and the computation results are then input to the next software layer. According to this concept as well as the aforesaid NoC structure of the processing circuit 1, a channel mapping-data flow computation mode is provided herein. That is, each computation node 100 corresponds to one software layer, and the computation nodes 100 are connected through the NoC interface 113 to form a pipeline. The PEs 110 in each computation node 100 completes the NN computations in each software layer through the pipeline. Similarly, the allocation of the operation tasks of each computation node 100 is done in advance and stored in the MCU 133, wherein a neural network (e.g., CNN) is implemented/hosted in a network-on-chip framework comprising a set of accelerators/computational nodes allocated to perform neural network computations (inferences) split across multiple accelerators in which each accelerator includes a processing element (e.g., ASIC, FPGA) and such that each accelerator/computational node performs/hosts the distinct (but reconfigurable) computations associated with a portion of the neural network (e.g., a CNN layer according to a configuration module), and wherein it is noted that this framework also includes a host processor (e.g., with system memory and a configuration module such as shown in Figure 7).)
routing intermediate inference request results directly between the first inference accelerator and the second inference accelerator, ([0042, 0048, Figure 7, Figure 11A, Figure 11B] The PE 110 (PE2) directly writes the results of computations (e.g., the computation results of the third layer of the NN computation) on the values recorded in the VMs 116 and 117 (VM0 and VM1) into the VM 116 (VM0) of the PE 110 (PE3) (corresponding to the auxiliary memory 115 located in the left portion of FIG. 7) through the pipeline network. The PE 110 (PE3) directly writes the results of computations (e.g., the computation results of the fourth layer of the NN computation) on the values recorded in the VMs 117 and 118 (VM1 and VM2) into the system memory 120 through the aforesaid retrieval network. FIG. 11A and FIG. 11B exemplarily illustrate data flow computations implemented by the single-port VMs 116-118 and the PEs 110 connectable to the NoC structure. In this example, the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113, given that the VM 116 (VM) has already stored the weight (the DMA transmission of the weight is the same as that depicted in FIG. 4B). … Therefore, the PE 110 directly outputs the operation result to the VMs 118 (VM2) of the next PEs 110 (PE1-PE3) or the system memory 120. … At the same time, the PE 110 (PE2) directly outputs the operation result (e.g., the computation results of the third layer of the NN computation performed on the data before the previous data) to the VM 118 (VM2) of the PE 110 (PE3). At the same time, the PE 110 (PE1) directly outputs the operation result (e.g., the computation results of the fourth layer of the NN computation performed on the foremost data) to the system memory 120, wherein the (intermediate) output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) is sent/routed directly (pipelined) to a second agent/computational node via NoC pipeline configuration.)
the routing of the intermediate inference request results being performed independently of the host processor; ([0042, 0048, Figure 7, Figure 11A, Figure 11B] The PE 110 (PE2) directly writes the results of computations (e.g., the computation results of the third layer of the NN computation) on the values recorded in the VMs 116 and 117 (VM0 and VM1) into the VM 116 (VM0) of the PE 110 (PE3) (corresponding to the auxiliary memory 115 located in the left portion of FIG. 7) through the pipeline network. The PE 110 (PE3) directly writes the results of computations (e.g., the computation results of the fourth layer of the NN computation) on the values recorded in the VMs 117 and 118 (VM1 and VM2) into the system memory 120 through the aforesaid retrieval network. FIG. 11A and FIG. 11B exemplarily illustrate data flow computations implemented by the single-port VMs 116-118 and the PEs 110 connectable to the NoC structure. In this example, the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113, given that the VM 116 (VM) has already stored the weight (the DMA transmission of the weight is the same as that depicted in FIG. 4B). … Therefore, the PE 110 directly outputs the operation result to the VMs 118 (VM2) of the next PEs 110 (PE1-PE3) or the system memory 120. … At the same time, the PE 110 (PE2) directly outputs the operation result (e.g., the computation results of the third layer of the NN computation performed on the data before the previous data) to the VM 118 (VM2) of the PE 110 (PE3). At the same time, the PE 110 (PE1) directly outputs the operation result (e.g., the computation results of the fourth layer of the NN computation performed on the foremost data) to the system memory 120, wherein the routing of the (intermediate) results from a computational node/accelerator (layer) to a subsequent computational node/accelerator (layer) is independent of the host processor/system memory since it is based on the direct pipelining of the (auxiliary) memories of the respective accelerators/computational nodes (for instance, Figure 7 shows this direct routing from one PE to another while bypassing the host processor (system memory/configuration module)).)
and generating a final inference request result from the intermediate inference request results. ([0042, 0048, Figure 7, Figure 11A, Figure 11B] The PE 110 (PE2) directly writes the results of computations (e.g., the computation results of the third layer of the NN computation) on the values recorded in the VMs 116 and 117 (VM0 and VM1) into the VM 116 (VM0) of the PE 110 (PE3) (corresponding to the auxiliary memory 115 located in the left portion of FIG. 7) through the pipeline network. The PE 110 (PE3) directly writes the results of computations (e.g., the computation results of the fourth layer of the NN computation) on the values recorded in the VMs 117 and 118 (VM1 and VM2) into the system memory 120 through the aforesaid retrieval network. FIG. 11A and FIG. 11B exemplarily illustrate data flow computations implemented by the single-port VMs 116-118 and the PEs 110 connectable to the NoC structure. In this example, the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113, given that the VM 116 (VM) has already stored the weight (the DMA transmission of the weight is the same as that depicted in FIG. 4B). … At the same time, the PE 110 (PE1) directly outputs the operation result (e.g., the computation results of the fourth layer of the NN computation performed on the foremost data) to the system memory 120, wherein the result/output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) is generated at a computational node and sent/output to system memory (host processor), in which any output written to the system memory is being interpreted as being a result, especially if that output is the result of a terminal layer (e.g., the 4th layer in a forward propagation).)

In regards to claim 2, the rejection of claim 1 is incorporated and Li further teaches in which generating the final inference request result comprises generating the final inference request result by the second inference accelerator in response to the intermediate inference request results from the first inference accelerator. ([0042, 0048, Figure 7, Figure 11A, Figure 11B] The PE 110 (PE2) directly writes the results of computations (e.g., the computation results of the third layer of the NN computation) on the values recorded in the VMs 116 and 117 (VM0 and VM1) into the VM 116 (VM0) of the PE 110 (PE3) (corresponding to the auxiliary memory 115 located in the left portion of FIG. 7) through the pipeline network. The PE 110 (PE3) directly writes the results of computations (e.g., the computation results of the fourth layer of the NN computation) on the values recorded in the VMs 117 and 118 (VM1 and VM2) into the system memory 120 through the aforesaid retrieval network. FIG. 11A and FIG. 11B exemplarily illustrate data flow computations implemented by the single-port VMs 116-118 and the PEs 110 connectable to the NoC structure. In this example, the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113, given that the VM 116 (VM) has already stored the weight (the DMA transmission of the weight is the same as that depicted in FIG. 4B). … At the same time, the PE 110 (PE1) directly outputs the operation result (e.g., the computation results of the fourth layer of the NN computation performed on the foremost data) to the system memory 120, wherein the result/output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) that is generated at a (terminal) computational node (e.g., for a terminal layer) is based on the processing of intermediate results received from and previously generated by other computational nodes/accelerators such that the generation of this result occurs in response to the generation and reception of the intermediate results.)

In regards to claim 3, the rejection of claim 2 is incorporated and Li further teaches further comprising transmitting the final inference request result from the second inference accelerator directly to the host processor. ([0042, 0048, Figure 6A, Figure 7, Figure 11A, Figure 11B] The PE 110 (PE2) directly writes the results of computations (e.g., the computation results of the third layer of the NN computation) on the values recorded in the VMs 116 and 117 (VM0 and VM1) into the VM 116 (VM0) of the PE 110 (PE3) (corresponding to the auxiliary memory 115 located in the left portion of FIG. 7) through the pipeline network. The PE 110 (PE3) directly writes the results of computations (e.g., the computation results of the fourth layer of the NN computation) on the values recorded in the VMs 117 and 118 (VM1 and VM2) into the system memory 120 through the aforesaid retrieval network. FIG. 11A and FIG. 11B exemplarily illustrate data flow computations implemented by the single-port VMs 116-118 and the PEs 110 connectable to the NoC structure. In this example, the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113, given that the VM 116 (VM) has already stored the weight (the DMA transmission of the weight is the same as that depicted in FIG. 4B). … At the same time, the PE 110 (PE1) directly outputs the operation result (e.g., the computation results of the fourth layer of the NN computation performed on the foremost data) to the system memory 120, wherein the result/output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) is sent/output directly to the system memory of the host processor (i.e., the NoC encompasses a host processor that controls/configures the data flow operations, with the associated system memory storing results and feeding feature maps as input into the neural network).)

In regards to claim 7, the rejection of claim 1 is incorporated and Li further teaches further comprising: implementing a request queue for the second inference accelerator in a memory of a host processor of the computing device; and implementing the request queue for the first inference accelerator in the memory of the host processor of the computing device. ([0036, 0040] Besides, when all the operation tasks (e.g., the computation performed by the PEs 110, the data transmissions performed by the DMA engine 131, etc.) configured in a one-time manner by the MCU 133 are completed, the MCU 133 configures the next round of operation tasks for the NoC structure. No matter whether the operation tasks are performed by the PEs 110 or the DMA engine 131, the MCU 133 is notified as long as each operation task is completed, and the way to notify the MCU 133 may include the transmission of an interruption message to the MCU 133. The MCU 133 is equipped with a timer; when the time is up, the MCU 133 inquires in turn whether the registers of each PE 110 and the DMA engine 131 have completed the operation tasks. As long as the MCU 133 is notified of the fact that the current round of operation tasks performed by the PEs 110 and the DMA engine 131 is completed or learns that the registers of each PE 110 and the DMA engine 131 have completed the operation tasks, the MCU 133 then configures the next round of operation tasks. The configuration module 130 includes the MCU 133 and the DMA engine 131. The MCU 133 may control the DMA engine 131 to process the data transmissions between the system memory 120 and the auxiliary memories 115 and the data transmissions between the auxiliary memories 115 of two adjacent computation nodes … Four PEs 110 and the auxiliary memories 115 connected thereto form four computation nodes 100, and the configuration module 133 establishes a phase sequence for the computation nodes 100 according to the NN computation and instructs each of the computation nodes 100 to transmit data to another of the computation nodes 100 according to the phase sequence, wherein the configuration module (including the MCU (micro-control unit)) in the host processor determines in advance the configuration of the computations and data flow across the pipelined architecture such that the MCU includes a request queue in its memory that specifies a phase/sequence of operations and includes the functionality to determine when a given computational node/accelerator has completed a task so that a subsequent computational node/accelerator can initiate a task based on the results of the previous computational node/accelerator, and wherein it is noted that the MCU identifies completion of a task by a messaging process or by querying the registers of the given computational node/accelerator.)

In regards to claim 11, Li teaches A system for accelerating machine learning, comprising: a first inference accelerator and a second inference accelerator, the neural network split between the first inference accelerator and the second inference accelerator; 
([0026, 0039, 0040, Figure 1, Figure 2, Figure 7] Please refer to FIG. 1A and FIG. 2. FIG. 2 is a schematic view of a computation node 100 in a NoC structure constituted by one PE 110 and the corresponding auxiliary memory 115. In the present embodiment, in order to better adapt the PE 110 to the NN computation, the PE 110 may be an application-specific integrated circuit (ASIC) of an artificial intelligence (AI) accelerator, e.g., a tensor processor, a neural network processor (NNP), a neural engine, and so on. In another aspect, the NN structure includes several software layers (e.g., the aforesaid convolutional layer, an activation layer, a pooling layer, a fully connected layer, and so on). Computations of data are performed in each software layer, and the computation results are then input to the next software layer. According to this concept as well as the aforesaid NoC structure of the processing circuit 1, a channel mapping-data flow computation mode is provided herein. That is, each computation node 100 corresponds to one software layer, and the computation nodes 100 are connected through the NoC interface 113 to form a pipeline. The PEs 110 in each computation node 100 completes the NN computations in each software layer through the pipeline. Similarly, the allocation of the operation tasks of each computation node 100 is done in advance and stored in the MCU 133, wherein a neural network (e.g., CNN) is implemented/hosted in a network-on-chip framework/system comprising a set of accelerators/computational nodes allocated to perform neural network computations (inferences) split across multiple accelerators in which each accelerator includes a processing element (e.g., ASIC, FPGA) and such that each accelerator/computational node performs/hosts the distinct (but reconfigurable) computations associated with a portion of the neural network (e.g., a CNN layer according to a configuration module).)
a host processor to receive a final inference request result generated from intermediate inference request results; ([0042, 0048, Figure 7, Figure 11A, Figure 11B] The PE 110 (PE2) directly writes the results of computations (e.g., the computation results of the third layer of the NN computation) on the values recorded in the VMs 116 and 117 (VM0 and VM1) into the VM 116 (VM0) of the PE 110 (PE3) (corresponding to the auxiliary memory 115 located in the left portion of FIG. 7) through the pipeline network. The PE 110 (PE3) directly writes the results of computations (e.g., the computation results of the fourth layer of the NN computation) on the values recorded in the VMs 117 and 118 (VM1 and VM2) into the system memory 120 through the aforesaid retrieval network. FIG. 11A and FIG. 11B exemplarily illustrate data flow computations implemented by the single-port VMs 116-118 and the PEs 110 connectable to the NoC structure. In this example, the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113, given that the VM 116 (VM) has already stored the weight (the DMA transmission of the weight is the same as that depicted in FIG. 4B). … At the same time, the PE 110 (PE1) directly outputs the operation result (e.g., the computation results of the fourth layer of the NN computation performed on the foremost data) to the system memory 120, wherein the result/output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) is generated at a computational node and sent/output to system memory (host processor), in which any output written to the system memory is being interpreted as being a result, especially if that output is the result of a terminal layer (e.g., the 4th layer in a forward propagation).)
and a switch to route the intermediate inference request results directly between the first inference accelerator and the second inference accelerator,  ([0040, 0042, 0048, Figure 7, Figure 11A, Figure 11B] The MCU 133 may control the DMA engine 131 to process the data transmissions between the system memory 120 and the auxiliary memories 115 and the data transmissions between the auxiliary memories 115 of two adjacent computation nodes . Here , the data transmissions are DMA transmissions .,  The PE 110 ( PE2 ) directly writes the results of computations ( e . g . , the computation results of the third layer of the NN computation on the values recorded in the VMs 116 and 117 ( VMO and VM1 ) into the VM 116 ( VMO ) of the PE 110 ( PE3 ) ( corresponding to the auxiliary memory 115 located in the left portion of FIG . 7 ) through the pipeline network . The PE 110 ( PE3 ) directly writes the results of computations ( e . g . , the computation results of the fourth layer of the NN computation ) on the values recorded in the VMs 117 and 118 ( VM1 and VM2 ) into the system memory 120 through the aforesaid retrieval network ., FIG . 11A and FIG . 11B exemplarily illustrate data flow computations implemented by the single - port VMs 116 - 118 and the PES 110 connectable to the NoC structure . In this example , the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113 , given that the VM 116 ( VM ) has already stored the weight ( the DMA trans mission of the weight is the same as that depicted in FIG . 4B ) . …Therefore , the PE 110 directly outputs the operation result to the VMs 118 ( VM2 ) of the next PES 110 ( PE1 - PE3 ) or the system memory 120 . … At the same time , the PE 110 ( PE2 ) directly outputs the operation result ( e . g . 
, the computation results of the third layer of the NN computation performed on the data before the previous data ) to the VM 118 ( VM2 ) of the PE 110 ( PE3 ) . At the same time , the PE 110 ( PE1 ) directly outputs the operation result ( e . g . , the computation results of the fourth layer of the NN computation performed on the foremost data ) to the system memory 120 ., wherein the (intermediate) output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) is sent/routed directly (pipelined) to a second agent/computational node via NoC pipeline configuration according to the dataflow configured by the MCU control unit such that the crossbar is configured by the MCU to switch the data flow between different computational nodes (auxiliary memories).) the routing of the intermediate inference request results being performed independently of the host processor.  ([0042, 0048, Figure 7, Figure 11A, Figure 11B] The PE 110 ( PE2 ) directly writes the results of computations ( e . g . , the computation results of the third layer of the NN computation on the values recorded in the VMs 116 and 117 ( VMO and VM1 ) into the VM 116 ( VMO ) of the PE 110 ( PE3 ) ( corresponding to the auxiliary memory 115 located in the left portion of FIG . 7 ) through the pipeline network . The PE 110 ( PE3 ) directly writes the results of computations ( e . g . , the computation results of the fourth layer of the NN computation ) on the values recorded in the VMs 117 and 118 ( VM1 and VM2 ) into the system memory 120 through the aforesaid retrieval network ., FIG . 11A and FIG . 11B exemplarily illustrate data flow computations implemented by the single - port VMs 116 - 118 and the PES 110 connectable to the NoC structure . 
In this example , the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113 , given that the VM 116 ( VM ) has already stored the weight ( the DMA trans mission of the weight is the same as that depicted in FIG . 4B ) . …Therefore , the PE 110 directly outputs the operation result to the VMs 118 ( VM2 ) of the next PES 110 ( PE1 - PE3 ) or the system memory 120 . … At the same time , the PE 110 ( PE2 ) directly outputs the operation result ( e . g . , the computation results of the third layer of the NN computation performed on the data before the previous data ) to the VM 118 ( VM2 ) of the PE 110 ( PE3 ) . At the same time , the PE 110 ( PE1 ) directly outputs the operation result ( e . g . , the computation results of the fourth layer of the NN computation performed on the foremost data ) to the system memory 120 ., wherein the routing of the (intermediate) results from a computational node/accelerator (layer) to a subsequent computational node/accelerator (layer) is independent of the host processor/system memory since it is based on the direct pipelining of the (auxiliary) memories of the respective accelerators/computational nodes (for instance, Figure 7 shows this direct routing from one PE to another while bypassing the host processor (system memory/configuration module)).)

In regards to claim 12, the rejection of claim 11 is incorporated and Li further teaches in which the final inference request result is generated by the second inference accelerator in response to the intermediate inference request results from the first inference accelerator.  ([0042, 0048, Figure 7, Figure 11A, Figure 11B] The PE 110 ( PE2 ) directly writes the results of computations ( e . g . , the computation results of the third layer of the NN computation on the values recorded in the VMs 116 and 117 ( VMO and VM1 ) into the VM 116 ( VMO ) of the PE 110 ( PE3 ) ( corresponding to the auxiliary memory 115 located in the left portion of FIG . 7 ) through the pipeline network . The PE 110 ( PE3 ) directly writes the results of computations ( e . g . , the computation results of the fourth layer of the NN computation ) on the values recorded in the VMs 117 and 118 ( VM1 and VM2 ) into the system memory 120 through the aforesaid retrieval network ., FIG . 11A and FIG . 11B exemplarily illustrate data flow computations implemented by the single - port VMs 116 - 118 and the PES 110 connectable to the NoC structure . In this example , the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113 , given that the VM 116 ( VM ) has already stored the weight ( the DMA trans mission of the weight is the same as that depicted in FIG . 4B ) . … At the same time , the PE 110 ( PE1 ) directly outputs the operation result ( e . g . 
, the computation results of the fourth layer of the NN computation performed on the foremost data ) to the system memory 120 ., wherein the result/output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) that is generated at a (terminal) computational node (e.g., for a terminal layer) is based on the processing of intermediate results received from and previously generated by other computational nodes/accelerators such that the generation of this result occurs in response to the generation and reception of the intermediate results.)   

In regards to claim 13, the rejection of claim 12 is incorporated and Li further teaches in which the final inference request result are transmitted from the second inference accelerator directly to the host processor.  ([0042, 0048, Figure 6A, Figure 7, Figure 11A, Figure 11B] The PE 110 ( PE2 ) directly writes the results of computations ( e . g . , the computation results of the third layer of the NN computation on the values recorded in the VMs 116 and 117 ( VMO and VM1 ) into the VM 116 ( VMO ) of the PE 110 ( PE3 ) ( corresponding to the auxiliary memory 115 located in the left portion of FIG . 7 ) through the pipeline network . The PE 110 ( PE3 ) directly writes the results of computations ( e . g . , the computation results of the fourth layer of the NN computation ) on the values recorded in the VMs 117 and 118 ( VM1 and VM2 ) into the system memory 120 through the aforesaid retrieval network ., FIG . 11A and FIG . 11B exemplarily illustrate data flow computations implemented by the single - port VMs 116 - 118 and the PES 110 connectable to the NoC structure . In this example , the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113 , given that the VM 116 ( VM ) has already stored the weight ( the DMA trans mission of the weight is the same as that depicted in FIG . 4B ) . … At the same time , the PE 110 ( PE1 ) directly outputs the operation result ( e . g . 
, the computation results of the fourth layer of the NN computation performed on the foremost data ) to the system memory 120 ., wherein the result/output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) is sent/output directly to the system memory of the host processor (i.e., the NoC encompasses a host processor which controls/configures the data flow operations, with the associated system memory storing results and feeding feature maps as input into the neural network).)

In regards to claim 17, Li teaches A system for accelerating machine learning, comprising: a first inference accelerator and a second inference accelerator to host a neural network, the neural network split between the first inference accelerator and the second inference accelerator; ([0026, 0039, 0040, Figure 1, Figure 2, Figure 7] Please refer to FIG . 1A and FIG . 2 . FIG . 2 is a schematic view of a computation node 100 in a NoC structure constituted by one PE 110 and the corresponding auxiliary memory 115 . In the present embodiment , in order to better adapt the PE 110 to the NN computation , the PE 110 may be an application - specific integrated circuit ( ASIC ) of an artificial intelligence ( AI ) accelerator , e . g . , a tensor processor , a neural network processor ( NNP ) , a neural engine , and so on ., In another aspect , the NN structure includes several software layers ( e . g . , the aforesaid convolutional layer , an activation layer , a pooling layer , a fully connected layer , and so on ) . Computations of data are performed in each software layer , and the computation results are then input to the next software layer . According to this concept as well as the aforesaid NoC structure of the processing circuit 1 , a channel mapping - data flow computation mode is provided herein ., That is , each computation node 100 corresponds to one software layer , and the computation nodes 100 are connected through the NoC interface 113 to form a pipeline . The PES 110 in each computation node 100 completes the NN computations in each software layer through the pipeline . 
Similarly , the allocation of the operation tasks of each computation node 100 is done in advance and stored in the MCU 133 ., wherein a neural network (e.g., CNN) is implemented/hosted in a network-on-chip framework/system comprising a set of accelerators/computational nodes allocated to perform neural network computations (inferences) split across multiple accelerators in which each accelerator includes a processing element (e.g., ASIC, FPGA) and such that each accelerator/computational node performs/hosts the distinct (but reconfigurable) computations associated with a portion of the neural network (e.g., a CNN layer according to a configuration module).) a host processor to receive a final inference request result generated from intermediate inference request results;  ([0042, 0048, Figure 7, Figure 11A, Figure 11B]The PE 110 ( PE2 ) directly writes the results of computations ( e . g . , the computation results of the third layer of the NN computation on the values recorded in the VMs 116 and 117 ( VMO and VM1 ) into the VM 116 ( VMO ) of the PE 110 ( PE3 ) ( corresponding to the auxiliary memory 115 located in the left portion of FIG . 7 ) through the pipeline network . The PE 110 ( PE3 ) directly writes the results of computations ( e . g . , the computation results of the fourth layer of the NN computation ) on the values recorded in the VMs 117 and 118 ( VM1 and VM2 ) into the system memory 120 through the aforesaid retrieval network ., FIG . 11A and FIG . 11B exemplarily illustrate data flow computations implemented by the single - port VMs 116 - 118 and the PES 110 connectable to the NoC structure . In this example , the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113 , given that the VM 116 ( VM ) has already stored the weight ( the DMA trans mission of the weight is the same as that depicted in FIG . 4B ) . 
… At the same time , the PE 110 ( PE1 ) directly outputs the operation result ( e . g . , the computation results of the fourth layer of the NN computation performed on the foremost data ) to the system memory 120 . , wherein the result/output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) is generated at a computational node and sent/output to host processor (with system memory and a configuration module such as shown in Figure 7) in which any output written to the system memory is being interpreted as being a result, especially if that output is the result of terminal layer (e.g., the 4th layer in a forward propagation).) and means for routing the intermediate inference request results directly between the first inference accelerator and the second inference accelerator, ([0040, 0042, 0048, Figure 7, Figure 11A, Figure 11B] The MCU 133 may control the DMA engine 131 to process the data transmissions between the system memory 120 and the auxiliary memories 115 and the data transmissions between the auxiliary memories 115 of two adjacent computation nodes . Here , the data transmissions are DMA transmissions .,  The PE 110 ( PE2 ) directly writes the results of computations ( e . g . , the computation results of the third layer of the NN computation on the values recorded in the VMs 116 and 117 ( VMO and VM1 ) into the VM 116 ( VMO ) of the PE 110 ( PE3 ) ( corresponding to the auxiliary memory 115 located in the left portion of FIG . 7 ) through the pipeline network . The PE 110 ( PE3 ) directly writes the results of computations ( e . g . , the computation results of the fourth layer of the NN computation ) on the values recorded in the VMs 117 and 118 ( VM1 and VM2 ) into the system memory 120 through the aforesaid retrieval network ., FIG . 11A and FIG . 11B exemplarily illustrate data flow computations implemented by the single - port VMs 116 - 118 and the PES 110 connectable to the NoC structure . 
In this example , the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113 , given that the VM 116 ( VM ) has already stored the weight ( the DMA trans mission of the weight is the same as that depicted in FIG . 4B ) . …Therefore , the PE 110 directly outputs the operation result to the VMs 118 ( VM2 ) of the next PES 110 ( PE1 - PE3 ) or the system memory 120 . … At the same time , the PE 110 ( PE2 ) directly outputs the operation result ( e . g . , the computation results of the third layer of the NN computation performed on the data before the previous data ) to the VM 118 ( VM2 ) of the PE 110 ( PE3 ) . At the same time , the PE 110 ( PE1 ) directly outputs the operation result ( e . g . , the computation results of the fourth layer of the NN computation performed on the foremost data ) to the system memory 120 ., wherein the (intermediate) output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) is sent/routed directly (pipelined) to a second agent/computational node via NoC pipeline configuration according to the dataflow configured by the MCU control unit such that the crossbar is configured by the MCU to switch the data flow between different computational nodes (auxiliary memories).) the routing of the intermediate inference request results being performed independently of the host processor.  ([0042, 0048, Figure 7, Figure 11A, Figure 11B] The PE 110 ( PE2 ) directly writes the results of computations ( e . g . , the computation results of the third layer of the NN computation on the values recorded in the VMs 116 and 117 ( VMO and VM1 ) into the VM 116 ( VMO ) of the PE 110 ( PE3 ) ( corresponding to the auxiliary memory 115 located in the left portion of FIG . 7 ) through the pipeline network . The PE 110 ( PE3 ) directly writes the results of computations ( e . 
g . , the computation results of the fourth layer of the NN computation ) on the values recorded in the VMs 117 and 118 ( VM1 and VM2 ) into the system memory 120 through the aforesaid retrieval network ., FIG . 11A and FIG . 11B exemplarily illustrate data flow computations implemented by the single - port VMs 116 - 118 and the PES 110 connectable to the NoC structure . In this example , the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113 , given that the VM 116 ( VM ) has already stored the weight ( the DMA trans mission of the weight is the same as that depicted in FIG . 4B ) . …Therefore , the PE 110 directly outputs the operation result to the VMs 118 ( VM2 ) of the next PES 110 ( PE1 - PE3 ) or the system memory 120 . … At the same time , the PE 110 ( PE2 ) directly outputs the operation result ( e . g . , the computation results of the third layer of the NN computation performed on the data before the previous data ) to the VM 118 ( VM2 ) of the PE 110 ( PE3 ) . At the same time , the PE 110 ( PE1 ) directly outputs the operation result ( e . g . , the computation results of the fourth layer of the NN computation performed on the foremost data ) to the system memory 120 ., wherein the routing of the (intermediate) results from a computational node/accelerator (layer) to a subsequent computational node/accelerator (layer) is independent of the host processor/system memory since it is based on the direct pipelining of the (auxiliary) memories of the respective accelerators/computational nodes (for instance, Figure 7 shows this direct routing from one PE to another while bypassing the host processor (system memory/configuration module)).)
Claim 17 recites means-plus-function language in the form of “means for routing” and is being interpreted under 35 USC 112(f). The disclosure corresponding to this claim limitation is found in paragraph [0062] of the specification: “The system for accelerating machine learning includes means for routing intermediate inference request results directly between a first inference accelerator and a second inference accelerator. In one aspect, the routing means may be the switch device 302 configured to perform the functions recited. In another configuration, the aforementioned means may be any module or any apparatus configured to perform the functions recited by the aforementioned means.” The “means for routing” is therefore being interpreted as any apparatus (e.g., the MCU or crossbar of Li, as noted above) which performs routing, particularly using a switch or switching functionality.

In regards to claim 18, the rejection of claim 17 is incorporated and Li further teaches in which the final inference request result is generated by the second inference accelerator in response to the intermediate inference request results from the first inference accelerator. ([0042, 0048, Figure 7, Figure 11A, Figure 11B] The PE 110 ( PE2 ) directly writes the results of computations ( e . g . , the computation results of the third layer of the NN computation on the values recorded in the VMs 116 and 117 ( VMO and VM1 ) into the VM 116 ( VMO ) of the PE 110 ( PE3 ) ( corresponding to the auxiliary memory 115 located in the left portion of FIG . 7 ) through the pipeline network . The PE 110 ( PE3 ) directly writes the results of computations ( e . g . , the computation results of the fourth layer of the NN computation ) on the values recorded in the VMs 117 and 118 ( VM1 and VM2 ) into the system memory 120 through the aforesaid retrieval network ., FIG . 11A and FIG . 11B exemplarily illustrate data flow computations implemented by the single - port VMs 116 - 118 and the PES 110 connectable to the NoC structure . In this example , the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113 , given that the VM 116 ( VM ) has already stored the weight ( the DMA trans mission of the weight is the same as that depicted in FIG . 4B ) . … At the same time , the PE 110 ( PE1 ) directly outputs the operation result ( e . g . 
, the computation results of the fourth layer of the NN computation performed on the foremost data ) to the system memory 120 ., wherein the result/output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) that is generated at a (terminal) computational node (e.g., for a terminal layer) is based on the processing of intermediate results received from and previously generated by other computational nodes/accelerators such that the generation of this result occurs in response to the generation and reception of the intermediate results.)   

In regards to claim 19, the rejection of claim 18 is incorporated and Li further teaches in which the final inference request result are transmitted from the second inference accelerator directly to the host device.  ([0042, 0048, Figure 6A, Figure 7, Figure 11A, Figure 11B] The PE 110 ( PE2 ) directly writes the results of computations ( e . g . , the computation results of the third layer of the NN computation on the values recorded in the VMs 116 and 117 ( VMO and VM1 ) into the VM 116 ( VMO ) of the PE 110 ( PE3 ) ( corresponding to the auxiliary memory 115 located in the left portion of FIG . 7 ) through the pipeline network . The PE 110 ( PE3 ) directly writes the results of computations ( e . g . , the computation results of the fourth layer of the NN computation ) on the values recorded in the VMs 117 and 118 ( VM1 and VM2 ) into the system memory 120 through the aforesaid retrieval network ., FIG . 11A and FIG . 11B exemplarily illustrate data flow computations implemented by the single - port VMs 116 - 118 and the PES 110 connectable to the NoC structure . In this example , the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113 , given that the VM 116 ( VM ) has already stored the weight ( the DMA trans mission of the weight is the same as that depicted in FIG . 4B ) . … At the same time , the PE 110 ( PE1 ) directly outputs the operation result ( e . g . 
, the computation results of the fourth layer of the NN computation performed on the foremost data ) to the system memory 120 ., wherein the result/output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) is sent/output directly to the system memory of the host processor (i.e., the NoC encompasses a host processor which controls/configures the data flow operations, with the associated system memory storing results and feeding feature maps as input into the neural network).)

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 4, 14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Li, in view of Huang et al. (“GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism”, https://arxiv.org/pdf/1811.06965v1.pdf, arXiv:1811.06965v1 [cs.CV] 16 Nov 2018, pp. 1-11), hereinafter referred to as Huang.

In regards to claim 4, the rejection of claim 2 is incorporated and Li further teaches further comprising transmitting the final inference request result from the second inference accelerator to the host processor ….   ([0042, 0048, Figure 6A, Figure 7, Figure 11A, Figure 11B] The PE 110 ( PE2 ) directly writes the results of computations ( e . g . , the computation results of the third layer of the NN computation on the values recorded in the VMs 116 and 117 ( VMO and VM1 ) into the VM 116 ( VMO ) of the PE 110 ( PE3 ) ( corresponding to the auxiliary memory 115 located in the left portion of FIG . 7 ) through the pipeline network . The PE 110 ( PE3 ) directly writes the results of computations ( e . g . , the computation results of the fourth layer of the NN computation ) on the values recorded in the VMs 117 and 118 ( VM1 and VM2 ) into the system memory 120 through the aforesaid retrieval network ., FIG . 11A and FIG . 11B exemplarily illustrate data flow computations implemented by the single - port VMs 116 - 118 and the PES 110 connectable to the NoC structure . In this example , the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113 , given that the VM 116 ( VM ) has already stored the weight ( the DMA trans mission of the weight is the same as that depicted in FIG . 4B ) . … At the same time , the PE 110 ( PE1 ) directly outputs the operation result ( e . g . , the computation results of the fourth layer of the NN computation performed on the foremost data ) to the system memory 120 . 
, wherein the result/output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) is sent/output directly to the system memory of the host processor (i.e., the NoC encompasses a host processor which controls/configures the data flow operations, with the associated system memory storing results and feeding feature maps as input into the neural network).)
However, Li does not explicitly teach the limitation “via the first inference accelerator.” Although Li discloses that any of the computational nodes/PEs can, at any point in the computational flow, write to the system memory/host processor, and indicates that any of the computational nodes/PEs can communicate directly with one another via the pipelined architecture, Li does not explicitly disclose a scenario in which the results of a current computational node are sent backward to a computational node that previously generated the intermediate results used by the current computational node, before then being transferred to the host processor/system memory.
However, Huang, in the analogous environment of efficiently implementing neural networks, teaches further comprising transmitting the final inference request result from the second inference accelerator to the host processor via the first inference accelerator ([p. 2, Section 1, p. 2, Section 3.2, Figure 2] GPipe partitions a model across different accelerators and automatically splits a mini-batch of training examples into smaller micro-batches. By pipelining the execution across micro-batches, accelerators can operate in parallel. In addition, GPipe automatically recomputes the forward activations during the backpropagation to further reduce the memory consumption., During the forward pass (Figure 2c), the (k + 1)-th accelerator starts to compute F_{k+1,t} as soon as it finishes the (t − 1)-th micro-batch and receives inputs from F_{k,t}. At the same time, the k-th accelerator can start to compute F_{k,t+1}. Each accelerator repeats this process T times to finish the forward pass of the whole mini-batch. There are still up to O(K) idle time per accelerator, which refers to bubble overhead as depicted in Figure 2c. This bubble time is amortized by the number of micro-batches T. The last accelerator is also responsible for concatenating the outputs across micro-steps and computing the final loss. During the backward pass, gradients for each microbatch are computed based on the same model parameters as the forward pass. 
Gradients are applied to update model parameters across accelerators only at the end of each minibatch., wherein a (deep) neural network is implemented/hosted in a library-based framework/host processor (GPipe using the TensorFlow framework) comprising a set of accelerators/computational nodes allocated to perform neural network computations (inferences) split across multiple accelerators/devices in which each accelerator/device performs/hosts distinct computations associated with a portion of the neural network (e.g., a layer according to a configuration module) and such that, during training, the sequence of devices successively used during the forward propagation pass (with results from one device pipelined to the next device) leads to a final result (e.g., loss computation) that is fed backwards across the devices in the reversed sequence to generate a final result (gradients which are interpreted as corresponding to an update in a host processing system in response to a host-processor-supplied mini-batch partitioned by GPipe) such that the final result includes the final inference result of the final (second) accelerator in the forward pass that is conveyed via the final device (first accelerator) in the backward pass.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Li to incorporate the teachings of Huang to transmit the final inference request result from the second inference accelerator to a host processor of the computing device via the first inference accelerator. The modification would be obvious because one of ordinary skill would be motivated to achieve improved neural network training efficiency and memory usage while achieving performance consistent with the state of the art by exploiting a pipelined architecture between processing elements that makes use of both forward and backward propagation of results across those elements, especially for the training of the neural network (Huang, [Abstract, p. 9, Section 6, Figure 1, Figure 2, Table 3]).

In regards to claim 14, the rejection of claim 12 is incorporated and Li further teaches in which the final inference request result are transmitted from the second inference accelerator to the host processor …   ([0042, 0048, Figure 6A, Figure 7, Figure 11A, Figure 11B] The PE 110 ( PE2 ) directly writes the results of computations ( e . g . , the computation results of the third layer of the NN computation on the values recorded in the VMs 116 and 117 ( VMO and VM1 ) into the VM 116 ( VMO ) of the PE 110 ( PE3 ) ( corresponding to the auxiliary memory 115 located in the left portion of FIG . 7 ) through the pipeline network . The PE 110 ( PE3 ) directly writes the results of computations ( e . g . , the computation results of the fourth layer of the NN computation ) on the values recorded in the VMs 117 and 118 ( VM1 and VM2 ) into the system memory 120 through the aforesaid retrieval network ., FIG . 11A and FIG . 11B exemplarily illustrate data flow computations implemented by the single - port VMs 116 - 118 and the PES 110 connectable to the NoC structure . In this example , the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113 , given that the VM 116 ( VM ) has already stored the weight ( the DMA trans mission of the weight is the same as that depicted in FIG . 4B ) . … At the same time , the PE 110 ( PE1 ) directly outputs the operation result ( e . g . , the computation results of the fourth layer of the NN computation performed on the foremost data ) to the system memory 120 . 
, wherein the result/output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) is sent/output directly to the system memory of the host processor (i.e., the NoC encompasses a host processor which controls/configures the data flow operations, with the associated system memory storing results and feeding feature maps as input into the neural network).)
However, Li does not explicitly teach the limitation “via the first inference accelerator.” Although Li discloses that any of the computational nodes/PEs can, at any point in the computational flow, write to the system memory/host processor, and indicates that any of the computational nodes/PEs can communicate directly with one another via the pipelined architecture, Li does not explicitly disclose a scenario in which the results of a current computational node are sent backward to a computational node that previously generated the intermediate results used by the current computational node, before then being transferred to the host processor/system memory.
However, Huang, in the analogous environment of efficiently implementing neural networks, teaches in which the final inference request result are transmitted from the second inference accelerator to the host device via the first inference accelerator ([p. 2, Section 1, p. 2, Section 3.2, Figure 2] GPipe partitions a model across different accelerators and automatically splits a mini-batch of training examples into smaller micro-batches. By pipelining the execution across micro-batches, accelerators can operate in parallel. In addition, GPipe automatically recomputes the forward activations during the backpropagation to further reduce the memory consumption., During the forward pass (Figure 2c), the (k + 1)-th accelerator starts to compute F_{k+1,t} as soon as it finishes the (t − 1)-th micro-batch and receives inputs from F_{k,t}. At the same time, the k-th accelerator can start to compute F_{k,t+1}. Each accelerator repeats this process T times to finish the forward pass of the whole mini-batch. There are still up to O(K) idle time per accelerator, which refers to bubble overhead as depicted in Figure 2c. This bubble time is amortized by the number of micro-batches T. The last accelerator is also responsible for concatenating the outputs across micro-steps and computing the final loss. During the backward pass, gradients for each microbatch are computed based on the same model parameters as the forward pass. 
Gradients are applied to update model parameters across accelerators only at the end of each minibatch., wherein a (deep) neural network is implemented/hosted in a library-based framework/host processor (GPipe using the TensorFlow framework) comprising a set of accelerators/computational nodes allocated to perform neural network computations (inferences) split across multiple accelerators/devices, in which each accelerator/device performs/hosts distinct computations associated with a portion of the neural network (e.g., a layer according to a configuration module), and such that, during training, the sequence of devices successively used during the forward propagation pass (with results from one device pipelined to the next device) leads to a final result (e.g., loss computation) that is fed backward across the devices in the reversed sequence to generate a final result (gradients, which are interpreted as corresponding to an update in a host processing system in response to a host-processor-supplied mini-batch partitioned by GPipe), such that that final result includes the final inference result of the final (second) accelerator in the forward pass that is conveyed via the final device (first accelerator) in the backward pass.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Li to incorporate the teachings of Huang for the final inference request result to be transmitted from the second inference accelerator to the host device via the first inference accelerator. The modification would be obvious because one of ordinary skill would be motivated to achieve improved neural network training efficiency and memory usage while achieving performance consistent with the state of the art by exploiting a pipelined architecture between processing elements that makes use of both forward and backward propagation of results across those elements, especially for the training of the neural network (Huang, [Abstract, p. 9, Section 6, Figure 1, Figure 2, Table 3]).
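For illustration only, the pipelined forward pass and the reversed-order backward pass described in the cited passage of Huang can be sketched as follows. This is a minimal sketch and not code from Huang or Li: the function names and the timing model (each stage takes one clock step per micro-batch) are assumptions introduced here.

```python
def forward_schedule(num_stages, num_microbatches):
    """Map each clock step to the (stage, micro-batch) pairs active in it.

    Assumed timing model: stage k can start micro-batch t only after it has
    finished micro-batch t-1 and stage k-1 has finished micro-batch t, so the
    pair (k, t) runs at clock step k + t.
    """
    schedule = {}
    for k in range(num_stages):
        for t in range(num_microbatches):
            schedule.setdefault(k + t, []).append((k, t))
    return schedule


def backward_route(num_stages):
    """Order in which stages are visited in the backward pass (last to first)."""
    return list(range(num_stages - 1, -1, -1))


if __name__ == "__main__":
    # Two accelerators (K = 2) and three micro-batches (T = 3).
    for step, active in sorted(forward_schedule(2, 3).items()):
        print(step, active)
    print("backward order:", backward_route(2))
```

Under this assumed model the forward pass occupies K + T - 1 clock steps, and the backward pass visits the second accelerator before the first.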

In regards to claim 20, the rejection of claim 18 is incorporated and Li further teaches further comprising transmitting the final inference request result from the second inference accelerator to the host device ….   ([0042, 0048, Figure 6A, Figure 7, Figure 11A, Figure 11B] The PE 110 ( PE2 ) directly writes the results of computations ( e . g . , the computation results of the third layer of the NN computation on the values recorded in the VMs 116 and 117 ( VMO and VM1 ) into the VM 116 ( VMO ) of the PE 110 ( PE3 ) ( corresponding to the auxiliary memory 115 located in the left portion of FIG . 7 ) through the pipeline network . The PE 110 ( PE3 ) directly writes the results of computations ( e . g . , the computation results of the fourth layer of the NN computation ) on the values recorded in the VMs 117 and 118 ( VM1 and VM2 ) into the system memory 120 through the aforesaid retrieval network ., FIG . 11A and FIG . 11B exemplarily illustrate data flow computations implemented by the single - port VMs 116 - 118 and the PES 110 connectable to the NoC structure . In this example , the crossbar interface 112 may control the PEs 110 to directly perform the writing operation on the system memory 120 or the auxiliary memories 115 of other PEs 110 through the NoC interface 113 , given that the VM 116 ( VM ) has already stored the weight ( the DMA trans mission of the weight is the same as that depicted in FIG . 4B ) . … At the same time , the PE 110 ( PE1 ) directly outputs the operation result ( e . g . , the computation results of the fourth layer of the NN computation performed on the foremost data ) to the system memory 120 . 
, wherein the result/output from any particular accelerator/computational node (e.g., the result of computation at a particular layer) is sent/output directly to the system memory of the host processor (i.e., the NoC encompasses a host processor that controls/configures the data flow operations, with the associated system memory storing results and feeding feature maps as input into the neural network).)
However, Li does not explicitly teach "via the first inference accelerator."  Although Li discloses that any of the computational nodes/PEs can, at any point in the computational flow, write to the system memory/host processor and indicates that any of the computational nodes/PEs can communicate directly with one another via the pipelined architecture, Li does not explicitly disclose a scenario in which the results of a computational node are sent backward from a current computational node to a computational node that was previously used to generate the intermediate results used by the current computational node before then being transferred to the host processor/system memory.
However, Huang, in the analogous environment of efficiently implementing neural networks, teaches further comprising transmitting the final inference request result from the second inference accelerator to the host device via the first inference accelerator ([p. 2, Section 1, p. 2, Section 3.2, Figure 2] GPipe partitions a model across different accelerators and automatically splits a mini-batch of training examples into smaller micro-batches. By pipelining the execution across micro-batches, accelerators can operate in parallel. In addition, GPipe automatically recomputes the forward activations during the backpropagation to further reduce the memory consumption., During the forward pass (Figure 2c), the (k + 1)-th accelerator starts to compute F_{k+1,t} as soon as it finishes the (t − 1)-th micro-batch and receives inputs from F_{k,t}. At the same time, the k-th accelerator can start to compute F_{k,t+1}. Each accelerator repeats this process T times to finish the forward pass of the whole mini-batch. There are still up to O(K) idle time per accelerator, which refers to bubble overhead as depicted in Figure 2c. This bubble time is amortized by the number of micro-batches T. The last accelerator is also responsible for concatenating the outputs across micro-steps and computing the final loss. During the backward pass, gradients for each microbatch are computed based on the same model parameters as the forward pass. 
Gradients are applied to update model parameters across accelerators only at the end of each minibatch., wherein a (deep) neural network is implemented/hosted in a library-based framework/host processor (GPipe using the TensorFlow framework) comprising a set of accelerators/computational nodes allocated to perform neural network computations (inferences) split across multiple accelerators/devices, in which each accelerator/device performs/hosts distinct computations associated with a portion of the neural network (e.g., a layer according to a configuration module), and such that, during training, the sequence of devices successively used during the forward propagation pass (with results from one device pipelined to the next device) leads to a final result (e.g., loss computation) that is fed backward across the devices in the reversed sequence to generate a final result (gradients, which are interpreted as corresponding to an update in a host processing system in response to a host-processor-supplied mini-batch partitioned by GPipe), such that that final result includes the final inference result of the final (second) accelerator in the forward pass that is conveyed via the final device (first accelerator) in the backward pass.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Li to incorporate the teachings of Huang to transmit the final inference request result from the second inference accelerator to the host device via the first inference accelerator. The modification would be obvious because one of ordinary skill would be motivated to achieve improved neural network training efficiency and memory usage while achieving performance consistent with the state of the art by exploiting a pipelined architecture between processing elements that makes use of both forward and backward propagation of results across those elements, especially for the training of the neural network (Huang, [Abstract, p. 9, Section 6, Figure 1, Figure 2, Table 3]).


Claims 6 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Li, in view of Nicol et al. (US20180181503, Published 28 June 2018), hereinafter referred to as Nicol.

In regards to claim 6, the rejection of claim 1 is incorporated, and Li does not teach further comprising implementing a request queue for the second inference accelerator in a memory of the first inference accelerator.  Although Li teaches that the request queue is in the memory of the host processor (see claim 7), with implementation/synchronization of particular tasks across the computational nodes determined according to the MCU of the host processor, Li does not explicitly disclose that a particular computational node has a memory that holds a request queue for another computational node.
However, Nicol, in the analogous environment of designing efficient implementations of neural networks, teaches further comprising implementing a request queue for the second inference accelerator in a memory of the first inference accelerator ([0036, 0051, 0062, 0063, Figure 7] In embodiments , each processing element may store two addresses for each FIFO that its process agent accesses . One address is a starting address or “ head ” pointer . Another address is an ending address or " tail ” pointer…. When a FIFO is inserted between a first processing element and a second processing element , both processing elements can synchronize the starting address registers and ending address registers of the FIFO …. The sending processing element may send data to the receiving processing element to indicate the starting address and ending address . As the sending processing element puts data on the FIFO , it may adjust the starting address register . As the receiving processing element removes data from the FIFO , it may adjust the ending address register., In embodiments , as agents execute on the processing elements and place data in a FIFO or remove data from a FIFO , a corresponding read and write pointer or register is updated to refer to the next location to be read to or written from . In embodiments , as agents execute on the processing elements and place data on a FIFO or remove data from a FIFO , the head and / or tail pointer / register is updated to refer to the next location to be read from or written to., FIG . 7 shows an example 700 of scheduled sections relating to an agent . A FIFO 720 serves as an input FIFO for a process agent 710 . Data from FIFO 720 is read into a local buffer 741 of a FIFO controlled switching element 740 . A circular buffer 743 may contain instructions that are executed by a switching element ( SE ) , and may modify data based on one or more logical operations , including , but not limited to , XOR , OR , AND , NAND , and / or NOR . 
The plurality of processing elements can be controlled by circular buffers . The modified data may be passed to a circular buffer 732 under static scheduled processing 730 . Thus , the scheduling of circular buffer 732 may be performed at compile time. The circular buffer 732 may provide data to a FIFO controlled switching element 742. Circular buffer 745 may rotate to provide a plurality of instructions / operations to modify and / or transfer data to a data buffer 747 , which is then transferred to external FIFO 722., A signaling component can signal process agents executing on neighboring processing elements about conditions of a FIFO . For example , a process agent can issue a FIRE signal to another process agent operating on another processing element when new data is available in a FIFO that was previously empty ., wherein a neural network (e.g., CNN) is implemented in a framework comprising a set of accelerators/processing elements/agents allocated to perform neural network computations (inferences) according to a directed acyclic graph and dataflow graph (in which particular computational points are mapped onto the set of processing elements/agents), such that each processing element/agent is directly communicatively coupled with other processing elements/agents so that the neural network computations/inferences are split between those elements, and such that any particular (first) accelerator/agent/processing element enables/facilitates the appropriately scheduled/queued processing request in another (second) accelerator/agent/processing element either through a circular buffer in the memory of the first agent/accelerator, which provides scheduled data to a FIFO, via a controlled switching element, in response to which the second agent/accelerator may then process that data, or through signaling operations between the agents to indicate that the queued task for the second agent should be executed (fired) because the data is now available for that agent to process.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Li to incorporate the teachings of Nicol to implement a request queue for the second inference accelerator in a memory of the first inference accelerator. The modification would be obvious because one of ordinary skill would be motivated to achieve improved flexibility, efficiency, and throughput in implementing applications that require high performance computing such as neural networks through a reconfigurable pipeline-based multi-core processing topology in which the synchronization of tasks across the array topology is achieved through local interactions between processing agents (Nicol, [0009, 0010, 0011, 0012]).
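For illustration only, the head/tail pointer bookkeeping and the FIRE signaling described in the cited passages of Nicol can be sketched as a small ring buffer held in the memory of the sending (first) processing element. This is a minimal sketch and not code from Nicol or Li: the class name `RequestFifo`, its methods, and the convention that `put` returns whether a FIRE signal would be issued are assumptions introduced here.

```python
class RequestFifo:
    """Fixed-size FIFO tracked by a head (write) and a tail (read) pointer."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0   # next slot the sender writes (assumed "starting address")
        self.tail = 0   # next slot the receiver reads (assumed "ending address")
        self.count = 0  # number of queued requests

    def put(self, request):
        """Sender side: enqueue a request and advance the head pointer.

        Returns True when the FIFO was empty beforehand, i.e. the point at
        which a sender would signal FIRE to the receiving agent.
        """
        if self.count == len(self.slots):
            raise OverflowError("FIFO full")
        was_empty = self.count == 0
        self.slots[self.head] = request
        self.head = (self.head + 1) % len(self.slots)
        self.count += 1
        return was_empty

    def get(self):
        """Receiver side: dequeue the oldest request and advance the tail pointer."""
        if self.count == 0:
            raise IndexError("FIFO empty")
        request = self.slots[self.tail]
        self.slots[self.tail] = None
        self.tail = (self.tail + 1) % len(self.slots)
        self.count -= 1
        return request
```

In this sketch the queue object lives with the sender, and the receiver only reads from it, which mirrors the claimed arrangement of a request queue for the second accelerator held in a memory of the first.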

In regards to claim 10, the rejection of claim 1 is incorporated and Li further teaches  in which routing the intermediate inference request results comprises writing, by the first inference accelerator, across a switch device … a global synchronization manager (GSM) … after a direct memory access (DMA) transfer of the intermediate inference request results from the first inference accelerator to the second inference accelerator.   ([0028, 0040, 0048, Figure 11A, Figure 11B] Note that each PE 110 may , through the crossbar interface 112 , determine which of the VMs 116 , 117 , and 118 may be configured for storing the weight , for being read or written by the corresponding PE 110 , and for data transmissions with other computation nodes 100 ( including other PES 110 , their auxiliary memories 115 , and the system memory 120 ) in the NoC structure , whereby the functions of the VMs 116 , 117 , and 118 may be changed according to actual requirements for operation tasks., The MCU 133 may control the DMA engine 131 to process the data transmissions between the system memory 120 and the auxiliary memories 115 and the data transmissions between the auxiliary memories 115 of two adjacent computation nodes . Here , the data transmissions are DMA transmissions . Four PES 110 and the auxiliary memories 115 connected thereto four computation nodes 100 , and the configuration module 133 establishes a phase sequence for the computation nodes 100 according to the NN computation and instructs each of the computation nodes 100 to transmit data to another of the computation nodes 100 according to the phase sequence ., The two kinds of operation tasks shown in FIGS . 
11A and 11B are repeatedly switched and performed until all the operation tasks of the NN computation are completed ., wherein the result/output from any particular accelerator/computational node (including intermediate outputs) is generated at a computational node and sent/routed to either system memory (host processor) or to another computational node according to the dataflow configured by the MCU control unit, such that the crossbar is configured by the MCU to switch the data flow between different computational nodes (auxiliary memories) or between any computational node and system memory/host processor, such that all of these data transfers are performed by DMA, and such that the MCU performs the function of a global synchronization manager.)
However, Li does not explicitly teach … to … of the second inference accelerator ….  In other words, although Li teaches a global synchronization manager (GSM) in the form of the MCU, Li does not disclose that the GSM is a part of a (second) computational node.
However, Nicol, in the analogous environment of designing efficient implementations of neural networks, teaches in which routing the intermediate inference request results comprises writing, by the first inference accelerator, across a switch device to a global synchronization manager (GSM) of the second inference accelerator after a direct memory access (DMA) transfer of the intermediate inference request results from the first inference accelerator to the second inference accelerator ([0046, 0079, 0081, Figure 7, Figure 10] The flow 200 continues with issuing a first done signal 220 . This can occur when a process agent empties a FIFO by reading its contents . Once that happens , the process agent may issue a first done signal to the upstream agent …. . In other embodiments , the first done signal may be an instruction passed directly to a circular buffer of an upstream processing element. In a similar manner , the flow 200 continues with issuing a second done signal 230 . This can occur when a downstream process agent empties a FIFO by reading its contents . Once that happens , the downstream process agent may issue a second done signal to the process agent., For many applications , the reconfigurable fabric can be a DMA slave , which enables a host processor to gain direct access to the instruction and data RAMs ( and registers ) that are located within the quads in the cluster . DMA transfers are initiated by the host processor on a system bus . Several DMA paths can propagate through the fabric in parallel . The DMA paths generally start or finish at a streaming interface to the processor system bus . DMA paths may be horizontal , vertical , or a combination ( as determined by a router ) . To facilitate high bandwidth DMA transfers , several DMA paths can enter the fabric at different times , providing both spatial and temporal multiplexing of DMA channels ., FIG . 10 shows a block diagram of a circular buffer . 
The circular buffer 1010 can include a switching element 1012 corresponding to the circular buffer . The circular buffer and the corresponding switching element can be used in part for dynamic reconfiguration with partially resident agents . Using the circular buffer 1010 and the corresponding switching element 1012 , data can be obtained from a first switching unit , where the first switching unit can be controlled by a first circular buffer . Data can be sent to a second switching element , where the second switching element can be controlled by a second circular buffer . The obtaining data from the first switching element and the sending data to the second switching element can include a direct memory access ( DMA ) ., wherein a set of accelerators/processing elements/agents are allocated to perform neural network computations (inferences) such that each processing element/agent is directly communicatively coupled with other processing elements/agents so that any particular (first) accelerator/agent/processing element uses a switch (switch device – Figure 7) to route its processed results/inference to another (second) accelerator/agent (according to a DAG-based dataflow) to a circular buffer (GSM) of the other (second) accelerator agent with, in particular, a signal (e.g., “done”) also being transmitted from the particular (first) accelerator/agent to the other (second) accelerator agent after the DMA transference of the (intermediate) result data (but, it is noted, this signaling can go in either direction), and wherein, as previously noted, the circular buffer is a GSM because it enables/facilitates the appropriately scheduled/queued processing request in the other (second) accelerator/agent/processing element.)  
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Li to incorporate the teachings of Nicol for the routing of the intermediate inference request results to comprise writing, by the first inference accelerator, across a switch device to a global synchronization manager (GSM) of the second inference accelerator after a direct memory access (DMA) transfer of the intermediate inference request results from the first inference accelerator to the second inference accelerator. The modification would be obvious because one of ordinary skill would be motivated to achieve improved flexibility, efficiency, and throughput in implementing applications that require high performance computing such as neural networks through a reconfigurable pipeline-based multi-core processing topology in which the synchronization of tasks across the array topology is achieved through local interactions between processing agents including circular memory buffers and signaling modalities (Nicol, [Abstract, 0009, 0010, 0011, 0012, 0046]).
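For illustration only, the ordering recited in the limitation (a synchronization write to the second accelerator that follows the DMA transfer of the intermediate results) can be sketched as below. This is a minimal sketch and not code from Nicol, Li, or the application: the `Accelerator` class, the `sync_flag` field standing in for a synchronization register, and both function names are assumptions introduced here.

```python
class Accelerator:
    """Toy model of an accelerator with a receive buffer and a sync flag."""

    def __init__(self, name):
        self.name = name
        self.buffer = None      # destination of the (simulated) DMA transfer
        self.sync_flag = False  # assumed stand-in for a synchronization register


def dma_then_signal(intermediate_result, dst):
    """Copy the data first, then write the flag, so the flag implies valid data."""
    dst.buffer = list(intermediate_result)  # step 1: simulated DMA transfer
    dst.sync_flag = True                    # step 2: synchronization write


def try_consume(acc):
    """Receiver side: use the buffer only once the flag has been written."""
    if not acc.sync_flag:
        return None
    acc.sync_flag = False
    return acc.buffer


if __name__ == "__main__":
    second = Accelerator("second")
    print(try_consume(second))        # nothing to consume before the signal
    dma_then_signal([0.1, 0.9], second)
    print(try_consume(second))
```

Writing the flag only after the copy completes is what lets the receiver treat the flag as proof that the buffer contents are complete.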

Claims 8, 9, 15, and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Li, in view of Nicol, and in further view of Huang.

In regards to claim 8, the rejection of claim 1 is incorporated and Li further teaches in which generating the final inference request result further comprises writing, by the second inference accelerator, across a switch device … a global synchronization manager (GSM) … after a direct memory access (DMA) transfer of the … inference request result from the second inference accelerator to the first inference accelerator.  ([0028, 0040, 0048, Figure 11A, Figure 11B] Note that each PE 110 may , through the crossbar interface 112 , determine which of the VMs 116 , 117 , and 118 may be configured for storing the weight , for being read or written by the corresponding PE 110 , and for data transmissions with other computation nodes 100 ( including other PES 110 , their auxiliary memories 115 , and the system memory 120 ) in the NoC structure , whereby the functions of the VMs 116 , 117 , and 118 may be changed according to actual requirements for operation tasks., The MCU 133 may control the DMA engine 131 to process the data transmissions between the system memory 120 and the auxiliary memories 115 and the data transmissions between the auxiliary memories 115 of two adjacent computation nodes . Here , the data transmissions are DMA transmissions . Four PES 110 and the auxiliary memories 115 connected thereto four computation nodes 100 , and the configuration module 133 establishes a phase sequence for the computation nodes 100 according to the NN computation and instructs each of the computation nodes 100 to transmit data to another of the computation nodes 100 according to the phase sequence ., The two kinds of operation tasks shown in FIGS . 
11A and 11B are repeatedly switched and performed until all the operation tasks of the NN computation are completed ., wherein the result/output from any particular accelerator/computational node is generated at a computational node and sent/output to either system memory (host processor) or to another computational node according to the dataflow configured by the MCU control unit, such that the crossbar is configured by the MCU to switch the data flow between different computational nodes (auxiliary memories) or between any computational node and system memory/host processor, such that all of these data transfers are performed by DMA, and such that the MCU performs the function of a global synchronization manager, and wherein, as previously noted, any output written to the system memory is being interpreted as being a result, especially if that output is the result of a terminal layer (e.g., the 4th layer in a forward propagation).)
However, Li does not explicitly teach … to … of the first inference accelerator … final ….  In other words, although Li teaches a global synchronization manager (GSM) in the form of the MCU, Li does not disclose that the GSM is a part of a (first) computational node. Although Li discloses that any of the computational nodes/PEs can, at any point in the computational flow, write to the system memory/host processor and indicates that any of the computational nodes/PEs can communicate directly with one another via the pipelined architecture, Li does not explicitly disclose a scenario in which the results of a computational node are sent backward from a current computational node to a computational node that was previously used to generate the intermediate results used by the current computational node before then being transferred to the host processor/system memory. 
However, Nicol, in the analogous environment of designing efficient implementations of neural networks, teaches in which generating the … inference request result further comprises writing, by the second inference accelerator, across a switch device to a global synchronization manager (GSM) of the first inference accelerator after a direct memory access (DMA) transfer of the … inference request result from the second inference accelerator to the first inference accelerator ([0046, 0079, 0081, Figure 7, Figure 10] The flow 200 continues with issuing a first done signal 220 . This can occur when a process agent empties a FIFO by reading its contents . Once that happens , the process agent may issue a first done signal to the upstream agent …. . In other embodiments , the first done signal may be an instruction passed directly to a circular buffer of an upstream processing element. In a similar manner , the flow 200 continues with issuing a second done signal 230 . This can occur when a downstream process agent empties a FIFO by reading its contents . Once that happens , the downstream process agent may issue a second done signal to the process agent ., For many applications , the reconfigurable fabric can be a DMA slave , which enables a host processor to gain direct access to the instruction and data RAMs ( and registers ) that are located within the quads in the cluster . DMA transfers are initiated by the host processor on a system bus . Several DMA paths can propagate through the fabric in parallel . The DMA paths generally start or finish at a streaming interface to the processor system bus . DMA paths may be horizontal , vertical , or a combination ( as determined by a router ) . To facilitate high bandwidth DMA transfers , several DMA paths can enter the fabric at different times , providing both spatial and temporal multiplexing of DMA channels ., FIG . 10 shows a block diagram of a circular buffer . 
The circular buffer 1010 can include a switching element 1012 corresponding to the circular buffer . The circular buffer and the corresponding switching element can be used in part for dynamic reconfiguration with partially resident agents . Using the circular buffer 1010 and the corresponding switching element 1012 , data can be obtained from a first switching unit , where the first switching unit can be controlled by a first circular buffer . Data can be sent to a second switching element , where the second switching element can be controlled by a second circular buffer . The obtaining data from the first switching element and the sending data to the second switching element can include a direct memory access ( DMA ) ., wherein each processing element/agent is directly communicatively coupled with other processing elements/agents so that any particular (second) accelerator/agent/processing element uses a switch (switch device – Figure 7) to route its processed results/inference to another (first) accelerator/agent (according to a DAG-based dataflow) to a circular buffer (GSM) of the other (first) accelerator agent with, in particular, a signal (e.g., “done”) also being transmitted from the particular (second) accelerator/agent to the other (first) accelerator agent after the DMA transference of the result data (but, it is noted, this signaling can go in either direction), wherein this process continues over the reconfigurable fabric until the completion of the set of (host determined) tasks, and wherein the circular buffer is a GSM because it enables/facilitates the appropriately scheduled/queued processing request in the other (first) accelerator/agent/processing element.)  
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Li to incorporate the teachings of Nicol to generate the inference request result by writing, by the second inference accelerator, across a switch device to a global synchronization manager (GSM) of the first inference accelerator after a direct memory access (DMA) transfer of the inference request result from the second inference accelerator to the first inference accelerator. The modification would be obvious because one of ordinary skill would be motivated to achieve improved flexibility, efficiency, and throughput in implementing applications that require high performance computing such as neural networks through a reconfigurable pipeline-based multi-core processing topology in which the synchronization of tasks across the array topology is achieved through local interactions between processing agents including circular memory buffers and signaling modalities (Nicol, [Abstract, 0009, 0010, 0011, 0012, 0046]).
However, Li and Nicol do not explicitly teach … final … final.  Although Li discloses that any of the computational nodes/PEs can, at any point in the computational flow, write to the system memory/host processor and indicates that any of the computational nodes/PEs can communicate directly with one another via the pipelined architecture, Li does not explicitly disclose a scenario in which the results of a computational node are sent backward from a current computational node to a computational node that was previously used to generate the intermediate results used by the current computational node before then being transferred to the host processor/system memory. Although Nicol discloses a reconfigurable topology with flexible, switchable data flows between processors, Nicol likewise does not specifically disclose this backward data flow.
However, Huang, in the analogous environment of efficiently implementing neural networks, teaches in which generating the final inference request result further comprises writing, by the second inference accelerator, … after a direct memory access (DMA) transfer of the final inference request result from the second inference accelerator to the first inference accelerator ([p. 2, Section 1, p. 2, Section 3.2, Figure 2] GPipe partitions a model across different accelerators and automatically splits a mini-batch of training examples into smaller micro-batches. By pipelining the execution across micro-batches, accelerators can operate in parallel. In addition, GPipe automatically recomputes the forward activations during the backpropagation to further reduce the memory consumption., During the forward pass (Figure 2c), the (k + 1)-th accelerator starts to compute F_{k+1,t} as soon as it finishes the (t − 1)-th micro-batch and receives inputs from F_{k,t}. At the same time, the k-th accelerator can start to compute F_{k,t+1}. Each accelerator repeats this process T times to finish the forward pass of the whole mini-batch. There are still up to O(K) idle time per accelerator, which refers to bubble overhead as depicted in Figure 2c. This bubble time is amortized by the number of micro-batches T. The last accelerator is also responsible for concatenating the outputs across micro-steps and computing the final loss. During the backward pass, gradients for each microbatch are computed based on the same model parameters as the forward pass.
Gradients are applied to update model parameters across accelerators only at the end of each minibatch., wherein a set of accelerators/computational nodes is allocated to perform neural network computations (inferences) split across multiple accelerators/devices in which each accelerator/device performs/hosts distinct computations associated with a portion of the neural network (e.g., a layer according to a configuration module) and such that, during training, the sequence of devices successively used during the forward propagation pass (with results from one device pipelined to the next device) leads to a final result (e.g., loss computation) that is fed backwards across the devices in the reversed sequence to generate a final result (gradients which are interpreted as corresponding to an update in a host processing system in response to a host-processor supplied mini-batch partitioned by GPipe) such that the final result includes the final inference result of the final (second) accelerator in the forward pass that is conferred via the final device (first accelerator) in the backward pass according to the pipelined architecture (DMA transference).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Li and Nicol to incorporate the teachings of Huang to generate the final inference request result by writing, by the second inference accelerator, across a switch device to a global synchronization manager (GSM) of the first inference accelerator after a direct memory access (DMA) transfer of the final inference request result from the second inference accelerator to the first inference accelerator. The modification would be obvious because one of ordinary skill would be motivated to achieve improved neural network training efficiency and memory usage while achieving performance consistent with the state of the art by exploiting a pipelined architecture between processing elements that makes use of both forward and backward propagation of results across those elements, especially for the training of the neural network (Huang, [Abstract, p. 9, Section 6, Figure 1, Figure 2, Table 3]).
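For illustration only, the forward-pass pipelining quoted from Huang above can be sketched as follows. This is a minimal sketch and not Huang's implementation: it assumes that every forward cell F_{k,t} takes exactly one time step and that accelerator k may start micro-batch t only after accelerator k − 1 has finished it. The names forward_schedule and bubble_fraction are hypothetical and do not appear in any cited reference.

```python
from fractions import Fraction


def forward_schedule(num_accelerators, num_microbatches):
    """Map each time step to the (accelerator k, micro-batch t) cells active in it.

    Assumption (not from the cited reference): every cell takes one time
    step, so F[k, t] runs at step k + t with 0-indexed k and t.
    """
    schedule = {}
    for k in range(num_accelerators):
        for t in range(num_microbatches):
            schedule.setdefault(k + t, []).append((k, t))
    return schedule


def bubble_fraction(num_accelerators, num_microbatches):
    """Fraction of forward-pass steps during which one accelerator is idle."""
    total_steps = num_microbatches + num_accelerators - 1
    idle_steps = num_accelerators - 1  # grows with K, independent of T
    return Fraction(idle_steps, total_steps)


if __name__ == "__main__":
    schedule = forward_schedule(4, 4)
    for step in sorted(schedule):
        print(step, schedule[step])
    # More micro-batches (larger T) shrink the idle share per accelerator.
    print(bubble_fraction(4, 4), bubble_fraction(4, 32))
```

Under these assumptions each accelerator is idle for K − 1 of the T + K − 1 forward steps, which is one way to read the statement that the bubble time is amortized by the number of micro-batches T.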

In regards to claim 9, the rejection of claim 8 is incorporated and Li does not further teach in which the DMA transfer of the final inference request result is based on a request queue element in a request queue of the second inference accelerator stored in a memory of the first inference accelerator. Although Li teaches the DMA transference of all computational node results (as previously noted) and that the request queue is in the memory of the host processor (also as previously noted) with implementation/synchronization of particular tasks across the computational nodes determined according to the MCU of the host processor, he does not explicitly disclose that a particular computational node has a memory that stores a request queue of another computational node.
However, Nicol, in the analogous environment of designing efficient implementations of neural networks, teaches in which the DMA transfer of the … inference request result is based on a request queue element in a request queue of the second inference accelerator stored in a memory of the first inference accelerator ([0074, 0075, 0087, 0094, Figure 9, Figure 10, Figure 11] FIG. 9 is an example cluster 900 for coarse-grained reconfigurable processing. Data can be obtained from a first switching unit, where the first switching unit can be controlled by a first circular buffer. Data can be sent to a second switching element, where the second switching element can be controlled by a second circular buffer. The obtaining data from the first switching element and the sending data to the second switching element can include a direct memory access (DMA). The cluster 900 comprises a circular buffer 902. The circular buffer 902 can be referred to as a main circular buffer or a switch-instruction circular buffer. In some embodiments, the cluster 900 comprises additional circular buffers corresponding to processing elements within the cluster. The additional circular buffers can be referred to as processor instruction circular buffers. … In embodiments, the circular buffer 902 controls the passing of data to the quad of processing elements through switching elements., The circular buffer 902 can contain switch instructions that implement configurable connections. For example, an instruction effectively connects the west input 910 with the north output 914 and the east output 918 and this routing is accomplished via bus 930., In the example 1000 shown, the circular buffer 1010 rotates instructions in each pipeline stage into switching element 1012 via a forward data path 1022, and also back to a pipeline stage 0 1030 via a feedback data path 1020., As can be seen in FIG. 11, different circular buffers can have different instruction sets within them.
For example, circular buffer 1110 contains a MOV instruction. Circular buffer 1112 contains a SKIP instruction. Circular buffer 1114 contains a SLEEP instruction and an ANDI instruction. Circular buffer 1116 contains an AND instruction, a MOVE instruction, an ANDI instruction, and an ADD instruction., wherein the interactions between any pair of accelerators/agents are determined by at least one circular buffer (a main circular buffer controlling the flow within a cluster, and also a circular buffer at each agent) such that data can be configured to flow (via DMA transfer) between any set of agents (also interpreted as in any direction in the reconfigurable mesh topology) according to the instructions in the circular buffer (which drive switching operations) such that the DMA transference from a second agent to a first agent is based (in part) on the instructions in the memory of the circular buffer of the first agent that correspond to (statically) queuing of information to be obtained from the second agent (i.e., the DMA transfer is determined by circular buffer instructions in the receiving agent corresponding to the previously configured queued set of instructions such as shown in Figure 11 with successive queued pipelined data flow patterns shown in Figure 10 to enable a synchronization among processing elements and the efficient sequential progression through the DAG-based data flow).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Li to incorporate the teachings of Nicol for the DMA transfer of the inference request result to be based on a request queue element in a request queue of the second inference accelerator stored in a memory of the first inference accelerator. The modification would be obvious because one of ordinary skill would be motivated to achieve improved flexibility, efficiency, and throughput in implementing applications that require high performance computing such as neural networks through a reconfigurable pipeline-based multi-core processing topology in which the synchronization of tasks across the array topology is achieved through local interactions between processing agents including statically configured circular memory buffers to queue those reconfigurable data flow instructions (Nicol, [0009, 0010, 0011, 0012]).
However, Li and Nicol do not teach final. Although Li discloses that any of the computational nodes/PEs can at any point in the computational flow write to the system memory/host processor and indicates that any of the computational nodes/PEs can communicate directly with one another via the pipelined architecture, he does not explicitly disclose a scenario in which the results of a computational node are sent backward from a current computational node to a computational node that was previously used to generate the intermediate results used by the current computational node before then being transferred to the host processor/system memory. Although Nicol also discloses a reconfigurable topology with flexible switchable data flows from processors (including a feedback path), he also does not specifically disclose this backward data flow.
However, Huang, in the analogous environment of efficiently implementing neural networks, teaches in which the DMA transfer of the final inference request … the second inference accelerator … ([p. 2, Section 1, p. 2, Section 3.2, Figure 2] GPipe partitions a model across different accelerators and automatically splits a mini-batch of training examples into smaller micro-batches. By pipelining the execution across micro-batches, accelerators can operate in parallel. In addition, GPipe automatically recomputes the forward activations during the backpropagation to further reduce the memory consumption., During the forward pass (Figure 2c), the (k + 1)-th accelerator starts to compute F_{k+1,t} as soon as it finishes the (t − 1)-th micro-batch and receives inputs from F_{k,t}. At the same time, the k-th accelerator can start to compute F_{k,t+1}. Each accelerator repeats this process T times to finish the forward pass of the whole mini-batch. There are still up to O(K) idle time per accelerator, which refers to bubble overhead as depicted in Figure 2c. This bubble time is amortized by the number of micro-batches T. The last accelerator is also responsible for concatenating the outputs across micro-steps and computing the final loss. During the backward pass, gradients for each microbatch are computed based on the same model parameters as the forward pass.
Gradients are applied to update model parameters across accelerators only at the end of each minibatch., wherein a set of accelerators/computational nodes allocated to perform neural network computations (inferences) are split across multiple accelerators/devices in which each accelerator/device performs/hosts distinct computations associated with a portion of the neural network (e.g., a layer according to a configuration module) and such that, during training, the sequence of devices successively used during the forward propagation pass (with results from one device pipelined to the next device) leads to a final result (e.g., loss computation) that is fed backwards across the devices in the reversed sequence to generate a final result (gradients which are interpreted as corresponding to an update in a host processing system in response to a host-processor supplied mini-batch partitioned by GPipe) such that the final result includes the final inference result of the final (second) accelerator in the forward pass that is conferred via the final device (first accelerator) in the backward pass according to the pipelined architecture (DMA transference).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Li and Nicol to incorporate the teachings of Huang such that the DMA transfer of the final inference request result is based on a request queue element in a request queue of the second inference accelerator stored in a memory of the first inference accelerator. The modification would be obvious because one of ordinary skill would be motivated to achieve improved neural network training efficiency and memory usage while achieving performance consistent with the state of the art by exploiting a pipelined architecture between processing elements that makes use of both forward and backward propagation of results across those elements, especially for the training of the neural network (Huang, [Abstract, p. 9, Section 6, Figure 1, Figure 2, Table 3]).
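For illustration only, the statically loaded, rotating instruction buffer described in the Nicol passages quoted above for claim 9 can be pictured with the following minimal sketch. It is not Nicol's hardware or any actual API: the class CircularInstructionBuffer and the helper run are hypothetical names, and the instruction mnemonics are simply the ones listed for circular buffer 1116 in the quoted description of FIG. 11.

```python
from collections import deque


class CircularInstructionBuffer:
    """Hypothetical model of a statically configured ring of instructions."""

    def __init__(self, instructions):
        if not instructions:
            raise ValueError("buffer needs at least one instruction")
        self._ring = deque(instructions)

    def step(self):
        """Return the instruction at the head, then rotate to the next slot."""
        current = self._ring[0]
        self._ring.rotate(-1)  # head moves to the tail; order is preserved
        return current


def run(buffer, cycles):
    """Collect the instructions issued over a number of cycles."""
    return [buffer.step() for _ in range(cycles)]


if __name__ == "__main__":
    buf = CircularInstructionBuffer(["AND", "MOVE", "ANDI", "ADD"])
    print(run(buf, 6))
```

The point of the sketch is only that the order of issued instructions is fixed when the buffer is loaded and then repeats, which is the sense in which the examiner reads the buffer contents as a statically queued set of instructions.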

In regards to claim 15, the rejection of claim 11 is incorporated and Li further teaches … a global synchronization manager (GSM) to notify the first inference accelerator of a direct memory access (DMA) transfer of the … inference request result from the second inference accelerator to the first inference accelerator ([0028, 0040, 0048, Figure 11A, Figure 11B] Note that each PE 110 may, through the crossbar interface 112, determine which of the VMs 116, 117, and 118 may be configured for storing the weight, for being read or written by the corresponding PE 110, and for data transmissions with other computation nodes 100 (including other PEs 110, their auxiliary memories 115, and the system memory 120) in the NoC structure, whereby the functions of the VMs 116, 117, and 118 may be changed according to actual requirements for operation tasks., The MCU 133 may control the DMA engine 131 to process the data transmissions between the system memory 120 and the auxiliary memories 115 and the data transmissions between the auxiliary memories 115 of two adjacent computation nodes. Here, the data transmissions are DMA transmissions. Four PEs 110 and the auxiliary memories 115 connected thereto four computation nodes 100, and the configuration module 133 establishes a phase sequence for the computation nodes 100 according to the NN computation and instructs each of the computation nodes 100 to transmit data to another of the computation nodes 100 according to the phase sequence., The two kinds of operation tasks shown in FIGS.
11A and 11B are repeatedly switched and performed until all the operation tasks of the NN computation are completed., wherein the result/output from any particular accelerator/computational node is generated at a computational node and sent/output to either system memory (host processor) or to another computational node according to the dataflow configured by the MCU control unit, such that all of these data transfers are performed by DMA, and such that the MCU performs the function of a global synchronization manager, and wherein, as previously noted, any output written to the system memory is being interpreted as being a result, especially if that output is the result of a terminal layer (e.g., the 4th layer in a forward propagation).)
However, Li does not explicitly teach first inference accelerator comprises … final. In other words, although Li teaches a global synchronization manager (GSM) in the form of the MCU, he does not disclose that the GSM is a part of a (first) computational node. Although Li discloses that any of the computational nodes/PEs can at any point in the computational flow write to the system memory/host processor and indicates that any of the computational nodes/PEs can communicate directly with one another via the pipelined architecture, he does not explicitly disclose a scenario in which the results of a computational node are sent backward from a current computational node to a computational node that was previously used to generate the intermediate results used by the current computational node before then being transferred to the host processor/system memory.
However, Nicol, in the analogous environment of designing efficient implementations of neural networks, teaches in which the first inference accelerator comprises a global synchronization manager (GSM) to notify the first inference accelerator of a direct memory access (DMA) transfer of the … inference request result from the second inference accelerator to the first inference accelerator ([0046, 0079, 0081, Figure 7, Figure 10] The flow 200 continues with issuing a first done signal 220. This can occur when a process agent empties a FIFO by reading its contents. Once that happens, the process agent may issue a first done signal to the upstream agent …. In other embodiments, the first done signal may be an instruction passed directly to a circular buffer of an upstream processing element. In a similar manner, the flow 200 continues with issuing a second done signal 230. This can occur when a downstream process agent empties a FIFO by reading its contents. Once that happens, the downstream process agent may issue a second done signal to the process agent., For many applications, the reconfigurable fabric can be a DMA slave, which enables a host processor to gain direct access to the instruction and data RAMs (and registers) that are located within the quads in the cluster. DMA transfers are initiated by the host processor on a system bus. Several DMA paths can propagate through the fabric in parallel. The DMA paths generally start or finish at a streaming interface to the processor system bus. DMA paths may be horizontal, vertical, or a combination (as determined by a router). To facilitate high bandwidth DMA transfers, several DMA paths can enter the fabric at different times, providing both spatial and temporal multiplexing of DMA channels., FIG. 10 shows a block diagram of a circular buffer. The circular buffer 1010 can include a switching element 1012 corresponding to the circular buffer.
The circular buffer and the corresponding switching element can be used in part for dynamic reconfiguration with partially resident agents. Using the circular buffer 1010 and the corresponding switching element 1012, data can be obtained from a first switching unit, where the first switching unit can be controlled by a first circular buffer. Data can be sent to a second switching element, where the second switching element can be controlled by a second circular buffer. The obtaining data from the first switching element and the sending data to the second switching element can include a direct memory access (DMA)., wherein each processing element/agent is directly communicatively coupled with other processing elements/agents so that any particular (second) accelerator/agent/processing element uses a switch (switch device – Figure 7) to route its processed results/inference to another (first) accelerator/agent (according to a DAG-based dataflow) to a circular buffer (GSM) of the other (first) accelerator/agent with, in particular, a signal (e.g., “done”) also being transmitted from the particular (second) accelerator/agent to the other (first) accelerator/agent after the DMA transference of the result data (but, it is noted, this signaling can go in either direction), wherein this process continues over the reconfigurable fabric until the completion of the set of (host determined) tasks, and wherein the circular buffer is a GSM because it enables/facilitates the appropriately scheduled/queued processing request in the other (first) accelerator/agent/processing element.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Li to incorporate the teachings of Nicol for the first inference accelerator to comprise a global synchronization manager (GSM) to notify the first inference accelerator of a direct memory access (DMA) transfer of the inference request result from the second inference accelerator to the first inference accelerator. The modification would be obvious because one of ordinary skill would be motivated to achieve improved flexibility, efficiency, and throughput in implementing applications that require high performance computing such as neural networks through a reconfigurable pipeline-based multi-core processing topology in which the synchronization of tasks across the array topology is achieved through local interactions between processing agents including circular memory buffers and signaling modalities (Nicol, [Abstract, 0009, 0010, 0011, 0012, 0046]).
However, Li and Nicol do not teach final. Although Li discloses that any of the computational nodes/PEs can at any point in the computational flow write to the system memory/host processor and indicates that any of the computational nodes/PEs can communicate directly with one another via the pipelined architecture, he does not explicitly disclose a scenario in which the results of a computational node are sent backward from a current computational node to a computational node that was previously used to generate the intermediate results used by the current computational node before then being transferred to the host processor/system memory. Although Nicol also discloses a reconfigurable topology with flexible switchable data flows from processors, he also does not specifically disclose this backward data flow.
However, Huang, in the analogous environment of efficiently implementing neural networks, teaches … a direct memory access (DMA) transfer of the final inference request result from the second inference accelerator to the first inference accelerator ([p. 2, Section 1, p. 2, Section 3.2, Figure 2] GPipe partitions a model across different accelerators and automatically splits a mini-batch of training examples into smaller micro-batches. By pipelining the execution across micro-batches, accelerators can operate in parallel. In addition, GPipe automatically recomputes the forward activations during the backpropagation to further reduce the memory consumption., During the forward pass (Figure 2c), the (k + 1)-th accelerator starts to compute F_{k+1,t} as soon as it finishes the (t − 1)-th micro-batch and receives inputs from F_{k,t}. At the same time, the k-th accelerator can start to compute F_{k,t+1}. Each accelerator repeats this process T times to finish the forward pass of the whole mini-batch. There are still up to O(K) idle time per accelerator, which refers to bubble overhead as depicted in Figure 2c. This bubble time is amortized by the number of micro-batches T. The last accelerator is also responsible for concatenating the outputs across micro-steps and computing the final loss. During the backward pass, gradients for each microbatch are computed based on the same model parameters as the forward pass.
Gradients are applied to update model parameters across accelerators only at the end of each minibatch., wherein a set of accelerators/computational nodes is allocated to perform neural network computations (inferences) split across multiple accelerators/devices in which each accelerator/device performs/hosts distinct computations associated with a portion of the neural network (e.g., a layer according to a configuration module) and such that, during training, the sequence of devices successively used during the forward propagation pass (with results from one device pipelined to the next device) leads to a final result (e.g., loss computation) that is fed backwards across the devices in the reversed sequence to generate a final result (gradients which are interpreted as corresponding to an update in a host processing system in response to a host-processor supplied mini-batch partitioned by GPipe) such that the final result includes the final inference result of the final (second) accelerator in the forward pass that is conferred via the final device (first accelerator) in the backward pass according to the pipelined architecture (DMA transference).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Li and Nicol to incorporate the teachings of Huang for the first inference accelerator to comprise a global synchronization manager (GSM) to notify the first inference accelerator of a direct memory access (DMA) transfer of the  final inference request result from the second inference accelerator to the first inference accelerator. The modification would be obvious because one of ordinary skill would be motivated to achieve improved neural network training efficiency and memory usage while achieving performance consistent with the state of the art by exploiting a pipelined architecture between processing elements that makes use of both forward and backward propagation of results across those elements, especially for the training of the neural network (Huang, [Abstract, p. 9, Section 6, Figure 1, Figure 2, Table 3]).

In regards to claim 16, the rejection of claim 11 is incorporated and Li further teaches in which … a global synchronization manager (GSM) to notify the second inference accelerator after a direct memory access (DMA) transfer of the intermediate inference request results from the first inference accelerator to the second inference accelerator. ([0028, 0040, 0048, Figure 11A, Figure 11B] Note that each PE 110 may, through the crossbar interface 112, determine which of the VMs 116, 117, and 118 may be configured for storing the weight, for being read or written by the corresponding PE 110, and for data transmissions with other computation nodes 100 (including other PEs 110, their auxiliary memories 115, and the system memory 120) in the NoC structure, whereby the functions of the VMs 116, 117, and 118 may be changed according to actual requirements for operation tasks., The MCU 133 may control the DMA engine 131 to process the data transmissions between the system memory 120 and the auxiliary memories 115 and the data transmissions between the auxiliary memories 115 of two adjacent computation nodes. Here, the data transmissions are DMA transmissions. Four PEs 110 and the auxiliary memories 115 connected thereto four computation nodes 100, and the configuration module 133 establishes a phase sequence for the computation nodes 100 according to the NN computation and instructs each of the computation nodes 100 to transmit data to another of the computation nodes 100 according to the phase sequence., The two kinds of operation tasks shown in FIGS.
11A and 11B are repeatedly switched and performed until all the operation tasks of the NN computation are completed., wherein the result/output from any particular accelerator/computational node (including intermediate outputs) is generated at a computational node and sent/routed to either system memory (host processor) or to another computational node according to the dataflow configured by the MCU control unit such that the crossbar is configured by the MCU to switch the data flow between different computational nodes (auxiliary memories) or between any computational node and system memory/host processor, such that all of these data transfers are performed by DMA, and such that the MCU performs the function of a global synchronization manager.)
However, Li does not explicitly teach … the second inference accelerator comprises …. In other words, although Li teaches a global synchronization manager (GSM) in the form of the MCU, he does not disclose that the GSM is a part of a (second) computational node.
However, Nicol, in the analogous environment of designing efficient implementations of neural networks, teaches in which the second inference accelerator comprises a global synchronization manager (GSM) to notify the second inference accelerator after a direct memory access (DMA) transfer of the intermediate inference request results from the first inference accelerator to the second inference accelerator ([0046, 0079, 0081, Figure 7, Figure 10] The flow 200 continues with issuing a first done signal 220. This can occur when a process agent empties a FIFO by reading its contents. Once that happens, the process agent may issue a first done signal to the upstream agent …. In other embodiments, the first done signal may be an instruction passed directly to a circular buffer of an upstream processing element. In a similar manner, the flow 200 continues with issuing a second done signal 230. This can occur when a downstream process agent empties a FIFO by reading its contents. Once that happens, the downstream process agent may issue a second done signal to the process agent., For many applications, the reconfigurable fabric can be a DMA slave, which enables a host processor to gain direct access to the instruction and data RAMs (and registers) that are located within the quads in the cluster. DMA transfers are initiated by the host processor on a system bus. Several DMA paths can propagate through the fabric in parallel. The DMA paths generally start or finish at a streaming interface to the processor system bus. DMA paths may be horizontal, vertical, or a combination (as determined by a router). To facilitate high bandwidth DMA transfers, several DMA paths can enter the fabric at different times, providing both spatial and temporal multiplexing of DMA channels., FIG. 10 shows a block diagram of a circular buffer. The circular buffer 1010 can include a switching element 1012 corresponding to the circular buffer.
The circular buffer and the corresponding switching element can be used in part for dynamic reconfiguration with partially resident agents. Using the circular buffer 1010 and the corresponding switching element 1012, data can be obtained from a first switching unit, where the first switching unit can be controlled by a first circular buffer. Data can be sent to a second switching element, where the second switching element can be controlled by a second circular buffer. The obtaining data from the first switching element and the sending data to the second switching element can include a direct memory access (DMA)., wherein any particular (first) accelerator/agent/processing element uses a switch (switch device – Figure 7) to route its processed results/inference to another (second) accelerator/agent (according to a DAG-based dataflow) to a circular buffer (GSM) of the other (second) accelerator/agent with, in particular, a signal (e.g., “done”) also being transmitted from the particular (first) accelerator/agent to the other (second) accelerator/agent after the DMA transference of the (intermediate) result data (but, it is noted, this signaling can go in either direction), and wherein, as previously noted, the circular buffer is a GSM because it enables/facilitates the appropriately scheduled/queued processing request in the other (second) accelerator/agent/processing element.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Li to incorporate the teachings of Nicol for the second inference accelerator to comprise a global synchronization manager (GSM) to notify the second inference accelerator after a direct memory access (DMA) transfer of the intermediate inference request results from the first inference accelerator to the second inference accelerator. The modification would be obvious because one of ordinary skill would be motivated to achieve improved flexibility, efficiency, and throughput in implementing applications that require high performance computing such as neural networks through a reconfigurable pipeline-based multi-core processing topology in which the synchronization of tasks across the array topology is achieved through local interactions between processing agents including circular memory buffers and signaling modalities (Nicol, [Abstract, 0009, 0010, 0011, 0012, 0046]).
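For illustration only, the done-signal flow quoted from Nicol [0046] above (a process agent empties a FIFO by reading its contents and then issues a done signal to the upstream agent) can be pictured with the following minimal sketch. It is not Nicol's design: the Agent class, its attributes, and the counter used to stand in for the done signal are all hypothetical, and the sketch assumes the data has already been placed in the downstream FIFO.

```python
from collections import deque


class Agent:
    """Hypothetical processing agent with an input FIFO and a done counter."""

    def __init__(self, name, upstream=None):
        self.name = name
        self.upstream = upstream  # agent that feeds this agent's FIFO, if any
        self.fifo = deque()
        self.done_received = 0  # done signals received from downstream agents

    def send(self, downstream, items):
        """Place result data into the downstream agent's FIFO."""
        downstream.fifo.extend(items)

    def drain(self):
        """Read the whole FIFO; signal 'done' upstream once it is emptied."""
        consumed = list(self.fifo)
        self.fifo.clear()
        if consumed and self.upstream is not None:
            self.upstream.done_received += 1
        return consumed


if __name__ == "__main__":
    first = Agent("first")
    second = Agent("second", upstream=first)
    first.send(second, [1, 2, 3])
    print(second.drain(), first.done_received)
```

In this sketch the synchronization is purely local to the two agents, with no third party involved, which is the property of the quoted passages on which the rejection relies.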

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Gao et al. (“HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing”, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016, pp. 126-137) teach a reconfigurable pipelined accelerator cross-bar array topology for deep neural network learning with direct routing to individual functional units (accelerator processing elements).

THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT LEWIS KULP whose telephone number is (571)272-7983. The examiner can normally be reached M, Th, F 8-5:30; Tu 8-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ROBERT LEWIS KULP/Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124