DETAILED ACTION
This communication is responsive to the RCE for Application 16/844314 filed on 4/27/2022. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims:
		Claims 1-18 and 20-24 are presented for examination.

Continued Examination under 37 CFR 1.114
3.	A request for continued examination under 37 CFR 1.114 was filed in this application after appeal to the Patent Trial and Appeal Board, but prior to a decision on the appeal. Since this application is eligible for continued examination under 37 CFR 1.114 and the fee set forth in 37 CFR 1.17(e) has been timely paid, the appeal has been withdrawn pursuant to 37 CFR 1.114 and prosecution in this application has been reopened pursuant to 37 CFR 1.114. Applicant’s submission filed on 4/27/2022 has been entered.

Response to Arguments
4.	Examiner statements in the mailed Non-Final Office action regarding limitations held obvious as common knowledge or well known in the art are taken to be admitted prior art because Applicant failed to traverse the Examiner’s assertions. See MPEP 2144.03 C.

5.	Applicant’s arguments in the amendment filed on 4/27/2022 regarding the claim rejections under 35 USC § 103 with respect to Claims 1-18 and 20-24 are moot in view of the new ground of rejection.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-12, 14, 16, 18 and 20-24 are rejected under 35 U.S.C. 103 as being unpatentable over Sarel et al. (hereinafter Sarel) US 2021/0049804 A1 in view of Univ XI AN JIAOTONG submitted with the IDS filed 5/26/2021 (hereinafter XI) CN 109409512 A. 
Regarding Claim 1, Sarel teaches a data processing system comprising a plurality of processors (Fig. 21 and Fig. 2A; plurality of parallel processors 200/202), wherein a first processor of the plurality of processors comprises at least one circuit configured to perform data transfer operations during at least some of a plurality of exchange stages, wherein for the first processor (Fig. 2A; parallel processor 200 including parallel processing unit 202 and cluster array 212), the at least one circuit is configured to:
perform data transfer operations to transfer outgoing data to one or more others of the processors during a first of the exchange stages (¶0047-¶0055; The parallel processing unit 202 can transfer data from system memory via the I/O unit 204 for processing. During processing the transferred data can be stored to on-chip memory (e.g., parallel processor memory 222) during processing, then written back to system memory. Also see Fig. 14 model and data parallelism 1406 & ¶0150; synchronization and data exchange varies across embodiments. In one embodiment the GPU link 910 couples with a high speed interconnect to transmit and receive data to other GPGPUs or parallel processors); 
receive incoming data from the one or more others of the processors during the first of the exchange stages (¶0047-¶0063; any one of the clusters 214A-214N of the processing cluster array 212 can process data that will be written to any of the memory units 224A-224N within parallel processor memory 222. The memory crossbar 216 can be configured to transfer the output of each cluster 214A-214N to any partition unit 220A-220N or to another cluster 214A-214N, which can perform additional processing operations on the output. Each cluster 214A-214N can communicate with the memory interface 218 through the memory crossbar 216 to read from or write to various external memory devices. In one embodiment the memory crossbar 216 has a connection to the memory interface 218 to communicate with the I/O unit 204, as well as a connection to a local instance of the parallel processor memory 222, enabling the processing units within the different processing clusters 214A-214N to communicate with system memory or other memory that is not local to the parallel processing unit 202. In one embodiment the memory crossbar 216 can use virtual channels to separate traffic streams between the clusters 214A-214N and the partition units 220A-220N. Also, see Fig. 14 & ¶0177 & ¶0187-¶0193; receiving a portion of data during an exchange stage); 
determine further outgoing data in dependence upon at least part of the incoming data (same as above with respect to data exchange and synchronization. For example, ¶0047-¶0063; any one of the clusters 214A-214N of the processing cluster array 212 can process data that will be written to any of the memory units 224A-224N within parallel processor memory 222. The memory crossbar 216 can be configured to transfer the output of each cluster 214A-214N to any partition unit 220A-220N or to another cluster 214A-214N, which can perform additional processing operations on the output. Each cluster 214A-214N can communicate with the memory interface 218 through the memory crossbar 216 to read from or write to various external memory devices. In one embodiment the memory crossbar 216 has a connection to the memory interface 218 to communicate with the I/O unit 204, as well as a connection to a local instance of the parallel processor memory 222, enabling the processing units within the different processing clusters 214A-214N to communicate with system memory or other memory that is not local to the parallel processing unit 202. In one embodiment the memory crossbar 216 can use virtual channels to separate traffic streams between the clusters 214A-214N and the partition units 220A-220N. Also, see ¶0187; Such software can directly issue computational workloads to the GPGPU 1506 or the computational workloads can be issued to the multi-core processor 1508, which can offload at least a portion of those operations to the GPGPU 1506); 
Sarel does not expressly teach count an amount of at least part of the incoming data received during the first of the exchange stages from the one or more others of the processors; 
and withhold performing data transfer operations to transfer the further outgoing data until the amount of the at least part of the incoming data received has reached a predefined amount; 
and in response to determining, based on the count, that the amount of the at least part of the incoming data received has reached the predefined amount, perform the data transfer operations to transfer the further outgoing data to the one or more others of the processors during a second of the exchange stages.
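For illustration only, the count, withhold, and release limitations recited above can be sketched as follows. This is a minimal sketch and is not drawn from the application or from either cited reference; the class name `ExchangeGate`, its methods, and the unit of "amount" are hypothetical.

```python
class ExchangeGate:
    """Hypothetical model: withhold further outgoing data until a receive count is met."""

    def __init__(self, predefined_amount):
        self.predefined_amount = predefined_amount
        self.received = 0   # running count of incoming data (first exchange stage)
        self.pending = []   # further outgoing data currently withheld
        self.sent = []      # data transferred in the second exchange stage

    def queue_further_outgoing(self, item):
        # Further outgoing data is held until the count condition is satisfied.
        self.pending.append(item)
        self._release_if_ready()

    def on_receive(self, amount):
        # Count the incoming data, then check whether the gate can open.
        self.received += amount
        self._release_if_ready()

    def _release_if_ready(self):
        if self.received >= self.predefined_amount:
            self.sent.extend(self.pending)
            self.pending.clear()


gate = ExchangeGate(predefined_amount=4)
gate.queue_further_outgoing("result-A")
gate.on_receive(3)
held_before = list(gate.pending)   # still withheld: only 3 of 4 units received
gate.on_receive(1)                 # count reaches the predefined amount
```

In this sketch the transfer is triggered by the count itself rather than by an external synchronization signal, which is the distinction the limitations draw.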
XI (analogous art) teaches a data processing system in Fig. 4, in which a computing array comprises PEs. Also, on p. 5 of the translated document, second paragraph (¶0057 as published), XI further teaches that, according to hardware resources and the computing performance requirements of the system, multiple configurable computing units can be instantiated and connected to each other to generate a convolution calculation array. Convolution calculations for different types of convolution layers can be completed by this array; for same networks in the model, there are two or more sizes of convolution kernels in the same convolutional layer, which can divide the array and provide different convolution parameters for different areas. In order to ensure the synchronization of the output results of all areas of the calculation array, by calculating the difference between different convolution kernel sizes, the time difference between the calculation units in different regions can be found to produce the output result. The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods.
XI further teaches count an amount of at least part of the incoming data received during the first of the exchange stages from the one or more others of the processors (page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed); 
and withhold performing data transfer operations to transfer the further outgoing data until the amount of the at least part of the incoming data received has reached a predefined amount (page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods); 
and in response to determining, based on the count, that the amount of the at least part of the incoming data received has reached the predefined amount, perform the data transfer operations to transfer the further outgoing data to the one or more others of the processors during a second of the exchange stages (page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed).
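For illustration only, the counter configuration quoted from step 5 of XI can be sketched as follows. This is a minimal sketch under stated assumptions: the names `Counter`, `configure_counters`, and `can_start` are hypothetical, the example values of k, a, b, and h are arbitrary, and "the input counters" is assumed to mean the input data counter and the input weight counter.

```python
from dataclasses import dataclass


@dataclass
class Counter:
    limit: int       # configured upper limit value
    value: int = 0   # current count

    def tick(self):
        # Saturating increment: never count past the upper limit.
        if self.value < self.limit:
            self.value += 1

    def at_limit(self):
        return self.value == self.limit


def configure_counters(k, a, b, h):
    # Upper limits as quoted from step 5 of the translated XI document.
    return {
        "input_data": Counter(k * a),
        "output_data": Counter(k * a),
        "input_weight": Counter(k * a * b),
        "output_channels": Counter(b),
        "output_map_size": Counter(h),
    }


def can_start(counters):
    # Calculation starts only when all input counters are at their upper limits.
    return counters["input_data"].at_limit() and counters["input_weight"].at_limit()


counters = configure_counters(k=3, a=2, b=2, h=4)
for _ in range(6):
    counters["input_data"].tick()
ready_without_weights = can_start(counters)   # weights not yet loaded
for _ in range(12):
    counters["input_weight"].tick()
ready = can_start(counters)
```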
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of XI into the system of Sarel in order to provide a flexible neural network computing unit, a computing array and a construction method thereof, and firstly extracts parameters required for design according to a target network model, and is designed by parameters (page 3 third paragraph from the translated document). The internal structure of the neural network computing unit, for the requirements of different convolution modes provided by the external input, the control module of the computing unit may perform partial or full calculation of the convolution of the corresponding mode of the storage and calculation module. Id. By instantiating and generating a number of configurable neural network computing units and arranging to generate a complete convolutional computing array, the array can be divided into regions and different convolution parameters can be input into different regions, which can complete parallelization of different types of convolution modes. Id. The invention designs a hardware architecture of a convolutional layer in a convolutional neural network, and can support the convolutional convolution mode of different network models under the premise of ensuring system computing performance, thereby greatly improving the flexibility of the system. Id. The working modes cached in the computing unit make full use of the data reusability of the convolutional neural network, effectively reducing the system power consumption caused by data movement, and reducing the storage burden to some extent, multiple computing units Composing a computational array can support parallel calculations of convolutional kernels of different sizes, and fully exploit the algorithm parallelism and data reusability of convolutional layers in convolutional neural networks. Id. 

Regarding Claim 2, Sarel in view of XI teaches the data processing system of claim 1, XI further teaches wherein for the first processor, the at least one circuit is configured to: 
prior to the determining that the amount of the at least part of the incoming data received has reached the predefined amount, perform only some of the data transfer operations to transfer only part of the outgoing data to one or more others of the processors (This limitation is obvious from page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods); 
and in response to the determining that the amount of incoming data received has reached the predefined amount: perform remaining data transfer operations to transfer a remaining part of the outgoing data to the one or more others of the processors during the first of the exchange stages (This limitation is obvious from page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods); 
and subsequently, perform the data transfer operations to transfer the further outgoing data to the one or more others of the processors during the second of the exchange stages (This limitation is obvious from page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods).
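For illustration only, the ordering recited in Claim 2 (a partial transfer before the count is reached, the remaining transfer once it is reached, and the further outgoing data afterwards in the second exchange stage) can be sketched as follows. This is a minimal sketch and is not drawn from the application or the cited references; the function `run_exchange` and all of its parameters are hypothetical.

```python
def run_exchange(outgoing, further_outgoing, incoming_chunks,
                 predefined_amount, early_count):
    """Return a log of (stage, item) transfers in the order they occur."""
    log = []

    # Before the count is reached: transfer only part of the outgoing data.
    for item in outgoing[:early_count]:
        log.append(("stage1", item))

    # Count incoming data until the predefined amount is reached.
    received = 0
    for chunk in incoming_chunks:
        received += chunk
        if received >= predefined_amount:
            break
    else:
        # Count never reached: the remaining transfers stay withheld.
        return log

    # Count reached: transfer the remaining part in the first exchange stage.
    for item in outgoing[early_count:]:
        log.append(("stage1", item))

    # Subsequently: transfer the further outgoing data in the second stage.
    for item in further_outgoing:
        log.append(("stage2", item))
    return log


reached = run_exchange(["a", "b", "c"], ["d"], [2, 2], 4, 1)
not_reached = run_exchange(["a", "b", "c"], ["d"], [2, 1], 4, 1)
```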

Regarding Claim 3, Sarel in view of XI teaches the data processing system of claim 2, XI further teaches wherein for the first processor, the at least one circuit is configured to: 
count an amount of a further part of the incoming data received during the first of the exchange stages from the one or more others of the processors (This limitation is obvious from page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods); 
and following starting to perform the remaining data transfer operations, determine that the amount of the further part of the incoming data received has reached a predefined amount (This limitation is obvious from page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods), 
wherein the subsequently, perform the data transfer operations to transfer the further outgoing data to the one or more others of the processors during the second of the exchange stages is performed in response to determining that the amount of the further part of the incoming data received has reached a predefined amount (This limitation is obvious from page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods).
Regarding Claim 4, Sarel in view of XI teaches the data processing system of claim 3, Sarel further teaches wherein the at least part of the incoming data is addressed to a first location in the first processor, wherein the further part of the incoming data is addressed to a second location in the first processor (obvious from Fig. 21 & this limitation is obvious from page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods).

Regarding Claim 5, Sarel in view of XI teaches the data processing system of claim 1, Sarel further teaches wherein, for the first processor, the one or more others of the processors comprises two or more processors (obvious from Fig. 14 1406 and 21 & this limitation is obvious from page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods).

Regarding Claim 6, Sarel in view of XI teaches the data processing system of claim 5, Sarel further teaches wherein, for the first processor, the two or more processors comprises only two processors (obvious from Fig. 14 1406 and 21 & this limitation is obvious from page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods).
Regarding Claim 7, Sarel in view of XI teaches the data processing system of claim 1, Sarel further teaches wherein the first processor comprises a plurality of processing units, configured to: 
receive part of the incoming data from the one or more others of the processors; and send part of the outgoing data to the one or more others of the processors (obvious from Fig. 14 1406 and 21 & this limitation is obvious from page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods); 
wherein the steps of counting the amount of incoming data received and determining that the amount of the incoming data received has reached the predefined amount are performed by one or more of the plurality of processing units of a first type (obvious from Fig. 14 1406 and 21 & this limitation is obvious from page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods).

Regarding Claim 8, Sarel in view of XI teaches the data processing system of claim 3, Sarel further teaches wherein the first processor comprises a plurality of processing units, configured to: receive part of the incoming data from the one or more others of the processors (obvious from Fig. 14 1406 and 21, and this limitation is obvious from pages 4-5 of the translated document or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods); and send part of the outgoing data to the one or more others of the processors (obvious from Fig. 14 1406 and 21, and this limitation is obvious from pages 4-5 of the translated document or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods); wherein the steps of counting the amount of incoming data received and determining that the amount of the incoming data received has reached the predefined amount are performed by one or more of the plurality of processing units of a first type (obvious from Fig. 14 1406 and 21, and this limitation is obvious from pages 4-5 of the translated document or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods).

Regarding Claim 9, Sarel in view of XI teaches the data processing system of claim 8, Sarel further teaches wherein the first processor comprises two of the plurality of processing units of the first type, wherein: a first of the plurality of processing units of the first type is configured to perform the steps of counting the amount of incoming data received and determining that the amount of the incoming data received has reached the predefined amount, a second of the plurality of processing units of the first type is configured to perform the steps of counting the amount of the further part of the incoming data received and determine that the amount of the further part of the incoming data received has reached the predefined amount (obvious from Fig. 14 1406 and 21, and this limitation is obvious from pages 4-5 of the translated document or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods).

Regarding Claim 10, Sarel in view of XI teaches the data processing system of claim 7, Sarel further teaches wherein a first processing unit of the plurality of processing units is configured to, subsequent to performing operations to send part of the outgoing data, cause control to pass to a second processing unit of the plurality of processing units for that second processing unit to perform operations to send part of the outgoing data (obvious from Fig. 14 1406 and 21, and this limitation is obvious from pages 4-5 of the translated document or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods).

Regarding Claim 11, Sarel in view of XI teaches the data processing system of claim 9, Sarel further teaches wherein the first processing unit is configured to perform the causing of control to pass in response to determining that an amount of a part of the incoming data received has reached the predetermined amount (obvious from Fig. 14 1406 and 21, and this limitation is obvious from pages 4-5 of the translated document or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods).

Regarding Claim 12, Sarel in view of XI teaches the data processing system of claim 1, Sarel further teaches wherein each of the incoming data, outgoing data, and further outgoing data comprise a set of gradients for weights of a machine learning model (¶0139-¶0152 and see gradients for weights of a machine learning model in abstract in the XI doc).

Regarding Claim 14, Sarel in view of XI teaches the data processing system of claim 1, Sarel further teaches wherein the at least one circuit comprises a remote direct memory access engine configured to perform the data transfer operations during each of a plurality of exchange stages (see Sarel in ¶0714 and XI in Fig. 4).

Regarding Claim 16, Sarel in view of XI teaches the data processing system of claim 1, Sarel further teaches wherein the determining further outgoing data in dependence upon at least part of the incoming data comprises reducing the at least part of the incoming data with data stored in memory of the first processor (¶0093, ¶0120).

Regarding Claim 18, Sarel in view of XI teaches the data processing system of claim 1, Sarel further teaches wherein the at least one circuit comprises at least one of a field programmable gate array or application specific integrated circuit configured to performing the counting of an amount of the incoming data (Fig. 27 or 29).

Claims 20-24 are substantially similar to the above claims, thus the same rationale applies.

10.	Claims 13 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Sarel in view of XI and further in view of Wentzlaff et al. (hereinafter Wentzlaff) US 7734894 B1.

Regarding Claim 13, Sarel in view of XI teaches the data processing system of claim 1, wherein the at least one circuit comprises: counting circuitry configured to perform the counting an amount of the incoming data received during the first of the exchange stages (this limitation is obvious from page 4-5 of translated doc or ¶0056-¶0057 as published. See step 5; The control module configures the upper limit value for each counter, the upper limit value of the input data counter and the output data counter is configured as k*a, and the upper limit value of the input weight counter is configured as k*a*b, The upper limit value of the output channel number counter is configured as b, and the upper limit value of the output characteristic map size counter is configured as h; when the input counters are all at the upper limit value, the calculation unit starts to perform calculation, and each output counter performs corresponding counting and The jump of the state machine is controlled; when each output counter is at the upper limit value, it indicates that the convolution calculation of some or all of the output channels of the convolutional layer has been completed…The calculation unit in the less calculation area will wait for the calculation unit in the more calculation area until the time difference is zero. Then start the calculation to ensure the synchronization of the output results of the array and complete the parallel calculation of different types of convolution methods); and an execution unit configured to execute computer readable instructions to: 
Sarel in view of XI does not expressly teach poll the counting circuitry to determine the amount of the incoming data received; and determine that the amount of the incoming data received has reached the predefined amount.
Wentzlaff teaches poll the counting circuitry to determine the amount of the incoming data received (164 or Col. 27, lines 13-35; When a DMA transaction completes, the DMA engine interrupts the main processor 802.  Alternatively, instead of receiving an interrupt, the main processor 802 poll a status register to determine when a DMA transaction completes); and determine that the amount of the incoming data received has reached the predefined amount (Col. 27, lines 13-35; When a DMA transaction completes, the DMA engine interrupts the main processor 802.  Alternatively, instead of receiving an interrupt, the main processor 802 poll a status register to determine when a DMA transaction completes).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Wentzlaff into the system of Sarel in view of XI in order to process and transfer data (abstract). Utilizing such teachings, e.g., parallel processing, enables the system to transfer data from system memory via the I/O unit for processing by the reconfigurability of an FPGA along with the performance and capability of an ASIC (known knowledge).
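For illustration only, the polling alternative cited from Wentzlaff (reading a status register instead of waiting for an interrupt) may be sketched generically as a loop that reads a count until it reaches a predefined amount. The function poll_until_reached, the max_polls bound, and the simulated register below are assumptions introduced for this sketch; they are not taken from Wentzlaff, Sarel or XI.

```python
import itertools
from typing import Callable, Optional


def poll_until_reached(read_count: Callable[[], int],
                       predefined_amount: int,
                       max_polls: int = 1000) -> Optional[int]:
    """Poll a counter; return the number of polls needed, or None if the
    predefined amount is not reached within max_polls reads."""
    for polls in range(1, max_polls + 1):
        # Each call to read_count stands in for one read of the counter.
        if read_count() >= predefined_amount:
            return polls
    return None


# Stand-in for counting circuitry: every read reports 4 more units received.
_received = itertools.count(start=4, step=4)


def simulated_register() -> int:
    return next(_received)
```

A bounded loop is used so that the sketch terminates even if the amount is never reached; an interrupt-driven design would replace the loop with a completion callback.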

Regarding Claim 15, Sarel in view of XI teaches the data processing system of claim 1, but they do not expressly teach wherein the plurality of processors are arranged in a ring topology such that the at least one circuit of each the first processor is configured to perform the data transfer operations during each of the plurality of exchange stages to transfer data to its two neighboring processors in the ring, wherein the counting the amount of the incoming data comprises counting an amount of data received from the two neighboring processors during the first of the exchange stages.
Wentzlaff further teaches wherein the plurality of processors are arranged in a ring topology such that the at least one circuit of each processor is configured to perform the data transfer operations during each of the plurality of exchange stages to transfer data to its two neighboring processors in the ring (Col. 5, lines 59-67 & Col. 6, lines 1-10; see claim 1. Also, a switch coupled to a processor forwards data to and from the processor or between neighboring processors over data paths of a one-dimensional interconnection network such as a ring network; also see Sridharan in ¶0272), wherein the counting the amount of the incoming data received during the first of the exchange stages from the one or more others of the processors comprises counting an amount of data received from the two neighboring processors during the first of the exchange stages (see claim 1. Also Col. 5, lines 59-67 & Col. 6, lines 1-10; The example of the integrated circuit 100 shown in FIG. 1 includes a two-dimensional array 101 of rectangular tiles with data paths 104 between neighboring tiles to form a mesh network.  The data path 104 between any two tiles can include multiple "wires" (e.g., serial, parallel or fixed serial and parallel signal paths on the IC 100) to support parallel channels in each direction.  Optionally, specific subsets of wires between the tiles can be dedicated to different mesh networks that can operate independently).
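For illustration only, the notion of counting data received from the two neighboring processors in a ring may be sketched as follows. The helper names ring_neighbors and count_from_neighbors, and the dictionary of per-source receive counts, are assumptions introduced for this sketch and are not drawn from the cited references; the sketch further assumes at least three processors so that the two neighbors are distinct.

```python
from __future__ import annotations


def ring_neighbors(rank: int, n_processors: int) -> tuple[int, int]:
    """Return the (left, right) neighbor indices of `rank` in a ring."""
    return (rank - 1) % n_processors, (rank + 1) % n_processors


def count_from_neighbors(rank: int,
                         n_processors: int,
                         received_by_source: dict[int, int]) -> int:
    """Sum the amounts received from the two ring neighbors only.

    received_by_source maps a source processor index to the amount of data
    received from it during one exchange stage (hypothetical bookkeeping).
    Assumes n_processors >= 3, so left and right are different processors.
    """
    left, right = ring_neighbors(rank, n_processors)
    return received_by_source.get(left, 0) + received_by_source.get(right, 0)
```

Data arriving from non-neighboring sources is ignored by count_from_neighbors, which reflects the claim language limiting the count to the two neighboring processors.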

11.	Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Sarel in view of XI and further in view of Sridharan et al.  (hereinafter Sridharan) US 2018/0322386 A1.

Regarding Claim 17, Sarel in view of XI teaches the data processing system of claim 16, but does not expressly teach wherein the at least one circuit is configured to implement a reduce-scatter collective: transferring data determined in dependence upon data received at the first respective processor in a preceding stage from at least one other of the processors; and determining further outgoing data in dependence upon at least part of the incoming data.
Sridharan teaches wherein the at least one circuits of the plurality of processors are configured to implement a reduce-scatter collective comprising the steps of each of the at least one circuits: transferring data determined in dependence upon data received at the respective processor in a preceding stage from at least one other of the processors (¶0193-¶0200 & Figs. 14a-14e; multiple types of low-level communication patterns are used to transfer data between nodes.  The low-level communication patterns used are illustrated in Table 5 below including SCATTER distribute data from a single array into multiple segments “reduce.” See Figs. 14a-e that illustrate data transfer using data parallelism); and determining further outgoing data in dependence upon at least part of the incoming data (¶0193-¶0200 & Figs. 14a-14e; multiple types of low-level communication patterns are used to transfer data between nodes.  The low-level communication patterns used are illustrated in Table 5 below including SCATTER distribute data from a single array into multiple segments “reduce.” See Figs. 14a-e that illustrate data transfer using data parallelism).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Sridharan into the system of Sarel in view of XI in order to configure the network interface to transmit and receive the gradient data associated with the trainable parameters during a workflow of a machine learning framework (abstract). Utilizing such teachings, e.g., parallel processing, enables the system to transfer data from system memory via the I/O unit for processing (¶0060).  Also, during processing, the transferred data can be stored to on-chip memory (e.g., parallel processor memory), then written back to system memory. Id.
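For illustration only, a generic ring-based reduce-scatter of the kind referred to in the claim language may be sketched as follows. This is a textbook-style formulation using summation as the reduction, offered as an assumption for clarity; it is not asserted to be the implementation of Sridharan, Sarel or XI, and the function name ring_reduce_scatter is hypothetical.

```python
from __future__ import annotations


def ring_reduce_scatter(data: list[list[int]]) -> list[tuple[int, int]]:
    """Simulate a sum reduce-scatter over a ring of n processors.

    data[i][c] is processor i's local value for chunk c (n chunks in total).
    Returns, for each processor, (index of the chunk it ends up owning,
    fully reduced value of that chunk).
    """
    n = len(data)
    buf = [list(row) for row in data]  # copy so the input is not modified
    for step in range(n - 1):
        # Snapshot what each processor sends in this stage, so that all
        # transfers of the stage use values from the preceding stage.
        sends = [((i - step) % n, buf[i][(i - step) % n]) for i in range(n)]
        for i in range(n):
            # Processor i receives from its left neighbor and reduces the
            # incoming value into its own copy of that chunk.
            chunk, value = sends[(i - 1) % n]
            buf[i][chunk] += value
    return [((i + 1) % n, buf[i][(i + 1) % n]) for i in range(n)]
```

In each stage a processor forwards the chunk it reduced in the preceding stage, so the outgoing data of a stage depends on the incoming data of the stage before it; after n - 1 stages each processor holds one fully reduced chunk.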

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MAHRAN ABU ROUMI whose telephone number is (469)295-9170. The examiner can normally be reached Monday-Thursday 6AM-5PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Emmanuel Moise, can be reached at 571-272-3865. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

MAHRAN ABU ROUMI
Primary Examiner
Art Unit 2455



/MAHRAN Y ABU ROUMI/           Primary Examiner, Art Unit 2455