Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

				Examiner’s Amendment	 
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments regarding Statement of Reasons for Allowance”.
Authorization for this amendment was given in a communication with attorney Walstein B Smith on July 28, 2022. Additions to the claims are reflected by underline (example) and deletions are reflected by strikethrough (example).

Claim Amendment

1. (currently amended) A method comprising:
performing dataflow-based and instruction-based processing and exchanging fabric packets respectively in and between a plurality of processing elements interconnected as a fabric, each processing element comprising a compute engine and a fabric router; 
specifying communications and computations respectively corresponding to a plurality of branches and a plurality of nodes of a dataflow graph;
allocating a plurality of the processing elements to locally perform the computations, at least two of the processing elements being allocated to respectively locally perform a plurality of computation portions corresponding to a partitioned one of the nodes; 
performing the computations and communications in accordance with the specifying, the allocating, and a virtual channel specifier of each fabric packet sent via one or more virtual channels between the at least two processing elements to transfer between the respective computation portions data comprising one or more sources and results; 
wherein the virtual channel specifier identifies one of the one or more virtual channels, a first of the fabric packets has a first instance of the virtual channel specifier, a second of the fabric packets has a second instance of the virtual channel specifier, the first instance and the second instance are different, and the first fabric packet and the second fabric packet originate from a same one of the fabric routers; and
wherein the performing the computations and communications is at least in part further in accordance with instruction fetch addresses respectively calculated using a selected portion of each of a respective plurality of the fabric packets, the selection in accordance with a control field of the fabric packets, and the virtual channel specifier of a first plurality of the fabric packets is the selected portion for the calculation of a first plurality of the instruction fetch addresses.

2. (currently amended) A method comprising:
performing dataflow-based and instruction-based processing and exchanging fabric packets respectively in and between a plurality of processing elements interconnected as a fabric, each processing element comprising a compute engine and a fabric router; 
specifying communications and computations respectively corresponding to a plurality of branches and a plurality of nodes of a dataflow graph;
allocating a plurality of the processing elements to locally perform the computations, at least a single one of the processing elements being allocated to locally perform a plurality of respective first computation portions of each of at least two partitioned ones of the nodes, each of the partitioned nodes comprising a respective plurality of computation portions including the respective first computation portions; 
performing the computations and communications in accordance with the specifying, the allocating, and a virtual channel specifier of each fabric packet sent via one or more virtual channels between the at least single one of the processing elements and other ones of the allocated processing elements to transfer data between the respective first computation portions and other ones of the respective plurality of computation portions, the data comprising one or more sources and results; 
wherein the virtual channel specifier identifies one of the one or more virtual channels, a first of the fabric packets has a first instance of the virtual channel specifier, a second of the fabric packets has a second instance of the virtual channel specifier, the first instance and the second instance are different, and the first fabric packet and the second fabric packet originate from a same one of the fabric routers; and
wherein the performing the computations and communications is at least in part further in accordance with instruction fetch addresses respectively calculated using a selected portion of each of a respective plurality of the fabric packets, the selection in accordance with a control field of the fabric packets, and the virtual channel specifier of a first plurality of the fabric packets is the selected portion for the calculation of a first plurality of the instruction fetch addresses.


3. (original) The method of claims 1 or 2, wherein the processing elements are fabricated via wafer-scale integration.


4. (original) The method of claim 1, wherein the at least two processing elements are fabricated via wafer-scale integration on separate die of a single wafer.


5. (original) The method of claim 2, wherein the at least single one of the processing elements and other ones of the allocated processing elements are fabricated via wafer-scale integration on separate die of a single wafer.


6. (original) The method of claims 1 or 2, wherein at least some of the exchanged fabric packets are fabric vectors.


7. (original) The method of claims 1 or 2, wherein the data flow graph corresponds to all or any portions of a neural network, and at least a portion of the performing the computations corresponds to computing weights of the neural network.


8. (original) The method of claims 1 or 2, wherein the locally performed computations and the exchanging fabric packets are respectively performed by the compute engines and the fabric routers of the respective processing elements.


9. (original) The method of claims 1 or 2, wherein the sources and results are with respect to one or more of: multiply and accumulate operations, partial sums, activations, and final sums.


10. (previously presented) The method of claims 1 or 2, wherein the allocating enables parallel partitioned node computations on multiple of the processing elements providing reduced wall-clock time, compared to performing sequential non-partitioned node computations on a single one of the processing elements.


11. (previously presented) The method of claim 10, wherein the parallel computations at times comprise concurrent use of respective all digital multipliers.


12. (original) The method of claim 10, wherein the parallel computations comprise at least partially overlapped computations.


13. (previously presented) The method of claims 1 or 2, further comprising initializing the fabric with all node and branch parameters required for concurrent execution of the communications and computations respectively corresponding to the dataflow graph.


14. (original) The method of claim 13, further comprising, subsequent to the initializing, concurrently executing all layers of the dataflow graph for one or more of inference and training.


15. (original) The method of claim 14, wherein the layer of the dataflow graph comprise input, hidden, and output layers.


16. (original) The method of claim 14, wherein the concurrently executing does not require any access to storage external to the fabric for any intermediate state or additional node and branch parameters of the dataflow graph.


17. (original) The method of claim 16, wherein the dataflow graph is a neural network, the nodes correspond to neurons, the partitioned node corresponds to a split neuron, and at least some of the node and branch parameters of the dataflow graph correspond to a plurality of weights of the neural network.


18. (original) The method of claims 1 or 2, wherein except for defects, the fabric is homogeneous, the plurality of processing elements numbers three million, and each processing element comprises 48kB of private local storage for instructions and data.


19. (original) The method of claims 1 or 2, wherein the fabric is enabled to concurrently store and execute a dataflow graph having communications and computations requirements of up to a combined 24GB of instruction and data storage.


20. (original) The method of claim 19, wherein the data storage is used for one or more of weights, forward partial sums, activations, gradient accumulations, delta partial sums, layer errors, duplicated weights, and other implementation overhead, as required by the concurrently executing.


21. (original) The method of claim 7, wherein the allocating is performed by a node to processing element mapping process in accordance with predetermined criteria.


22. (original) The method of claim 21, wherein the mapping process is performed at least in part manually.


23. (original) The method of claim 21, wherein the mapping process is performed at least in part via software executing on a placement server external to the fabric.


24. (original) The method of claim 21, wherein the predetermined criteria comprises one or more of: reducing wall-clock time for mapping, reducing wall-clock time for configuring the fabric, reducing at least one data movement latency metric, reducing wall-clock time required for training, reducing wall-clock time required for inference after training, reducing the number of die required to fit the dataflow graph, constraining the processing elements used to a particular number of die, complying with at least one storage metric, accounting for known defects, reducing at least one power metric, and optimizing a score based on a weighted sum comprising one or more of the foregoing criteria.


25. (currently amended) An apparatus comprising:
means for performing dataflow-based and instruction-based processing and exchanging fabric packets respectively in and between a plurality of processing elements interconnected as a fabric, each processing element comprising a compute engine and a fabric router; 
means for specifying communications and computations respectively corresponding to a plurality of branches and a plurality of nodes of a dataflow graph;
means for allocating a plurality of the processing elements to locally perform the computations, at least two of the processing elements being allocated to respectively locally perform a plurality of computation portions corresponding to a partitioned one of the nodes; 
means for performing the computations and communications in accordance with the specifying, the allocating, and a virtual channel specifier of each fabric packet sent via one or more virtual channels between the at least two processing elements to transfer between the respective computation portions data comprising one or more sources and results; 
wherein the virtual channel specifier identifies one of the one or more virtual channels, a first of the fabric packets has a first instance of the virtual channel specifier, a second of the fabric packets has a second instance of the virtual channel specifier, the first instance and the second instance are different, and the first fabric packet and the second fabric packet originate from a same one of the fabric routers; and
wherein the performing the computations and communications is at least in part further in accordance with instruction fetch addresses respectively calculated using a selected portion of each of a respective plurality of the fabric packets, the selection in accordance with a control field of the fabric packets, and the virtual channel specifier of a first plurality of the fabric packets is the selected portion for the calculation of a first plurality of the instruction fetch addresses.


26. (currently amended) An apparatus comprising:
means for performing dataflow-based and instruction-based processing and exchanging fabric packets respectively in and between a plurality of processing elements interconnected as a fabric, each processing element comprising a compute engine and a fabric router; 
means for specifying communications and computations respectively corresponding to a plurality of branches and a plurality of nodes of a dataflow graph;
means for allocating a plurality of the processing elements to locally perform the computations, at least a single one of the processing elements being allocated to locally perform a plurality of respective first computation portions of each of at least two partitioned ones of the nodes, each of the partitioned nodes comprising a respective plurality of computation portions including the respective first computation portions; 
means for performing the computations and communications in accordance with the specifying, the allocating, and a virtual channel specifier of each fabric packet sent via one or more virtual channels between the at least single one of the processing elements and other ones of the allocated processing elements to transfer data between the respective first computation portions and other ones of the respective plurality of computation portions, the data comprising one or more sources and results; 
wherein the virtual channel specifier identifies one of the one or more virtual channels, a first of the fabric packets has a first instance of the virtual channel specifier, a second of the fabric packets has a second instance of the virtual channel specifier, the first instance and the second instance are different, and the first fabric packet and the second fabric packet originate from a same one of the fabric routers; and
wherein the performing the computations and communications is at least in part further in accordance with instruction fetch addresses respectively calculated using a selected portion of each of a respective plurality of the fabric packets, the selection in accordance with a control field of the fabric packets, and the virtual channel specifier of a first plurality of the fabric packets is the selected portion for the calculation of a first plurality of the instruction fetch addresses.


27. (original) The apparatus of claims 25 or 26, wherein the processing elements are fabricated via wafer-scale integration.


28. (original) The apparatus of claim 25, wherein the at least two processing elements are fabricated via wafer-scale integration on separate die of a single wafer.


29. (original) The apparatus of claim 26, wherein the at least single one of the processing elements and other ones of the allocated processing elements are fabricated via wafer-scale integration on separate die of a single wafer.


30. (original) The apparatus of claims 25 or 26, wherein at least some of the exchanged fabric packets are fabric vectors.


31. (original) The apparatus of claims 25 or 26, wherein the data flow graph corresponds to all or any portions of a neural network, and at least a portion of the means for performing the computations corresponds to computing weights of the neural network.


32. (original) The apparatus of claims 25 or 26, wherein the locally performed computations and the exchanging fabric packets are respectively performed by the compute engines and the fabric routers of the respective processing elements.


33. (original) The apparatus of claims 25 or 26, wherein the sources and results are with respect to one or more of: multiply and accumulate operations, partial sums, activations, and final sums.


34. (previously presented) The apparatus of claims 25 or 26, wherein the means for allocating enables parallel partitioned node computations on multiple of the processing elements providing reduced wall-clock time, compared to performing sequential non-partitioned node computations on a single one of the processing elements.


35. (previously presented) The apparatus of claim 34, wherein the parallel computations at times comprise concurrent use of respective all digital multipliers.


36. (original) The apparatus of claim 34, wherein the parallel computations comprise at least partially overlapped computations.


37. (previously presented) The apparatus of claims 25 or 26, further comprising means for initializing the fabric with all node and branch parameters required for concurrent execution of the communications and computations respectively corresponding to the dataflow graph.


38. (currently amended) The apparatus of claim 37, further comprising, operable subsequent to the initializing, means for concurrently executing all layers of the dataflow graph for one or more of inference and training.


39. (original) The apparatus of claim 38, wherein the layer of the dataflow graph comprise input, hidden, and output layers.


40. (original) The apparatus of claim 38, wherein the means for concurrently executing does not require any access to storage external to the fabric for any intermediate state or additional node and branch parameters of the dataflow graph.


41. (original) The apparatus of claim 40, wherein the dataflow graph is a neural network, the nodes correspond to neurons, the partitioned node corresponds to a split neuron, and at least some of the node and branch parameters of the dataflow graph correspond to a plurality of weights of the neural network.


42. (original) The apparatus of claims 25 or 26, wherein except for defects, the fabric is homogeneous, the plurality of processing elements numbers three million, and each processing element comprises 48kB of private local storage for instructions and data.


43. (original) The apparatus of claims 25 or 26, wherein the fabric is enabled to concurrently store and execute a dataflow graph having communications and computations requirements of up to a combined 24GB of instruction and data storage.


44. (original) The apparatus of claim 43, wherein the data storage is used for one or more of weights, forward partial sums, activations, gradient accumulations, delta partial sums, layer errors, duplicated weights, and other implementation overhead, as required by the concurrently executing.


45. (original) The apparatus of claim 31, wherein the means for allocating is performed by a node to processing element mapping process in accordance with predetermined criteria.


46. (original) The apparatus of claim 45, wherein the mapping process is performed at least in part manually.


47. (original) The apparatus of claim 45, wherein the mapping process is performed at least in part via software executing on a placement server external to the fabric.


48. (original) The apparatus of claim 45, wherein the predetermined criteria comprises one or more of: reducing wall-clock time for mapping, reducing wall-clock time for configuring the fabric, reducing at least one data movement latency metric, reducing wall-clock time required for training, reducing wall-clock time required for inference after training, reducing the number of die required to fit the dataflow graph, constraining the processing elements used to a particular number of die, complying with at least one storage metric, accounting for known defects, reducing at least one power metric, and optimizing a score based on a weighted sum comprising one or more of the foregoing criteria.


49. (new) The method of claims 1 or 2, further comprising: 
wherein the virtual channel specifier of a first plurality of the fabric packets is the selected portion for the calculation of a first plurality of the instruction fetch addresses; and
wherein at least part of an index of a second plurality of the fabric packets is the selected portion for the calculation of a second plurality of the instruction fetch addresses.


50. (new) The method of claim 49, wherein the calculation of the first plurality of the instruction fetch addresses comprises adding a multiple of the virtual channel specifier to a base register.


51. (new) The method of claim 50, wherein the multiple is 4.


52. (new) The method of claim 49, wherein the calculation of the second plurality of the instruction fetch addresses comprises adding the at least part of an index to a base register.


53. (new) The method of claim 52, wherein the at least part of an index comprises lower index bits of the index and the index further comprises upper index bits.


54. (new) The method of claim 49, further comprising: 
wherein the first plurality of the fabric packets are data fabric packets and the control field comprises a deasserted control bit; and
wherein the second plurality of the fabric packets are control fabric packets and the control field comprises an asserted control bit. 


55. (new) The apparatus of claims 25 or 26, further comprising: 
wherein the virtual channel specifier of a first plurality of the fabric packets is the selected portion for the calculation of a first plurality of the instruction fetch addresses; and
wherein at least part of an index of a second plurality of the fabric packets is the selected portion for the calculation of a second plurality of the instruction fetch addresses.


56. (new) The apparatus of claim 55, wherein the calculation of the first plurality of the instruction fetch addresses comprises adding a multiple of the virtual channel specifier to a base register.


57. (new) The apparatus of claim 56, wherein the multiple is 4.


58. (new) The apparatus of claim 55, wherein the calculation of the second plurality of the instruction fetch addresses comprises adding the at least part of an index to a base register.


59. (new) The apparatus of claim 58, wherein the at least part of an index comprises lower index bits of the index and the index further comprises upper index bits.


60. (new) The apparatus of claim 55, further comprising: 
wherein the first plurality of the fabric packets are data fabric packets and the control field comprises a deasserted control bit; and
wherein the second plurality of the fabric packets are control fabric packets and the control field comprises an asserted control bit.



				Reasons for Allowance
The following is an Examiner’s statement of reasons for allowance.
Claims 1-60 are considered allowable because, when the claims are read in light of the specification, as per MPEP § 2111.01 and Toro Co. v. White Consol. Indus., Inc., 199 F.3d 1295, 1299, 53 USPQ2d 1065, 1067 (Fed. Cir. 1999), none of the references of record, alone or in combination, discloses or suggests the combination of limitations specified in independent claims 1, 2, 25 and 26, which describe a fabric router that originates different fabric packets having respective different instances of the virtual channel specifier, the virtual channel specifier being a portion of each packet that is selected, in accordance with a control field of the fabric packet, for calculation of instruction fetch addresses. Reference Tanboise teaches a logical communication channel between VRF instances located on the router, and reference Gao teaches cache block fetch prediction with adaptive size of address and data packets. The combination fails to expressly disclose the above subject matter. The dependent claims are allowed for at least the same reason.
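For illustration only, the address calculation recited in claims 1 and 49-54 (and the corresponding apparatus claims 25 and 55-60) can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the names FabricPacket, fetch_address and lower_index_bits, the use of a single base value for both packet types, and the 4-bit width of the lower index bits are all hypothetical. The claims specify only that the control field selects which portion of the packet is used, that a multiple (4 in claims 51 and 57) of the virtual channel specifier is added to a base register, and that at least part of an index is added to a base register.

```python
from dataclasses import dataclass

VC_MULTIPLE = 4  # the multiple recited in claims 51 and 57


@dataclass(frozen=True)
class FabricPacket:
    """Hypothetical packet layout; field names and widths are assumptions."""
    control: bool  # control field: asserted = control packet, deasserted = data packet
    vc: int        # virtual channel specifier
    index: int     # index field (upper and lower index bits)


def fetch_address(pkt: FabricPacket, base: int, lower_index_bits: int = 4) -> int:
    """Compute an instruction fetch address from the portion of the packet
    selected by its control field.

    `base` stands in for the base register; `lower_index_bits` is an
    assumed width, since the claims do not give one.
    """
    if pkt.control:
        # Control packet: the selected portion is the lower index bits.
        lower = pkt.index & ((1 << lower_index_bits) - 1)
        return base + lower
    # Data packet: the selected portion is the virtual channel specifier.
    return base + VC_MULTIPLE * pkt.vc


# Two data packets from the same (hypothetical) router on different
# virtual channels resolve to different fetch addresses.
first = FabricPacket(control=False, vc=3, index=0)
second = FabricPacket(control=False, vc=5, index=0)
ctrl = FabricPacket(control=True, vc=3, index=0x2A)
addresses = [fetch_address(p, base=0x100) for p in (first, second, ctrl)]
```

In this sketch the control bit acts as a selector between the two packet portions, which mirrors the distinction drawn in claims 54 and 60 between data fabric packets (deasserted control bit) and control fabric packets (asserted control bit).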
  
Correspondence Information
Any inquiries concerning this communication or earlier communications from the examiner should be directed to LiWu Chang, who may be reached Monday through Thursday, between 10:00 a.m. and 6:00 p.m. EST, via telephone at (571) 270-3809 or facsimile transmission at (571) 270-4809. If you need to send an Official facsimile transmission, please send it to (571) 273-8300. If attempts to reach the examiner are unsuccessful, the Examiner’s Supervisor, Miranda Huang, can be reached at (571) 270-7092. Hand-delivered responses should be delivered to the Receptionist at the Customer Service Window, on the first floor on the south side of the Randolph Building, 401 Dulany Street, Alexandria, VA 22313.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Moreover, status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have any questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) toll-free at 1-866-217-9197.

/LI WU CHANG/Primary Examiner, Art Unit 2124