DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claims
The present application is being examined under the claims filed on 11/23/2021.
Claims 1, 4, 7, 9, 10, 13-15, 17-21, and 25 are amended.
Claims 2, 3, and 8 are canceled.
Claims 1, 4-7, and 9-25 are rejected.
Claims 1, 4-7, and 9-25 are pending.

Drawings
The Drawings filed on 12/30/2016 are acceptable for examination purposes.

Specification
The Specification filed on 12/30/2016 is acceptable for examination purposes.

Response to Arguments
In reference to Rejections under 35 USC § 101 – Abstract idea
Applicant asserts that Claim 1 includes a description of a specific layout of computer components including a plurality of matrix processing units that each can perform matrix operations, and the claim limitations implement functional operations specifically for a layout utilizing the plurality of MPUs with simultaneous operations and stages.

Applicant's arguments filed 11/23/2021 have been fully considered but they are not persuasive. 

In reference to Rejections under 35 USC § 102
Applicant asserts that Cadambi does not disclose perform[ing] a plurality of partial matrix operations in a plurality of stages. Nor, according to Applicant, does Cadambi disclose that the partial matrix data includes a partial input matrix, wherein the partial input matrix is to be used by a first MPU in a particular stage of the partial matrix operations, and wherein the partial input matrix is to be used by a second MPU in a subsequent stage of the partial matrix operations.
Examiner respectfully disagrees. Cadambi, in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063], discloses that “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. Further, Cadambi in at least Fig. 6, Figs. 8a-8c, ¶ [0044]-[0049], and ¶ [0062]-[0064] discloses splitting the matrix into partial matrices to perform simultaneous parallel processing using multiple processing elements.
Applicant's arguments filed 11/23/2021 have been fully considered but they are not persuasive. 

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1, 4-7, and 9-25 are rejected under 35 U.S.C. 101.
Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim recites “perform a plurality of partial matrix operations in a plurality of stages […]” and “determine a result of the neural network operation based on the plurality of partial matrix operations”, which are directed to the abstract ideas of Mental Processes and Mathematical Concepts. This judicial exception is not integrated into a practical application because the generically recited computer elements do not add a meaningful limitation to the abstract idea; they amount to no more than implementing the abstract idea on a computer. These generic elements are the “memory circuitry to store a plurality of input matrices”, “a plurality of matrix processing units (MPUs), wherein each MPU includes processing circuitry to perform matrix arithmetic”, “interface circuitry to communicatively couple the plurality of MPUs”, “controller circuitry”, and “using the plurality of MPUs”. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because, when considered separately and in combination, they do not add significantly more (also known as an “inventive concept”) to the exception.
Claim 4 recites the additional limitation of “matrix data associated with one or more images and one or more filters, wherein the one or more images are associated with one or more channels”. The additional limitation does not amount to significantly more because it is directed to the abstract ideas of Mental Processes and Mathematical Concepts.
Claim 5 recites additional steps of “the controller circuitry to partition the plurality of input matrices into the plurality of input partitions based on the number of available MPUs is further to partition the plurality of input matrices based on one or more of: a number of channels associated with the one or more images; a number of filters; a number of images”.
Claim 6 recites additional steps of “distribute the plurality of partial matrix operations among the plurality of MPUs based on a height and a width of the result of the neural network operation”. The additional steps do not amount to significantly more because they are well-understood, routine, conventional computer functions as per Zhou et al. “Advanced partitioning techniques for massively distributed computation” and Chen et al. “DaDianNao: A Machine-Learning Supercomputer”.
Claim 7 recites the additional steps of “the plurality of MPUs is configured in a cyclic arrangement such that each MPU is communicatively coupled to a plurality of neighbor MPUs”, “the controller circuitry to perform the plurality of partial matrix operations using the plurality of MPUs is further to cause the plurality of MPUs to perform the plurality of partial matrix operations in a plurality of stages”, and “the controller circuitry to transmit, via the interface circuitry, the partial matrix data between the plurality of MPUs while performing the plurality of partial matrix operations is further to transmit a portion of the partial matrix data from each MPU to one or more of the neighbor MPUs while performing a particular stage of the partial matrix operations”. The additional steps do not amount to significantly more because they are well-understood, routine, conventional computer functions as per Zhou et al. “Advanced partitioning techniques for massively distributed computation” and Chen et al. “DaDianNao: A Machine-Learning Supercomputer”. The claim further recites the step of “perform the plurality of partial matrix operations in a plurality of stages”, which is directed to the abstract ideas of Mental Processes and Mathematical Concepts.

Claim 10 recites the additional limitation of “a partial result matrix determined by a first MPU in a particular stage of the partial matrix operations, and wherein the partial result matrix is to be used by a second MPU in a subsequent stage of the partial matrix operations”. The additional limitation does not amount to significantly more because it is directed to the abstract ideas of Mental Processes and Mathematical Concepts.
Claim 11 recites the additional limitation of “the neural network operation is associated with a forward propagation operation in a neural network”. The additional limitation does not amount to significantly more because it is directed to the abstract ideas of Mental Processes and Mathematical Concepts.
Claim 12 recites the additional limitation of “the neural network operation is associated with a backward propagation operation in a neural network”. The additional limitation does not amount to significantly more because it is directed to the abstract ideas of Mental Processes and Mathematical Concepts.

Claim 13 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim recites “performing a plurality of partial matrix operations in a plurality of stages […]” and “determining a result of the neural network operation based on the plurality of partial matrix operations”, which are directed to the abstract ideas of Mental Processes and Mathematical Concepts. This judicial exception is not integrated into a practical application because the generically recited computer elements do not add a meaningful limitation to the abstract idea; they amount to no more than implementing the abstract idea on a computer. These generic elements are the “matrix processing units (MPUs) in the matrix processor” and “interface circuitry of the matrix processor”. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because, when considered separately and in combination, they do not add significantly more (also known as an “inventive concept”) to the exception.
Claim 14 recites the additional limitation of “matrix data associated with one or more images and one or more filters, wherein the one or more images are associated with one or more channels”. The additional limitation does not amount to significantly more because it is directed to the abstract ideas of Mental Processes and Mathematical Concepts. The claim further recites the additional steps of “the plurality of processing elements is further configured to partition the plurality of input matrices based on one or more of: a number of channels associated with the one or more images; a number of filters; and a number of images”.
Claim 15 recites additional steps of “distributing the plurality of partial matrix operations to the plurality of MPUs based on a height and a width of the result of the neural network operation”. The additional steps do not amount to significantly more because they are well-understood, routine, conventional computer functions as per Zhou et al. “Advanced partitioning techniques for massively distributed computation” and Chen et al. “DaDianNao: A Machine-Learning Supercomputer”.
Claim 16 recites the additional step of “the plurality of MPUs is configured in a cyclic arrangement such that each MPU is communicatively coupled to a plurality of neighbor MPUs”. The additional step does not amount to significantly more because it is a well-understood, routine, conventional computer function as per Zhou et al. “Advanced partitioning techniques for massively distributed computation” and Chen et al. “DaDianNao: A Machine-Learning Supercomputer”.
Claim 17 recites the additional steps of “wherein each MPU transmits a portion of the partial matrix data to one or more of the neighbor MPU while performing a particular stage of the partial matrix operations”. The additional steps do not amount to significantly more because they are well-understood, routine, conventional computer functions as per Zhou et al. “Advanced partitioning techniques for massively distributed computation” and Chen et al. “DaDianNao: A Machine-Learning Supercomputer”.

Claim 18 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim recites “perform a plurality of partial matrix operations in a plurality of stages […]” and “determine a result of the neural network operation based on the plurality of partial matrix operations”, which are directed to the abstract ideas of Mental Processes and Mathematical Concepts.
Claim 19 recites the additional element of “communication interface circuitry to communicate with one or more remote matrix processing chips over a communication network”. This additional element merely transmits data over a network, which is a well-understood, routine, conventional computer function as recognized by the court decisions listed in MPEP § 2106.05(d).

Claim 20 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim recites “perform a plurality of partial matrix operations in a plurality of stages […]” and “determine a result of the neural network operation based on the plurality of partial matrix operations”, which are directed to the abstract ideas of Mental Processes and Mathematical Concepts. This judicial exception is not integrated into a practical application because the generically recited computer elements do not add a meaningful limitation to the abstract idea; they amount to no more than implementing the abstract idea on a computer. These generic elements are the “non-transitory machine accessible storage medium having instructions stored thereon, the instructions, when executed on a machine”, “matrix processing units (MPUs) in a matrix processor”, “interface circuitry to communicatively couple the plurality of MPUs”, and “using the plurality of MPUs”. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because, when considered separately and in combination, they do not add significantly more (also known as an “inventive concept”) to the exception. These additional elements are “receive an instruction to perform a neural network operation on the plurality of input matrices, wherein the neural network operation includes a plurality of convolution operations” and “transmit, via interface circuitry of the matrix processor, partial matrix data between the plurality of MPUs while performing the plurality of partial matrix operations […]”.
Claim 21 recites the additional limitation of “matrix data associated with one or more images and one or more filters, wherein the one or more images are associated with one or more channels”. The additional limitation does not amount to significantly more because it is directed to the abstract ideas of Mental Processes and Mathematical Concepts.
Claim 22 recites additional steps of “the instructions that cause the machine to partition the plurality of input matrices into the plurality of input partitions based on the number of available matrix processing units (MPUs) in the matrix processor further cause the machine to partition the plurality of input matrices based on one or more of: a number of channels associated with the one or more images; a number of filters; and a number of images”. The additional steps do not amount to significantly more because they are well-understood, routine, conventional computer functions as per Zhou et al. “Advanced partitioning techniques for massively distributed computation” and Chen et al. “DaDianNao: A Machine-Learning Supercomputer”.
Claim 23 recites additional steps of “distribute the plurality of partial matrix operations to the plurality of MPUs based on a height and a width of the result of the neural network operation”. The additional steps do not amount to significantly more because they are well-understood, routine, conventional computer functions as per Zhou et al. “Advanced partitioning techniques for massively distributed computation” and Chen et al. “DaDianNao: A Machine-Learning Supercomputer”.
Claim 24 recites an additional step of “the plurality of MPUs is configured in a cyclic arrangement such that each MPU is communicatively coupled to a plurality of neighbor MPUs”. The additional step does not amount to significantly more because it is a well-understood, routine, conventional computer function as per Zhou et al. “Advanced partitioning techniques for massively distributed computation” and Chen et al. “DaDianNao: A Machine-Learning Supercomputer”.
Claim 25 recites an additional step of “the instructions that cause the machine to transmit, via the interface circuitry, the partial matrix data between the plurality of MPUs while performing the plurality of partial matrix operations is further to transmit a portion of the partial matrix data from each MPU to one or more of the neighbor MPUs while performing a particular stage of the partial matrix operations”. The additional step does not amount to significantly more because it is a well-understood, routine, conventional computer function as per Zhou et al. “Advanced partitioning techniques for massively distributed computation” and Chen et al. “DaDianNao: A Machine-Learning Supercomputer”. The claim further recites the step of “perform the plurality of partial matrix operations in a plurality of stages”, which is directed to the abstract ideas of Mental Processes and Mathematical Concepts.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission.
Claims 1, 7, 9-11, 13, 16-18, 20, and 24 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 9-11 of U.S. Patent Application Publication No. US 20180189236 A1 in view of Cadambi et al. (hereinafter Cadambi) US 20110119467 A1.
Instant Application | US 20180189236 A1
Claim 1. An apparatus, comprising: memory circuitry to store a plurality of input matrices; a plurality of matrix processing units (MPUs), wherein each MPU includes processing circuitry to perform matrix arithmetic; interface circuitry to communicatively couple the plurality of MPUs; and controller circuitry to: receive an instruction to perform a neural network operation on the plurality of input matrices, wherein the neural network operation includes a plurality of convolution operations; partition the plurality of input matrices into a plurality of input partitions based on a number of available MPUs; distribute the plurality of input partitions among the plurality of MPUs, wherein each input partition is distributed to a particular MPU of the plurality of MPUs; perform a plurality of partial matrix operations in a plurality of stages using the 
Claim 9. An apparatus, comprising: a plurality of memory elements to store matrix data, wherein the matrix data comprises a plurality of input matrices; and a plurality of processing elements to perform a matrix operation associated with the plurality of input matrices, wherein the plurality of processing elements is configured to: partition the plurality of input matrices into a plurality of input partitions, wherein the plurality of input matrices is partitioned based on a number of available processing elements; distribute the plurality of input partitions among the plurality of processing elements, wherein each input partition is distributed to a particular processing element of the plurality of processing elements; perform a plurality of partial matrix operations using the plurality of processing elements; transmit partial matrix data between 
the plurality of processing elements is configured in a cyclic arrangement such that each processing element is communicatively coupled to a plurality of neighbor processing elements; and the plurality of neighbor processing elements of each processing element comprises a first neighbor processing element and a second neighbor processing element; wherein the plurality of processing elements is further configured to: perform the plurality of partial matrix operations in a plurality of stages; and transmit a portion of the partial matrix data from each processing element to one or more of the neighbor processing elements while performing a particular stage of the partial matrix operations; wherein the plurality of processing elements is further configured to transmit the portion of the partial matrix data from each processing element to the first neighbor processing element and the 
Claim 7 | Claim 9
Claim 9 | Claim 11
Claim 10 | Claim 9
Claim 11 | Claim 10
Claim 13. A method of performing a neural network operation on a matrix processor, comprising: receiving an instruction to perform a neural network operation on the plurality of input matrices, wherein the neural network operation includes a plurality of convolution operations; partitioning the plurality of input matrices into a plurality of input partitions based on a number of available MPUs; distributing the plurality of input partitions among the plurality of MPUs, wherein each input partition is distributed 
Claim 9. An apparatus, comprising: a plurality of memory elements to store matrix data, wherein the matrix data comprises a plurality of input matrices; and a plurality of processing elements to perform a matrix operation associated with the plurality of input matrices, wherein the plurality of processing elements is configured to: partition the plurality of input matrices into a plurality of input partitions, wherein the plurality of input matrices is partitioned based on a number of available processing elements; 
the plurality of processing elements is configured in a cyclic arrangement such that each processing element is communicatively coupled to a plurality of neighbor processing elements; and the plurality of neighbor processing elements of each processing element comprises a first neighbor processing element and a second neighbor processing element; wherein the plurality of processing elements is further configured to: perform the plurality of partial matrix operations in a plurality of stages; and transmit a portion of the partial matrix data from each processing 
Claim 16 | Claim 9
Claim 17 | Claim 9
Claim 18. A system, comprising: memory circuitry to store a plurality of input matrices; a plurality of matrix processing units (MPUs), wherein each MPU includes processing circuitry to perform matrix arithmetic; interface circuitry to communicatively couple the plurality of MPUs; and host processor circuitry to: receive an 
Claim 9. An apparatus, comprising: a plurality of memory elements to store matrix data, wherein the matrix data comprises a plurality of input matrices; and a plurality of processing elements to perform a matrix operation associated with the plurality of input matrices, wherein the plurality of processing elements is configured to: 
the plurality of processing elements is configured in a cyclic arrangement such that each processing element is communicatively coupled to a plurality of neighbor processing elements; and the plurality of neighbor processing elements of each processing element comprises a first neighbor processing element and a second neighbor processing element; wherein the plurality of 
Claim 20. At least one machine accessible storage medium having instructions stored thereon, the instructions, when executed on a machine, cause the machine to: perform a matrix operation using a plurality of input matrices, wherein the matrix 
Claim 9. An apparatus, comprising: a plurality of memory elements to store matrix data, wherein the matrix data comprises a plurality of input matrices; and a plurality of processing elements to perform a matrix operation associated with 
the plurality of processing elements is configured in a cyclic arrangement such that each processing element is communicatively coupled to a plurality of neighbor processing elements; and the plurality of neighbor processing elements of each processing element comprises a first neighbor 
Claim 24 | Claim 9


US 20180189236 A1 does not explicitly disclose:
“receive an instruction to perform a neural network operation on the plurality of input matrices, wherein the neural network operation includes a plurality of convolution operations”;
“transmit, via the interface circuitry, partial matrix data between the plurality of MPUs while performing the plurality of partial matrix operations, wherein each MPU is to transmit a portion of the partial matrix data to one or more of the plurality of MPUs simultaneously while each partial matrix operation of the plurality of partial matrix operations is being performed”;
However, Cadambi discloses:
“receive an instruction to perform a neural network operation on the plurality of input matrices, wherein the neural network operation includes a plurality of convolution operations”;
“transmit, via the interface circuitry, partial matrix data between the plurality of MPUs while performing the plurality of partial matrix operations, wherein each MPU is to transmit a portion of the partial matrix data to one or more of the plurality of MPUs simultaneously while each partial matrix operation of the plurality of partial matrix operations is being performed” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. Examiner notes that ¶ [0038] recites that “bi-directional communication between neighboring processing resources may be referred to as a "dual-cyclical" configuration” and ¶ [0052] of the Instant Specification recites “a dual-cyclical configuration of processing resources 210 enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 210, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine US 20180189236 A1 and Cadambi. One of ordinary skill would have been motivated to combine US 20180189236 A1 and Cadambi because MPEP § 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) "Obvious to try" choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
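A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.
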
Claim(s) 1, 4-7, and 9-25 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Cadambi et al. (hereinafter Cadambi) US 20110119467 A1.
In reference to claim 1. Cadambi teaches an apparatus, comprising:
“memory circuitry to store a plurality of input matrices” (Cadambi in at least Fig. 1, Abstract, ¶ [0029], ¶ [0042]-[0044], and ¶ [0060]-[0063]);
“a plurality of matrix processing units (MPUs), wherein each MPU includes processing circuitry to perform matrix arithmetic” (Cadambi in at least Fig. 1, Figs. 7-8c, ¶ [0007], ¶ [0008], ¶ [0021], ¶ [0027], ¶ [0040], ¶ [0047]-[0055], and ¶ [0058]-[0063]. Examiner notes that Convolutional Neural Networks (CNNs) involve matrix operations associated with a plurality of input matrices. Examiner notes that “MPU” is not a term of art; the broadest reasonable interpretation of an MPU is a generic processor able to perform the claimed functions. Examiner notes that performing matrix arithmetic is common practice);
“interface circuitry to communicatively couple the plurality of MPUs” (Cadambi in at least Fig. 2 and ¶ [0031] discloses the interconnection between the processing elements (PEs));
controller circuitry to:
“receive an instruction to perform a neural network operation on the plurality of input matrices, wherein the neural network operation includes a plurality of convolution operations” (Cadambi in at least Fig. 1, Figs. 7-8c, ¶ [0007], ¶ [0008], ¶ [0021], ¶ [0027], ¶ [0040], ¶ [0047]-[0055], and ¶ [0058]-[0063]. Examiner notes that Convolutional Neural Networks (CNNs) involve matrix operations associated with a plurality of input matrices);
“partition the plurality of input matrices into a plurality of input partitions based on a number of available MPUs” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], and ¶ [0052] discloses massively parallel learning and partitioning);
“distribute the plurality of input partitions among the plurality of MPUs, wherein each input partition is distributed to a particular MPU of the plurality of MPUs” (Cadambi in at least ¶ [0029] and ¶ [0063] discloses the distribution between the PEs);
“perform a plurality of partial matrix operations in a plurality of stages using the plurality of MPUs” (Cadambi in at least ¶ [0046] discloses accumulating the partial results. Further, Cadambi in at least Fig. 6, Figs. 8a-8c, ¶ [0044]-[0049], and ¶ [0062]-[0064] discloses splitting the matrix into partial matrices to perform simultaneous parallel processing using multiple processing elements);
“transmit, via the interface circuitry, partial matrix data between the plurality of MPUs while performing the plurality of partial matrix operations, wherein each MPU is to transmit a portion of the partial matrix data to one or more of the plurality of MPUs simultaneously while each partial matrix operation of the plurality of partial matrix operations is being performed” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. Examiner notes that ¶ [0038] recites that “bi-directional communication between neighboring processing resources may be referred to as a "dual-cyclical" configuration” and ¶ [0052] of the Instant Specification recites “a dual-cyclical configuration of processing resources 210 enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 210, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time”);
“wherein the partial matrix data includes a partial input matrix, wherein the partial input matrix is to be used by a first MPU in a particular stage of the partial matrix operations, and wherein the partial input matrix is to be used by a second MPU in a subsequent stage of the partial matrix operations” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. Further, Cadambi in at least Fig. 6, Figs. 8a-8c, ¶ [0044]-[0049], and ¶ [0062]-[0064] discloses splitting the matrix into partial matrices to perform simultaneous parallel processing using multiple processing elements);
“determine a result of the neural network operation based on the plurality of partial matrix operations” (Cadambi in at least ¶ [0031], ¶ [0046], and ¶ [0063] discloses accumulating the results).

In reference to claim 4. Cadambi teaches the apparatus of Claim 1 (as mentioned above), wherein the plurality of input matrices includes:
Cadambi further discloses:
“matrix data associated with one or more images and one or more filters, wherein the one or more images are associated with one or more channels” (Cadambi in at least Fig. 3, ¶ [0021], ¶ [0034], ¶ [0036], ¶ [0038], ¶ [0047], ¶ [0048], ¶ [0059]-[0061], and ¶ [0064]).

In reference to claim 5. Cadambi teaches the apparatus of Claim 4 (as mentioned above), wherein the controller circuitry to partition the plurality of input matrices into the plurality of input partitions based on the number of available MPUs is further to partition the plurality of input matrices based on one or more of:
Cadambi further discloses:
“a number of channels associated with the one or more images; a number of filters; a number of images” (Cadambi in at least Fig. 3, ¶ [0021], ¶ [0034], ¶ [0036], ¶ [0038], ¶ [0047], ¶ [0048], ¶ [0059]-[0061], and ¶ [0064]).

In reference to claim 6. Cadambi teaches the apparatus of Claim 1 (as mentioned above), wherein the controller circuitry is further to:
Cadambi further discloses:
“distribute the plurality of partial matrix operations among the plurality of MPUs based on a height and a width of the result of the neural network operation” (Cadambi in at least ¶ [0029] and ¶ [0063] discloses the distribution between the PEs).

In reference to claim 7. Cadambi teaches the apparatus of Claim 1 (as mentioned above), wherein:
Cadambi further discloses:
“the plurality of MPUs is configured in a cyclic arrangement such that each MPU is communicatively coupled to a plurality of neighbor MPUs” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. Examiner notes that ¶ [0038] recites that “bi-directional communication between neighboring processing resources may be referred to as a "dual-cyclical" configuration” and ¶ [0052] of the Instant Specification recites “a dual-cyclical configuration of processing resources 210 enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 210, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time”);
“the controller circuitry to transmit, via the interface circuitry, the partial matrix data between the plurality of MPUs while performing the plurality of partial matrix operations is further to transmit a portion of the partial matrix data from each MPU to one or more of the neighbor MPUs while performing a particular stage of the partial matrix operations” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. Examiner notes that ¶ [0038] recites that “bi-directional communication between neighboring processing resources may be referred to as a "dual-cyclical" configuration” and ¶ [0052] of the Instant Specification recites “a dual-cyclical configuration of processing resources 210 enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 210, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time”).

In reference to claim 9. Cadambi teaches the apparatus of Claim 7 (as mentioned above), wherein:
Cadambi further discloses:
“the neural network operation is associated with a weight update operation in a neural network” (Cadambi in at least ¶ [0059], ¶ [0060], and ¶ [0067]).

In reference to claim 10. Cadambi teaches the apparatus of Claim 7 (as mentioned above), wherein the partial matrix data includes:
Cadambi further discloses:
“a partial result matrix determined by a first MPU in a particular stage of the partial matrix operations, and wherein the partial result matrix is to be used by a second MPU in a subsequent stage of the partial matrix operations” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”).

In reference to claim 11. Cadambi teaches the apparatus of Claim 10 (as mentioned above), wherein:
Cadambi further discloses:
“the neural network operation is associated with a forward propagation operation in a neural network” (Cadambi in at least ¶ [0027]).

In reference to claim 12. Cadambi teaches the apparatus of Claim 10 (as mentioned above), wherein:
Cadambi further discloses:
“the neural network operation is associated with a backward propagation operation in a neural network” (Cadambi in at least ¶ [0027]).

In reference to claim 13. Cadambi teaches a method of performing a neural network operation on a matrix processor, comprising:
“receiving an instruction to perform the neural network operation on a plurality of input matrices, wherein the neural network operation includes a plurality of convolution operations” (Cadambi in at least Fig. 1, Figs. 7-8c, ¶ [0007], ¶ [0008], ¶ [0021], ¶ [0027], ¶ [0040], ¶ [0047]-[0055], and ¶ [0058]-[0063]. Examiner notes that Convolutional Neural Networks (CNNs) involve matrix operations associated with a plurality of input matrices);
“partitioning the plurality of input matrices into a plurality of input partitions based on a number of available matrix processing units (MPUs) in the matrix processor” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], and ¶ [0052] discloses massive parallel learning and partitioning);
“distributing the plurality of input partitions among a plurality of MPUs in the matrix processor, wherein each input partition is distributed to a particular MPU of the plurality of MPUs” (Cadambi in at least ¶ [0029] and ¶ [0063] discloses the distribution between the PEs);
“performing a plurality of partial matrix operations in a plurality of stages using the plurality of MPUs” (Cadambi in at least ¶ [0046] discloses accumulating the partial results. See at least Fig. 6, Figs. 8a-8c, ¶ [0044]-[0049], and ¶ [0062]-[0064], which disclose splitting the matrix into partial matrices to perform simultaneous parallel processing using multiple processing elements);
“transmitting, via interface circuitry of the matrix processor, partial matrix data between the plurality of MPUs while performing the plurality of partial matrix operations, wherein each MPU is to transmit a portion of the partial matrix data to one or more of the plurality of MPUs simultaneously while each partial matrix operation of the plurality of partial matrix operations is being performed” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. Examiner notes that ¶ [0038] recites that “bi-directional communication between neighboring processing resources may be referred to as a "dual-cyclical" configuration” and ¶ [0052] of the Instant Specification recites “a dual-cyclical configuration of processing resources 210 enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 210, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time”),
“wherein the partial matrix data includes a partial input matrix, wherein the partial input matrix is to be used by a first MPU in a particular stage of the partial matrix operations, and wherein the partial input matrix is to be used by a second MPU in a subsequent stage of the partial matrix operations” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. See at least Fig. 6, Figs. 8a-8c, ¶ [0044]-[0049], and ¶ [0062]-[0064], which disclose splitting the matrix into partial matrices to perform simultaneous parallel processing using multiple processing elements);
“determining a result of the neural network operation based on the plurality of partial matrix operations” (Cadambi in at least ¶ [0031], ¶ [0046], and ¶ [0063] discloses accumulating the results).

In reference to claim 14. Cadambi teaches the method of Claim 13 (as mentioned above), wherein:
Cadambi further discloses:
“the plurality of input matrices includes matrix data associated with one or more images and one or more filters, wherein the one or more images are associated with one or more channels” (Cadambi in at least Fig. 3, ¶ [0021], ¶ [0034], ¶ [0036], ¶ [0038], ¶ [0047], ¶ [0048], ¶ [0059]-[0061], and ¶ [0064]);
the plurality of input matrices is further partitioned based on one or more of:
“a number of channels associated with the one or more images; a number of filters; a number of images” (Cadambi in at least Fig. 3, ¶ [0021], ¶ [0034], ¶ [0036], ¶ [0038], ¶ [0047], ¶ [0048], ¶ [0059]-[0061], and ¶ [0064]).

In reference to claim 15. Cadambi teaches the method of Claim 13 (as mentioned above), further including:
Cadambi further discloses:
“distributing the plurality of partial matrix operations to the plurality of MPUs based on a height and a width of the result of the neural network operation” (Cadambi in at least ¶ [0029] and ¶ [0063] discloses the distribution between the PEs).

In reference to claim 16. Cadambi teaches the method of Claim 13 (as mentioned above), wherein:
Cadambi further discloses:
“the plurality of MPUs is configured in a cyclic arrangement such that each MPU is communicatively coupled to a plurality of neighbor MPUs” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. Examiner notes that ¶ [0038] recites that “bi-directional communication between neighboring processing resources may be referred to as a "dual-cyclical" configuration” and ¶ [0052] of the Instant Specification recites “a dual-cyclical configuration of processing resources 210 enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 210, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time”).

In reference to claim 17. Cadambi teaches the method of Claim 16 (as mentioned above), wherein:
Cadambi further discloses:
“wherein each MPU transmits a portion of the partial matrix data to one or more of the neighbor MPU while performing a particular stage of the partial matrix operations” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. Examiner notes that ¶ [0038] recites that “bi-directional communication between neighboring processing resources may be referred to as a "dual-cyclical" configuration” and ¶ [0052] of the Instant Specification recites “a dual-cyclical configuration of processing resources 210 enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 210, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time”).

In reference to claim 18. Cadambi teaches a system, comprising:
“memory circuitry to store a plurality of input matrices” (Cadambi in at least Fig. 1, Abstract, ¶ [0029], ¶ [0042]-[0044], and ¶ [0060]-[0063]);
“a plurality of matrix processing chips, wherein each matrix processing chip includes one or more matrix processing units (MPUs) to perform matrix arithmetic” (Cadambi in at least Fig. 1, Figs. 7-8c, ¶ [0007], ¶ [0008], ¶ [0021], ¶ [0027], ¶ [0040], ¶ [0047]-[0055], and ¶ [0058]-[0063]. Examiner notes that Convolutional Neural Networks (CNNs) involve matrix operations associated with a plurality of input matrices. Examiner further notes that "MPU" is not a term of art; the broadest reasonable interpretation of an MPU is a generic processor capable of performing the claimed functions. Examiner also notes that performing matrix arithmetic is common practice);
“interface circuitry to communicatively couple the plurality of matrix processing chips” (Cadambi in at least Fig. 2 and ¶ [0031] discloses the interconnection between the processing elements (PEs));
host processor circuitry to:
“receive an instruction to perform a neural network operation on the plurality of input matrices, wherein the neural network operation includes a plurality of convolution operations” (Cadambi in at least Fig. 1, Figs. 7-8c, ¶ [0007], ¶ [0008], ¶ [0021], ¶ [0027], ¶ [0040], ¶ [0047]-[0055], and ¶ [0058]-[0063]. Examiner notes that Convolutional Neural Networks (CNNs) involve matrix operations associated with a plurality of input matrices);
“partition the plurality of input matrices into a plurality of input partitions based on a number of available matrix processing chips” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], and ¶ [0052] discloses massive parallel learning and partitioning);
“distribute the plurality of input partitions among the plurality of matrix processing chips, wherein each input partition is distributed to a particular matrix processing chip of the plurality of matrix processing chips” (Cadambi in at least ¶ [0029] and ¶ [0063] discloses the distribution between the PEs);
“perform a plurality of partial matrix operations in a plurality of stages using the plurality of MPUs” (Cadambi in at least ¶ [0046] discloses accumulating the partial results. See at least Fig. 6, Figs. 8a-8c, ¶ [0044]-[0049], and ¶ [0062]-[0064], which disclose splitting the matrix into partial matrices to perform simultaneous parallel processing using multiple processing elements);
“transmit, via the interface circuitry, partial matrix data between the plurality of matrix processing chips while performing the plurality of partial matrix operations, wherein each matrix processing chip is to transmit a portion of the partial matrix data to one or more of the plurality of matrix processing chips simultaneously while each partial matrix operation of the plurality of partial matrix operations is being performed” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. Examiner notes that ¶ [0038] recites that “bi-directional communication between neighboring processing resources may be referred to as a "dual-cyclical" configuration” and ¶ [0052] of the Instant Specification recites “a dual-cyclical configuration of processing resources 210 enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 210, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time”),
“wherein the partial matrix data includes a partial input matrix, wherein the partial input matrix is to be used by a first MPU in a particular stage of the partial matrix operations, and wherein the partial input matrix is to be used by a second MPU in a subsequent stage of the partial matrix operations” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. See at least Fig. 6, Figs. 8a-8c, ¶ [0044]-[0049], and ¶ [0062]-[0064], which disclose splitting the matrix into partial matrices to perform simultaneous parallel processing using multiple processing elements);
“determine a result of the neural network operation based on the plurality of partial matrix operations” (Cadambi in at least ¶ [0031], ¶ [0046], and ¶ [0063] discloses accumulating the results).

In reference to claim 19. Cadambi teaches the system of Claim 18 (as mentioned above), further including:
Cadambi further discloses:
“a communication interface circuitry to communicate with one or more remote matrix processing chips over a communication network” (Cadambi in at least Fig. 2, ¶ [0026], and ¶ [0031]).

In reference to claim 20. Cadambi teaches at least one non-transitory machine accessible storage medium having instructions stored thereon, the instructions, when executed on a machine (Cadambi in at least ¶ [0024]), cause the machine to:
“receive an instruction to perform a neural network operation on the plurality of input matrices, wherein the neural network operation includes a plurality of convolution operations” (Cadambi in at least Fig. 1, Figs. 7-8c, ¶ [0007], ¶ [0008], ¶ [0021], ¶ [0027], ¶ [0040], ¶ [0047]-[0055], and ¶ [0058]-[0063]. Examiner notes that Convolutional Neural Networks (CNNs) involve matrix operations associated with a plurality of input matrices);
“partition the plurality of input matrices into a plurality of input partitions based on a number of available matrix processing units (MPUs) in a matrix processor” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], and ¶ [0052] discloses massive parallel learning and partitioning);
“distribute the plurality of input partitions among a plurality of MPUs in the matrix processor, wherein each input partition is distributed to a particular MPU of the plurality of MPUs” (Cadambi in at least ¶ [0029] and ¶ [0063] discloses the distribution between the PEs);
“perform a plurality of partial matrix operations in a plurality of stages using the plurality of MPUs” (Cadambi in at least ¶ [0046] discloses accumulating the partial results. See at least Fig. 6, Figs. 8a-8c, ¶ [0044]-[0049], and ¶ [0062]-[0064], which disclose splitting the matrix into partial matrices to perform simultaneous parallel processing using multiple processing elements);
“transmit, via interface circuitry of the matrix processor, partial matrix data between the plurality of MPUs while performing the plurality of partial matrix operations, wherein each MPU is to transmit a portion of the partial matrix data to one or more of the plurality of MPUs simultaneously while each partial matrix operation of the plurality of partial matrix operations is being performed” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. Examiner notes that ¶ [0038] recites that “bi-directional communication between neighboring processing resources may be referred to as a "dual-cyclical" configuration” and ¶ [0052] of the Instant Specification recites “a dual-cyclical configuration of processing resources 210 enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 210, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time”),
“wherein the partial matrix data includes a partial input matrix, wherein the partial input matrix is to be used by a first MPU in a particular stage of the partial matrix operations, and wherein the partial input matrix is to be used by a second MPU in a subsequent stage of the partial matrix operations” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. See at least Fig. 6, Figs. 8a-8c, ¶ [0044]-[0049], and ¶ [0062]-[0064], which disclose splitting the matrix into partial matrices to perform simultaneous parallel processing using multiple processing elements);
“determine a result of the neural network operation based on the plurality of partial matrix operations” (Cadambi in at least ¶ [0031], ¶ [0046], and ¶ [0063] discloses accumulating the results).

In reference to claim 21. Cadambi teaches the storage medium of Claim 20 (as mentioned above), wherein the plurality of input matrices includes:
Cadambi further discloses:
“matrix data associated with one or more images and one or more filters, wherein the one or more images are associated with one or more channels” (Cadambi in at least Fig. 3, ¶ [0021], ¶ [0034], ¶ [0036], ¶ [0038], ¶ [0047], ¶ [0048], ¶ [0059]-[0061], and ¶ [0064]).

In reference to claim 22. Cadambi teaches the storage medium of Claim 21 (as mentioned above), wherein the instructions that cause the machine to partition the plurality of input matrices into the plurality of input partitions based on the number of available matrix processing units (MPUs) in the matrix processor further cause the machine to partition the plurality of input matrices based on one or more of:
Cadambi further discloses:
“a number of channels associated with the one or more images; a number of filters; a number of images” (Cadambi in at least Fig. 3, ¶ [0021], ¶ [0034], ¶ [0036], ¶ [0038], ¶ [0047], ¶ [0048], ¶ [0059]-[0061], and ¶ [0064]).

In reference to claim 23. Cadambi teaches the storage medium of Claim 20 (as mentioned above), wherein the instructions further cause the machine to:
Cadambi further discloses:
“distribute the plurality of partial matrix operations to the plurality of MPUs based on a height and a width of the result of the neural network operation” (Cadambi in at least ¶ [0029] and ¶ [0063] discloses the distribution between the PEs).

In reference to claim 24. Cadambi teaches the storage medium of Claim 20 (as mentioned above), wherein:
Cadambi further discloses:
“the plurality of MPUs is configured in a cyclic arrangement such that each MPU is communicatively coupled to a plurality of neighbor MPUs” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. Examiner notes that ¶ [0038] recites that “bi-directional communication between neighboring processing resources may be referred to as a "dual-cyclical" configuration” and ¶ [0052] of the Instant Specification recites “a dual-cyclical configuration of processing resources 210 enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 210, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time”).

In reference to claim 25. Cadambi teaches the storage medium of Claim 24 (as mentioned above), wherein:
Cadambi further discloses:
“the instructions that cause the machine to transmit, via the interface circuitry, the partial matrix data between the plurality of MPUs while performing the plurality of partial matrix operations is further to transmit a portion of the partial matrix data from each MPU to one or more of the neighbor MPUs while performing a particular stage of the partial matrix operations” (Cadambi in at least Fig. 4, Figs. 8a-8c, Abstract, ¶ [0007], ¶ [0021], ¶ [0022], ¶ [0027], ¶ [0029], ¶ [0031], ¶ [0036], ¶ [0040], ¶ [0044], ¶ [0046], ¶ [0048]-[0050], ¶ [0052], and ¶ [0063] “Each core 100 has p=N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each. Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other”. Examiner notes that ¶ [0038] recites that “bi-directional communication between neighboring processing resources may be referred to as a "dual-cyclical" configuration” and ¶ [0052] of the Instant Specification recites “a dual-cyclical configuration of processing resources 210 enables each processing resource to perform matrix computations while simultaneously obtaining matrix operands and data from both of its neighboring processing resources 210, which significantly reduces the latency for communicating matrix operands, and thus avoids any idle processing time”).

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Viker A. Lamardo whose telephone number is (571)270-5871. The examiner can normally be reached Mon. - Fri. 9 AM - 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/VIKER A LAMARDO/Primary Examiner, Art Unit 2126