Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on January 13, 2022 has been entered.
Response to Amendment
The amendment filed on January 13, 2022 has been entered.
The amendments to claims 1-13 and 15-20 are acknowledged. New claims 21-24 have been added.

Response to Arguments
Applicant’s arguments, see pages 8-11 of Remarks, filed January 13, 2022, have been fully considered. Applicant’s arguments are directed to the amended claims and are addressed in the claim rejections below.

Claim Rejections - 35 USC § 101
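35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
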
Claims 1, 9 and 16 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more. 

	Claim 1 is directed to a method, which is one of the statutory categories of invention. The claim recites a method of “generating one or more data addresses using one or more neural networks based, at least in part, on one or more data offset values”. The limitations merely employ mathematical relationships/formulas to calculate the data addresses based on the data offset values. This idea is similar to the "mathematical concepts" found to be abstract ideas by the courts, such as an algorithm for calculating parameters indicating an abnormal condition, In re Grams, 888 F.2d 835, 12 U.S.P.Q.2d 1824 (Fed. Cir. 1989), and calculating the difference between local and average data values, In re Abele, 684 F.2d 902, 214 U.S.P.Q. 682 (CCPA 1982) (MPEP 2106.04(a)(2)).
	The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The recitation of “one or more neural networks” is not enough by itself to transform the exception into a patentable invention, because this limitation is a mere instruction to implement the idea by performing generic functions at a high level of generality, such as generating, transmitting, storing, retrieving and processing data through the program. This limitation also describes a “mathematical calculation”. Therefore, the claim does not include additional elements providing significantly more than the judicial exception.

	Claim 9 recites a computer readable storage medium storing instructions executable by one or more processors to “use one or more neural networks to generate one or more data addresses based, at least in part, on one or more data offset values” (similar to claim 1). The claim is directed to a manufacture (an article produced from materials), which is a statutory category of invention. Claim 9 is rejected under the same rationale as claim 1.

	Claim 16 recites a processor, which is a mechanical and/or electrical device such as a general purpose computer. Thus, the claim is directed to a manufacture or a machine, which are statutory categories of invention. The claim recites the limitation of “one or more circuits to use one or more neural networks to generate one or more data addresses based, at least in part, on one or more data offset values” (similar to claim 1). Claim 16 is rejected under the same rationale as claim 1.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-24 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.

Claim 1 recites “generating one or more data addresses using one or more neural networks based, at least in part, on one or more data offset values”; claim 9 recites “use one or more neural networks to generate one or more data addresses based, at least in part, on one or more data offset values”; and claim 16 recites “one or more circuits to use one or more neural networks to generate one or more data addresses based, at least in part, on one or more data offset values”. It is unclear from these limitations how the data addresses are generated based on the “data offset values”. Thus, the claims are indefinite because they merely recite a use without any active, positive steps delimiting how this use is actually practiced. Therefore, the claims are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph.

Dependent claims 2-8, 10-15 and 17-24 are rejected because they depend upon independent claims 1, 9 and 16.

Claim 22 depends from independent claim 16 and recites “the one or more data offset values include one or more column offset values and one or more memory offset values …”. It is unclear what purpose the “column offset values” included in the data offset values serve in generating the data addresses. Therefore, the scope of claim 22 is indefinite.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.



Claims 1, 9-10, 16, 19 and 21 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Yamamoto et al. (U.S. Patent Application Publication 2010/0215253 A1).

	Regarding claim 1, Yamamoto discloses a method, comprising: 
generating one or more data addresses using one or more neural networks (FIG. 1; paragraph [0051], a CNN processing unit 63 is a feature detection processing unit including a hierarchical calculation processing apparatus.  Details of the CNN processing unit 63 will be described later with reference to FIG. 2) based, at least in part, on one or more data offset values (Paragraph [0132], the ring buffer management unit 103 outputs the calculated address for each line (ring counter value) and the offset address value to a memory access control unit 110.  An offset address setting unit 111 temporarily stores the offset address sent from the network composition management unit 102, and outputs the stored value to the memory access control unit 110; paragraph [0149], the memory access control unit 110 generates physical addresses based on the ring counter values and offset address value sent from the ring buffer management unit 103).
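For illustration only, the address generation cited above (a physical address formed from a ring counter value and an offset address value) can be sketched as follows. The function names, the `line_width` parameter, and the linear line-by-line memory layout are assumptions made for this sketch; they are not taken from Yamamoto or from the claims.

```python
def line_start_address(offset_address, ring_counter, line_width):
    """Start address of one stored line inside a ring buffer.

    offset_address: start position of the ring buffer in memory (assumed).
    ring_counter:   index of the line within the ring buffer (assumed).
    line_width:     number of addressable elements per line (assumed).
    """
    return offset_address + ring_counter * line_width


def next_ring_counter(ring_counter, ring_lines):
    """Advance the ring counter, wrapping to 0 after the last line."""
    return (ring_counter + 1) % ring_lines


def pixel_address(offset_address, ring_counter, line_width, column):
    """Address of one element: line start plus column position."""
    return line_start_address(offset_address, ring_counter, line_width) + column
```

Under these assumptions, a buffer starting at address 0x1000 with 64 elements per line places element 5 of line 2 at 0x1000 + 2 * 64 + 5.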

Regarding claim 9, Yamamoto discloses a non-transitory computer readable storage medium storing instructions (FIG. 1; paragraph [0052], reference numeral 69 denotes a ROM (Read Only Memory), which stores instructions that specify the operations of the CPU 68 and parameter data required for various calculations), which if performed by one or more processors (Paragraph [0052], reference numeral 68 denotes a CPU, which controls the operation of this apparatus as a whole), cause the one or more processors to at least: 
use one or more neural networks (Paragraph [0051], a CNN processing unit 63 is a feature detection processing unit including a hierarchical calculation processing apparatus.  Details of the CNN processing unit 63 will be described later with reference to FIG. 2) to generate one or more data addresses based, at least in part, on one or more data offset values (Paragraph [0132], the ring buffer management unit 103 outputs the calculated address for each line (ring counter value) and the offset address value to a memory access control unit 110.  An offset address setting unit 111 temporarily stores the offset address sent from the network composition management unit 102, and outputs the stored value to the memory access control unit 110; paragraph [0149], the memory access control unit 110 generates physical addresses based on the ring counter values and offset address value sent from the ring buffer management unit 103).

Regarding claim 10, Yamamoto discloses everything claimed as applied above (see claim 9), and Yamamoto further discloses wherein the instructions, which if performed by the one or more processors, further cause the one or more processors to perform one or more neural network operations (Paragraphs [0053]-[0054], FIG. 2 is a block diagram showing an example of the arrangement of the hierarchical calculation processing apparatus in the CNN processing unit 63 of the first embodiment. The hierarchical calculation processing apparatus shown in FIG. 2 is used to execute hierarchical calculations shown in, for example, FIG. 3. In FIG. 3, a processing node indicates a block which executes processing for obtaining a convolution calculation result from a convolution calculation target image and convolution kernels … For example, the fourth processing node in FIG. 3 executes convolution calculations by applying convolution kernels having different coefficients to the outputs from the first to third processing nodes. Then, the fourth processing node adds the respective convolution calculation results, and executes nonlinear transformation to obtain a calculation result. Furthermore, the calculation result of the fourth processing node is input to the sixth and seventh processing nodes) that include retrieving input data stored at the one or more data addresses (Paragraph [0132], the ring buffer management unit 103 outputs the calculated address for each line (ring counter value) and the offset address value to a memory access control unit 110. An offset address setting unit 111 temporarily stores the offset address sent from the network composition management unit 102, and outputs the stored value to the memory access control unit 110; paragraph [0149], the memory access control unit 110 generates physical addresses based on the ring counter values and offset address value sent from the ring buffer management unit 103. Furthermore, the memory access control unit 110 calculates addresses required to read out calculation target pixel data required for the convolution calculations in the calculation unit 106).

Regarding claim 16, Yamamoto discloses a processor (FIG. 1; paragraph [0052], reference numeral 68 denotes a CPU, which controls the operation of this apparatus as a whole), comprising: 
one or more circuits (Paragraph [0051]; FIG. 1 is a block diagram showing an example of the arrangement of a pattern detection apparatus, which comprises a hierarchical calculation processing circuit according to the first embodiment) to use one or more neural networks (Paragraph [0051], a CNN processing unit 63 is a feature detection processing unit including a hierarchical calculation processing apparatus.  Details of the CNN processing unit 63 will be described later with reference to FIG. 2) to generate one or more data addresses based, at least in part, on one or more data offset values (Paragraph [0132], the ring buffer management unit 103 outputs the calculated address for each line (ring counter value) and the offset address value to a memory access control unit 110.  An offset address setting unit 111 temporarily stores the offset address sent from the network composition management unit 102, and outputs the stored value to the memory access control unit 110; paragraph [0149], the memory access control unit 110 generates physical addresses based on the ring counter values and offset address value sent from the ring buffer management unit 103).

Regarding claim 19, Yamamoto discloses everything claimed as applied above (see claim 16), and Yamamoto further discloses wherein the one or more circuits (Paragraph [0051]; FIG. 1 is a block diagram showing an example of the arrangement of a pattern detection apparatus, which comprises a hierarchical calculation processing circuit according to the first embodiment) are to perform one or more neural network operations on data representing one or more input images (Paragraph [0051], the pattern detection apparatus has a function of detecting a specific object (image pattern) in image data … A CNN processing unit 63 is a feature detection processing unit including a hierarchical calculation processing apparatus. Details of the CNN processing unit 63 will be described later with reference to FIG. 2), wherein the data is accessed from memory using the one or more data addresses (Paragraph [0149], the memory access control unit 110 generates physical addresses based on the ring counter values and offset address value sent from the ring buffer management unit 103. Furthermore, the memory access control unit 110 calculates addresses required to read out calculation target pixel data required for the convolution calculations in the calculation unit 106).

	Regarding claim 21, Yamamoto discloses everything claimed as applied above (see claim 16), and Yamamoto further discloses wherein the one or more data offset values include one or more memory offset values (FIG. 2; paragraph [0072], the memory 104 is divided into partial areas assigned to respective processing nodes, and each partial area is used as a ring buffer. FIG. 5 illustrates a state in which the memory 104 is divided into the partial areas upon execution of the hierarchical calculations shown in FIG. 3. FIG. 5 shows offset addresses) relative to one or more start addresses of image data to be used (Paragraph [0080], the address calculation parameter storage table 107 held by the network composition management unit 102 holds the following pieces of information for each processing node, as shown in FIGS. 8A, 8B and 8C; paragraph [0083], read counter value: This counter value is that having line-storing areas as units with reference to the start position of the ring buffer assigned to the memory 104 (see FIG. 7; examples of counter values are described in parentheses); paragraph [0085], offset address: An address (see FIGS. 5 and 7) indicating the start position of the ring buffer assigned to that processing node in the memory 104) as input by the one or more neural networks (FIG. 1; paragraph [0051], the pattern detection apparatus has a function of detecting a specific object (image pattern) in image data … A CNN processing unit 63 is a feature detection processing unit including a hierarchical calculation processing apparatus. Details of the CNN processing unit 63 will be described later with reference to FIG. 2; paragraphs [0053]-[0054], FIG. 2 is a block diagram showing an example of the arrangement of the hierarchical calculation processing apparatus in the CNN processing unit 63 of the first embodiment. The hierarchical calculation processing apparatus shown in FIG. 2 is used to execute hierarchical calculations shown in, for example, FIG. 3. In FIG. 3, a processing node indicates a block which executes processing for obtaining a convolution calculation result from a convolution calculation target image and convolution kernels … For example, the fourth processing node in FIG. 3 executes convolution calculations by applying convolution kernels having different coefficients to the outputs from the first to third processing nodes. Then, the fourth processing node adds the respective convolution calculation results, and executes nonlinear transformation to obtain a calculation result. Furthermore, the calculation result of the fourth processing node is input to the sixth and seventh processing nodes).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


Claims 2-3, 5-8, 12-15, 17-18, 20 and 22-24 are rejected under 35 U.S.C. 103 as being unpatentable over Yamamoto et al. (U.S. Patent Application Publication 2010/0215253 A1) in view of Juffa et al. (U.S. Patent No. 7,912,889 B1).

	Regarding claim 2, Yamamoto discloses everything claimed as applied above (see claim 1), and Yamamoto discloses further comprising copying image data stored at the one or more data addresses (FIGS. 1 and 2; paragraph [0149], the memory access control unit 110 generates physical addresses based on the ring counter values and offset address value sent from the ring buffer management unit 103. Furthermore, the memory access control unit 110 calculates addresses required to read out calculation target pixel data required for the convolution calculations in the calculation unit 106), and performing one or more neural network operations based, at least in part, on the copied image data (Paragraph [0149], the convolution calculations in the calculation unit 106).
	However, Yamamoto does not specifically disclose copying image data to an image tile, and performing operations on the image tile.
	In the similar field of endeavor, Juffa discloses (Abstract, the present invention enables efficient matrix multiplication operations on parallel processing devices.  One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations.  Another embodiment is a second method for mapping CTAs to result tiles.  Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations …) copying image data to an image tile (Col 5, lines 4-21, FIG. 2 illustrates the graphics adapter 102 of FIG. 1, according to one embodiment of the invention.  As shown, the graphics adapter 102 includes a graphics processing unit ("GPU") 200 and a global memory ("GMEM") 202; Col 6, lines 13-44, a local memory ("LMEM") that may be included within each streaming multiprocessor; Col 14, lines 44-62, FIG. 9 illustrates a flowchart of method steps for allocating work among the threads of a CTA when performing a non-transposed copy operation or a result tile computation, according to one embodiment of the invention.  For purposes of discussion only, it is assumed that one CTA executing on one of the streaming multiprocessors of the graphics processing unit 200 is either copying the elements of a 32x32 source tile 1000 stored in the GMEM 202, as illustrated in FIG. 10A, to local memory to create a 32x32 local memory tile 1002, as illustrated in FIG. 10B … the tile elements are copied to the corresponding tile element positions in local memory tile 1002 using the GMEM-to-LMEM address mapping shown in FIGS. 10A and 10B), and performing operations on the image tile (Col 14, lines 59-62, for the result tile computation, the positions of the tile elements of result tile 1000 are first determined, and then the tile elements are computed according to the method of FIG. 7).
	Yamamoto and Juffa are analogous art because both pertain to the use of image data processing apparatuses. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, applying the matrix multiplication operations on a parallel processing device taught by Juffa to compute the result matrix on a tile-by-tile basis in a tiled matrix multiplication operation, as this could be used to reduce the number of times the memory is accessed by the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of Juffa to obtain the invention as specified in the claim.
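For illustration only, the tile copy and per-tile operation discussed in this rejection can be sketched as follows. The function names, the row-major list-of-lists layout, and the summation used as a stand-in "operation" are assumptions made for this sketch; they are not taken from Juffa, Yamamoto, or the claims.

```python
def copy_tile(source, tile_row, tile_col, tile_size):
    """Copy one tile_size x tile_size tile out of a row-major 2D list.

    tile_row and tile_col are tile indices (assumed), not element indices.
    The returned tile is a new list, standing in for a local-memory copy.
    """
    r0 = tile_row * tile_size
    c0 = tile_col * tile_size
    return [row[c0:c0 + tile_size] for row in source[r0:r0 + tile_size]]


def tile_sum(tile):
    """Placeholder operation performed on the copied tile."""
    return sum(sum(row) for row in tile)
```

Once the tile has been copied, the operation reads only the local copy, so each source element is fetched from the larger memory once per tile.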

Regarding claim 3, the combination of Yamamoto in view of Juffa discloses everything claimed as applied above (see claim 2), and Yamamoto further discloses wherein the one or more neural network operations include a convolution operation (Paragraphs [0053]-[0054], FIG. 2 is a block diagram showing an example of the arrangement of the hierarchical calculation processing apparatus in the CNN processing unit 63 of the first embodiment. The hierarchical calculation processing apparatus shown in FIG. 2 is used to execute hierarchical calculations shown in, for example, FIG. 3. In FIG. 3, a processing node indicates a block which executes processing for obtaining a convolution calculation result from a convolution calculation target image and convolution kernels … For example, the fourth processing node in FIG. 3 executes convolution calculations by applying convolution kernels having different coefficients to the outputs from the first to third processing nodes. Then, the fourth processing node adds the respective convolution calculation results, and executes nonlinear transformation to obtain a calculation result. Furthermore, the calculation result of the fourth processing node is input to the sixth and seventh processing nodes).

Regarding claim 5, the combination of Yamamoto in view of Juffa discloses everything claimed as applied above (see claim 2), and Yamamoto further discloses wherein copying the image data (FIGS. 1 and 2; paragraph [0149], the memory access control unit 110 generates physical addresses based on the ring counter values and offset address value sent from the ring buffer management unit 103. Furthermore, the memory access control unit 110 calculates addresses required to read out calculation target pixel data required for the convolution calculations in the calculation unit 106) is based, at least in part, on a virtual image matrix (Paragraph [0080], the address calculation parameter storage table 107 held by the network composition management unit 102 holds the following pieces of information for each processing node, as shown in FIGS. 8A, 8B and 8C; paragraph [0083], read counter value: This counter value is that having line-storing areas as units with reference to the start position of the ring buffer assigned to the memory 104 (see FIG. 7; examples of counter values are described in parentheses); paragraph [0085], offset address: An address (see FIGS. 5 and 7) indicating the start position of the ring buffer assigned to that processing node in the memory 104).

Regarding claim 6, Yamamoto discloses everything claimed as applied above (see claim 1), and Yamamoto discloses further comprising performing one or more neural network operations using the readout data (FIGS. 1 and 2; paragraph [0149], the memory access control unit 110 generates physical addresses based on the ring counter values and offset address value sent from the ring buffer management unit 103. Furthermore, the memory access control unit 110 calculates addresses required to read out calculation target pixel data required for the convolution calculations in the calculation unit 106).
However, Yamamoto does not specifically disclose populating an image tile based, at least in part, on the one or more data addresses and one or more virtual addresses included in a virtual image matrix, and performing operations using the image tile as an operand.
In the similar field of endeavor, Juffa discloses (Abstract, the present invention enables efficient matrix multiplication operations on parallel processing devices. One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations. Another embodiment is a second method for mapping CTAs to result tiles. Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations …) populating an image tile based (Col 5, lines 4-21, FIG. 2 illustrates the graphics adapter 102 of FIG. 1, according to one embodiment of the invention. As shown, the graphics adapter 102 includes a graphics processing unit ("GPU") 200 and a global memory ("GMEM") 202; Col 6, lines 13-44, a local memory ("LMEM") that may be included within each streaming multiprocessor; Col 14, lines 44-54, FIG. 9 illustrates a flowchart of method steps for allocating work among the threads of a CTA when performing a non-transposed copy operation or a result tile computation, according to one embodiment of the invention. For purposes of discussion only, it is assumed that one CTA executing on one of the streaming multiprocessors of the graphics processing unit 200 is either copying the elements of a 32x32 source tile 1000 stored in the GMEM 202, as illustrated in FIG. 10A, to local memory to create a 32x32 local memory tile 1002, as illustrated in FIG. 10B), at least in part, on the one or more data addresses and one or more virtual addresses included in a virtual image matrix (Col 14, lines 54-59, the tile elements are copied to the corresponding tile element positions in local memory tile 1002 using the GMEM-to-LMEM address mapping shown in FIGS. 10A and 10B), and performing operations using the image tile as an operand (Col 14, lines 59-62, for the result tile computation, the positions of the tile elements of result tile 1000 are first determined, and then the tile elements are computed according to the method of FIG. 7).
	Yamamoto and Juffa are analogous art because both pertain to the use of image data processing apparatuses. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, applying the matrix multiplication operations on a parallel processing device taught by Juffa to compute the result matrix on a tile-by-tile basis in a tiled matrix multiplication operation, as this could be used to reduce the number of times the memory is accessed by the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of Juffa to obtain the invention as specified in the claim.

Regarding claim 7, Yamamoto discloses everything claimed as applied above (see claim 1).
However, Yamamoto does not specifically disclose wherein the one or more data addresses are included in a parallel processing memory.
In the similar field of endeavor, Juffa discloses (Abstract, the present invention enables efficient matrix multiplication operations on parallel processing devices.  One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations.  Another embodiment is a second method for mapping CTAs to result tiles.  Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations …) wherein the one or more data addresses are included in a parallel processing memory (Col 5, lines 4-21, FIG. 2 illustrates the graphics adapter 102 of FIG. 1, according to one embodiment of the invention.  As shown, the graphics adapter 102 includes a graphics processing unit ("GPU") 200 and a global memory ("GMEM") 202; Col 5, lines 4-21, a global memory ("GMEM") 202; Col 14, lines 44-54, FIG. 9 illustrates a flowchart of method steps for allocating work among the threads of a CTA when performing a non-transposed copy operation or a result tile computation, according to one embodiment of the invention.  For purposes of discussion only, it is assumed that one CTA executing on one of the streaming multiprocessors of the graphics processing unit 200 is either copying the elements of a 32x32 source tile 1000 stored in the GMEM 202, as illustrated in FIG. 10A, to local memory to create a 32x32 local memory tile 1002, as illustrated in FIG. 10B … the tile elements are copied to the corresponding tile element positions in local memory tile 1002 using the GMEM-to-LMEM address mapping shown in FIGS. 10A and 10B; Col 7, lines 20-25, coalescing provides a way to reduce the overall cost of accessing the GMEM 202 by exploiting the wide interface of the GMEM 202 to perform a plurality of parallel memory operations).
Yamamoto and Juffa are analogous art because both pertain to the use of image data processing apparatuses. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, and 

Regarding claim 8, Yamamoto discloses everything claimed as applied above (see claim 1).
However, Yamamoto does not specifically disclose further comprising 
dividing a virtual image matrix into one or more image tiles, and 
processing each image tile included in the one or more image tiles using a different thread group, wherein processing each image tile includes accessing image data at one or more of the one or more data addresses.
In the similar field of endeavor, Juffa discloses (Abstract, the present invention enables efficient matrix multiplication operations on parallel processing devices.  One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations.  Another embodiment is a second method for mapping CTAs to result tiles.  Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations …) further comprising 
dividing a virtual image matrix into one or more image tiles (Col 5, lines 4-21, FIG. 2 illustrates the graphics adapter 102 of FIG. 1, according to one embodiment of the invention. As shown, the graphics adapter 102 includes a graphics processing unit ("GPU") 200 and a global memory ("GMEM") 202; Col 6, lines 13-44, a local memory ("LMEM") that may be included within each streaming multiprocessor; Col 8, lines 49-67 to Col 9, lines 1-6, FIG. 4A illustrates a flowchart of method steps for allocating work among a plurality of CTAs “cooperative thread arrays” executing within a GPU when performing a matrix multiplication operation, according to one embodiment of the invention … in step 402, the result matrix is divided into tiles (also referred to herein as "result tiles")), and 
processing each image tile included in the one or more image tiles using a different thread group (Col 9, lines 7-20, in step 404, a software process determines the size of the CTAs. As previously described herein, the CTA size is generally determined by the amount of hardware resources within the streaming multiprocessors available to the CTAs as well as by the size of the result tiles. For example, in the embodiment of FIG. 3B where the result tiles consist of 32x32 elements, additional processing efficiencies are achieved when the CTAs include 512 threads. The software process also defines the dimensions of the CTA grid. In the exemplary embodiment, where there are sixteen streaming multiprocessors in the GPU 200 and one CTA executing on each streaming multiprocessor, the CTA grid is defined as a four-by-four array of sixteen CTAs), wherein processing each image tile includes accessing image data at one or more of the one or more data addresses (Col 9, lines 21-67 to Col 10, lines 1-44, in step 406, a software process requests that the CTA creation logic 211 create a CTA for each position within the CTA grid; in step 408, for each CTA, a software process generates a set of tile positions within the result matrix that the CTA will traverse; in step 412, for each CTA, a software process selects a tile position from the set of tile positions generated for the CTA. In step 414, each CTA processes the result tile associated with the tile position selected for the CTA … Col 14, lines 44-62, FIG. 9 illustrates a flowchart of method steps for allocating work among the threads of a CTA when performing a non-transposed copy operation or a result tile computation, according to one embodiment of the invention. For purposes of discussion only, it is assumed that one CTA executing on one of the streaming multiprocessors of the graphics processing unit 200 is either copying the elements of a 32x32 source tile 1000 stored in the GMEM 202, as illustrated in FIG. 10A, to local memory to create a 32x32 local memory tile 1002, as illustrated in FIG. 10B … the tile elements are copied to the corresponding tile element positions in local memory tile 1002 using the GMEM-to-LMEM address mapping shown in FIGS. 10A and 10B), and performing operations on the image tile (Col 14, lines 59-62, for the result tile computation, the positions of the tile elements of result tile 1000 are first determined, and then the tile elements are computed according to the method of FIG. 7).
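Yamamoto and Juffa are analogous art because both pertain to the use of an image data processing apparatus. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, applying the matrix multiplication operations on a parallel processing device taught by Juffa to compute the result matrix on a tile-by-tile basis in a tiled matrix multiplication operation, as this could be used to reduce the number of times the memory is accessed by the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of Juffa to obtain the invention as specified in the claim.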

	Regarding claim 12, Yamamoto discloses everything claimed as applied above (see claim 9), and Yamamoto further discloses wherein the set of instructions, which if performed by the one or more processors (FIG. 1; paragraph [0052], reference numeral 69 denotes a ROM (Read Only Memory), which stores instructions that specify the operations of the CPU 68 and parameter data required for various calculations … reference numeral 68 denotes a CPU, which controls the operation of this apparatus as a whole), a number of parameters associated (Paragraph [0056], reference numeral 114 denotes a CPU bus access control unit, which is a bus interface required for the CPU 68 to access various registers and a memory 104 in the CNN processing unit 63.  For example, various setting data such as an address calculation parameter storage table 107 in a network composition management unit 102, weighting coefficient set 1205 (to be described later with reference to FIG. 10) in a calculation unit 106, and the like are written via that interface) with a first neural network operation included in one or more neural network operations to be performed by the one or more neural networks (Paragraphs [0053]-[0054], FIG. 2 is a block diagram showing an example of the arrangement of the hierarchical calculation processing apparatus in the CNN processing unit 63 of the first embodiment. The hierarchical calculation processing apparatus shown in FIG. 2 is used to execute hierarchical calculations shown in, for example, FIG. 3.  In FIG. 3, a processing node indicates a block which executes processing for obtaining a convolution calculation result from a convolution calculation target image and convolution kernels … For example, the fourth processing node in FIG. 3 executes convolution calculations by applying convolution kernels having different coefficients to the outputs from the first to third processing nodes.  Then, the fourth processing node adds the respective convolution calculation results, and executes nonlinear transformation to obtain a calculation result.  Furthermore, the calculation result of the fourth processing node is input to the sixth and seventh processing nodes).
	However, Yamamoto does not specifically disclose that the set of instructions further cause the one or more processors to at least partition a virtual image matrix into a set of image tiles and copy data stored at the one or more data addresses to one or more image tiles in the set of image tiles, wherein one or more dimensions of the virtual image matrix is determined based, at least in part, on a number of parameters associated with a first neural network operation included in one or more neural network operations to be performed by the one or more neural networks.
	In the similar field of endeavor, Juffa discloses (Abstract, the present invention enables efficient matrix multiplication operations on parallel processing devices.  One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations.  Another embodiment is a second method for mapping CTAs to result tiles.  Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations …) further cause the one or more processors to at least partition a virtual image matrix into a set of image tiles (Col 5, lines 4-21, FIG. 2 illustrates the graphics adapter 102 of FIG. 1, according to one embodiment of the invention.  As shown, the graphics adapter 102 includes a graphics processing unit ("GPU") 200 and a global memory ("GMEM") 202; Col 6, lines 13-44, a local memory ("LMEM") that may be included within each streaming multiprocessor; Col 8, lines 49-67 to Col 9, lines 1-6, FIG. 4A illustrates a flowchart of method steps for allocating work among a plurality of CTAs “cooperative thread arrays” executing within a GPU when performing a matrix multiplication operation, according to one embodiment of the invention … in step 402, the result matrix is divided into tiles (also referred to herein as "result tiles")) and copy data stored at the one or more data addresses to one or more image tiles in the set of image tiles (Col 9, lines 21-67 to Col 10, lines 1-44, in step 406, a software process requests that the CTA creation logic 211 create a CTA for each position within the CTA grid; in step 408, for each CTA, a software process generates a set of tile positions within the result matrix that the CTA will traverse; in step 412, for each CTA, a software process selects a tile position from the set of tile positions generated for the CTA.  
In step 414, each CTA processes the result tile associated with the tile position selected for the CTA … Col 14, lines 44-62, FIG. 9 illustrates a flowchart of method steps for allocating work among the threads of a CTA when performing a non-transposed copy operation or a result tile computation, according to one embodiment of the invention.  For purposes of discussion only, it is assumed that one CTA executing on one of the streaming multiprocessors of the graphics processing unit 200 is either copying the elements of a 32x32 source tile 1000 stored in the GMEM 202, as illustrated in FIG. 10A, to local memory to create a 32x32 local memory tile 1002, as illustrated in FIG. 10B … the tile elements are copied to the corresponding tile element positions in local memory tile 1002 using the GMEM-to-LMEM address mapping shown in FIGS. 10A and 10B), wherein one or more dimensions of the virtual image matrix is determined based, at least in part, on a number of parameters associated with a first Col 8, lines 49-67 to Col 9, lines 1-6, FIG. 4A illustrates a flowchart of method steps for allocating work among a plurality of CTAs executing within a GPU when performing a matrix multiplication operation, according to one embodiment of the invention.  Although the method steps are described with respect to a plurality of CTAs executing on a plurality of the streaming multiprocessors of the graphics processing unit 200 of FIG. 2 … The method for allocating work among a plurality of CTAs begins in step 400, where a software process, such as a software driver, defines the size of the tiles into which the result matrix will be divided.  As described above in conjunction with FIGS. 3A-3B, the tile size depends on several competing factors, such as the size of the local memory within each streaming multiprocessor and the known advantages of making each tile square and as large as possible.  
In step 402, the result matrix is divided into tiles (also referred to herein as "result tiles").  Persons skilled in the art will understand that when either the size of either dimension of the result matrix is not an integer multiple of the size of the tile in that same dimension, partial tiles result.  In one embodiment, the result matrix is partitioned such that the partial tiles are in the right-most column and bottom row of the matrix) Yamamoto discloses neural operations and neural networks).
Yamamoto and Juffa are analogous art because both pertain to the use of an image data processing apparatus. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, applying the matrix multiplication operations on a parallel processing device taught by Juffa to compute the result matrix on a tile-by-tile basis in a tiled matrix multiplication operation, as this could be used to reduce the number of times the memory is accessed by the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of Juffa to obtain the invention as specified in the claim.
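For illustration only, the tiling that Juffa is cited for in steps 400 and 402 (a result matrix divided into square tiles, with partial tiles falling in the right-most column and bottom row) can be sketched as follows. This is a minimal sketch and not code from either reference; the function name `partition_into_tiles` and the example matrix size are assumptions chosen for the example.

```python
# Hypothetical sketch of dividing a result matrix into tiles (Juffa, steps 400-402).
# The function name and the example dimensions are illustrative assumptions.

def partition_into_tiles(rows, cols, tile_size):
    """Return (row_start, col_start, height, width) for each tile, in row-major order.

    When a matrix dimension is not an integer multiple of tile_size, the
    partial tiles end up in the right-most column and/or bottom row.
    """
    tiles = []
    for row_start in range(0, rows, tile_size):
        for col_start in range(0, cols, tile_size):
            height = min(tile_size, rows - row_start)
            width = min(tile_size, cols - col_start)
            tiles.append((row_start, col_start, height, width))
    return tiles


# A 70x64 matrix with 32x32 tiles: two full tile rows plus one partial bottom row.
tiles = partition_into_tiles(70, 64, 32)
```

In this example the bottom row of tiles is only 6 elements high, which corresponds to the partial-tile case described in the cited passage.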

	Regarding claim 13, Yamamoto discloses everything claimed as applied above (see claim 9), and Yamamoto further disclose wherein the set of instructions, which if performed by the one or more processors (FIG. 1; paragraph [0052], reference numeral 69 denotes a ROM (Read Only Memory), which stores instructions that specify the operations of the CPU 68 and parameter data required for various calculations … reference numeral 68 denotes a CPU, which controls the operation of this apparatus as a whole), further cause the one or more processors to at least perform one or more neural network operations (Paragraph [0051], a CNN processing unit 63 is a feature detection processing unit including a hierarchical calculation processing apparatus.  Details of the CNN processing unit 63 will be described later with reference to FIG. 2) and select a set of virtual addresses included in a virtual image matrix (Paragraph [0080], the address calculation parameter storage table 107 held by the network composition management unit 102 holds the following pieces of information for each processing node, as shown in FIGS. 8A, 8B and 8C; paragraph [0083], read counter value: This counter value is that having line-storing areas as units with reference to the start position of the ring buffer assigned to the memory 104 (see FIG. 7; examples of counter values are described in parentheses); paragraph [0085], offset address: An address (see FIGS. 5 and 7) indicating the start position of the ring buffer assigned to that processing node in the memory 104), the one or more neural network operations include a first convolution operation (Paragraphs [0053]-[0054], FIG. 2 is a block diagram showing an example of the arrangement of the hierarchical calculation processing apparatus in the CNN processing unit 63 of the first embodiment. The hierarchical calculation processing apparatus shown in FIG. 2 is used to execute hierarchical calculations shown in, for example, FIG. 3.  In FIG. 
3, a processing node indicates a block which executes processing for obtaining a convolution calculation result from a convolution calculation target image and convolution kernels … For example, the fourth processing node in FIG. 3 executes convolution calculations by applying convolution kernels having different coefficients to the outputs from the first to third processing nodes.  Then, the fourth processing node adds the respective convolution calculation results, and executes nonlinear transformation to obtain a calculation result.  Furthermore, the calculation result of the fourth processing node is input to the sixth and seventh processing nodes), and performing the one or more neural network operations comprises: 
loading, into a first memory, a portion of the input data associated with the first convolution operation from one or more of the one or more data addresses using a subset of the set of virtual addresses (Paragraph [0149], the memory access control unit 110 generates physical addresses based on the ring counter values and offset address value sent from the ring buffer management unit 103. Furthermore, the memory access control unit 110 calculates addresses required to read out calculation target pixel data required for the convolution calculations in the calculation unit 106), performing the first convolution operation on the portion of the input data (Paragraph [0149], the convolution calculations in the calculation unit 106), and removing the portion of the data from the first memory (Paragraph [0153], the unit 110 transfers data output from the memory 104 to the calculation unit 106 upon reading, and transfers the calculation result output from the calculation unit 106 to the memory 104 upon writing).
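For illustration only, the address generation attributed to Yamamoto's memory access control unit 110 (a physical address derived from a ring counter value and an offset address) can be sketched as below. The formula, the function names, and the parameters `line_width` and `ring_lines` are assumptions made for this sketch; the cited paragraphs do not give this exact computation.

```python
# Hypothetical sketch of ring-buffer address generation (cf. Yamamoto, paragraph [0149]).
# The arithmetic below is an assumed, simplified model, not the reference's circuit.

def ring_buffer_address(offset_address, ring_counter, line_width, ring_lines, x):
    """Map (line counter, column x) to a physical address inside a ring buffer.

    offset_address: start position of the ring buffer in memory
    ring_counter:   line counter; wraps around after ring_lines lines
    line_width:     number of elements stored per line
    """
    line = ring_counter % ring_lines
    return offset_address + line * line_width + x


def window_addresses(offset_address, read_counter, line_width, ring_lines, x0, kernel_size):
    """Addresses of the kernel_size x kernel_size pixels needed for one convolution output."""
    addresses = []
    for ky in range(kernel_size):
        for kx in range(kernel_size):
            addresses.append(
                ring_buffer_address(offset_address, read_counter + ky,
                                    line_width, ring_lines, x0 + kx)
            )
    return addresses


# A 2x2 window starting on the last line of a 4-line ring wraps back to line 0.
addrs = window_addresses(0x1000, 3, 64, 4, 0, 2)
```

The wrap-around in the example shows why a line counter plus a fixed offset is enough to address a window that spans the end of the ring.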

	In the similar field of endeavor, Juffa discloses (Abstract, the present invention enables efficient matrix multiplication operations on parallel processing devices.  One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations.  Another embodiment is a second method for mapping CTAs to result tiles.  Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations …) wherein each virtual address included in the set of virtual addresses is mapped to a physical address associated with input data (Col 5, lines 4-21, FIG. 2 illustrates the graphics adapter 102 of FIG. 1, according to one embodiment of the invention.  As shown, the graphics adapter 102 includes a graphics processing unit ("GPU") 200 and a global memory ("GMEM") 202; Col 6, lines 13-44, a local memory ("LMEM") that may be included within each streaming multiprocessor; Col 14, lines 44-62, FIG. 9 illustrates a flowchart of method steps for allocating work among the threads of a CTA when performing a non-transposed copy operation or a result tile computation, according to one embodiment of the invention.  For purposes of discussion only, it is assumed that one CTA executing on one of the streaming multiprocessors of the graphics processing unit 200 is either copying the elements of a 32x32 source tile 1000 stored in the GMEM 202, as illustrated in FIG. 10A, to local memory to create a 32x32 local memory tile 1002, as illustrated in FIG. 10B … the tile elements are copied to the corresponding tile element positions in local memory tile 1002 using the GMEM-to-LMEM address mapping shown in FIGS. 10A and 10B).
	Yamamoto and Juffa are analogous art because both pertain to the use of an image data processing apparatus. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, applying the matrix multiplication operations on a parallel processing device taught by Juffa to compute the result matrix on a tile-by-tile basis in a tiled matrix multiplication operation, as this could be used to reduce the number of times the memory is accessed by the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of Juffa to obtain the invention as specified in the claim.

	Regarding claim 14, the combination of Yamamoto in view of Juffa discloses everything claimed as applied above (see claim 13).
However, Yamamoto does not specifically disclose wherein the first memory comprises a shared memory, and the one or more physical addresses are included in a second memory.
In the similar field of endeavor, Juffa discloses wherein the first memory comprises a shared memory (Col 5, lines 4-21, FIG. 2 illustrates the graphics adapter 102 of FIG. 1, according to one embodiment of the invention.  As shown, the graphics adapter 102 includes a graphics processing unit ("GPU") 200; Col 6, lines 13-44, a local memory ("LMEM") that may be included within each streaming multiprocessor. Each LMEM is a small (e.g., 8 KB), fast (e.g., single clock cycle access time) shared memory), and the one or more physical addresses are included in a second memory (Col 5, lines 4-21, a global memory ("GMEM") 202; Col 14, lines 44-54, FIG. 9 illustrates a flowchart of method steps for allocating work among the threads of a CTA when performing a non-transposed copy operation or a result tile computation, according to one embodiment of the invention.  For purposes of discussion only, it is assumed that one CTA executing on one of the streaming multiprocessors of the graphics processing unit 200 is either copying the elements of a 32x32 source tile 1000 stored in the GMEM 202, as illustrated in FIG. 10A, to local memory to create a 32x32 local memory tile 1002, as illustrated in FIG. 10B … the tile elements are copied to the corresponding tile element positions in local memory tile 1002 using the GMEM-to-LMEM address mapping shown in FIGS. 10A and 10B).
Yamamoto and Juffa are analogous art because both pertain to the use of an image data processing apparatus. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, applying the matrix multiplication operations on a parallel processing device taught by Juffa to compute the result matrix on a tile-by-tile basis in a tiled matrix multiplication operation, as this could be used to reduce the number of times the memory is accessed by the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of Juffa to obtain the invention as specified in the claim.

Regarding claim 15, Yamamoto discloses everything claimed as applied above (see claim 9), and Yamamoto further discloses wherein the instructions, which if performed by the one or more processors (FIG. 1; paragraph [0052], reference numeral 69 denotes a ROM (Read Only Memory), which stores instructions that specify the operations of the CPU 68 and parameter data required for various calculations … reference numeral 68 denotes a CPU, which controls the operation of this apparatus as a whole), further cause the one or more processors to at least perform one or more neural network operations (Paragraph [0051], a CNN processing unit 63 is a feature detection processing unit including a hierarchical calculation processing apparatus.  Details of the CNN processing unit 63 will be described later with reference to FIG. 2).
However, Yamamoto does not specifically disclose perform one or more neural network operations that include: 
dividing a virtual image matrix into one or more image tiles; 
copying input data into the one or more image tiles based, at least in part, on the one or more data addresses, and 
processing each image tile included in the one or more image tiles using a different thread group.
In the similar field of endeavor, Juffa discloses (Abstract, the present invention enables efficient matrix multiplication operations on parallel processing devices.  One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations.  Another embodiment is a second method for mapping CTAs to result tiles.  Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations …) perform one or more neural network operations that include: 
dividing a virtual image matrix into one or more image tiles (Col 5, lines 4-21, FIG. 2 illustrates the graphics adapter 102 of FIG. 1, according to one embodiment of the invention.  As shown, the graphics adapter 102 includes a graphics processing unit ("GPU") 200 and a global memory ("GMEM") 202; Col 6, lines 13-44, a local memory ("LMEM") that may be included within each streaming multiprocessor; Col 8, lines 49-67 to Col 9, lines 1-6, FIG. 4A illustrates a flowchart of method steps for allocating work among a plurality of CTAs “cooperative thread arrays” executing within a GPU when performing a matrix multiplication operation, according to one embodiment of the invention … in step 402, the result matrix is divided into tiles (also referred to herein as "result tiles")); 
copying input data into the one or more image tiles based, at least in part, on the one or more data addresses (Col 9, lines 21-67 to Col 10, lines 1-44, in step 406, a software process requests that the CTA creation logic 211 create a CTA for each position within the CTA grid; in step 408, for each CTA, a software process generates a set of tile positions within the result matrix that the CTA will traverse; in step 412, for each CTA, a software process selects a tile position from the set of tile positions generated for the CTA.  In step 414, each CTA processes the result tile associated with the tile position selected for the CTA … Col 14, lines 44-62, FIG. 9 illustrates a flowchart of method steps for allocating work among the threads of a CTA when performing a non-transposed copy operation or a result tile computation, according to one embodiment of the invention.  For purposes of discussion only, it is assumed that one CTA executing on one of the streaming multiprocessors of the graphics processing unit 200 is either copying the elements of a 32x32 source tile 1000 stored in the GMEM 202, as illustrated in FIG. 10A, to local memory to create a 32x32 local memory tile 1002, as illustrated in FIG. 10B … the tile elements are copied to the corresponding tile element positions in local memory tile 1002 using the GMEM-to-LMEM address mapping shown in FIGS. 10A and 10B), and 
processing each image tile included in the one or more image tiles using a different thread group (Col 9, lines 7-20, in step 404, a software process determines the size of the CTAs.  As previously described herein, the CTA size is generally determined by the amount of hardware resources within the streaming multiprocessors available to the CTAs as well as by the size of the result tiles. For example, in the embodiment of FIG. 3B where the result tiles consist of 32x32 elements, additional processing efficiencies are achieved when the CTAs include 512 threads.  The software process also defines the dimensions of the CTA grid.  In the exemplary embodiment, where there are sixteen streaming multiprocessors in the GPU 200 and one CTA executing on each streaming multiprocessor, the CTA grid is defined as a four-by-four array of sixteen CTAs; Col 14, lines 59-62, for the result tile computation, the positions of the tile elements of result tile 1000 are first determined, and then the tile elements are computed according to the method of FIG. 7).
Yamamoto and Juffa are analogous art because both pertain to the use of an image data processing apparatus. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, applying the matrix multiplication operations on a parallel processing device taught by Juffa to compute the result matrix on a tile-by-tile basis in a tiled matrix multiplication operation, as this could be used to reduce the number of times the memory is accessed by the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of Juffa to obtain the invention as specified in the claim.
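For illustration only, the copy operation cited from Juffa (one CTA of 512 threads copying a 32x32 source tile from global memory into a local memory tile) can be sketched as below. The thread-to-element mapping used here is an assumption for the sketch; the actual GMEM-to-LMEM mapping of FIGS. 10A and 10B is not reproduced in this action, and the function name `copy_tile_to_local` is hypothetical.

```python
# Hypothetical sketch of a CTA copying one 32x32 tile from global to local memory.
# The per-thread index mapping is an assumed, simple mapping, not Juffa's FIG. 10A/10B mapping.

TILE = 32       # tile is TILE x TILE elements
THREADS = 512   # threads per CTA, as in the cited embodiment


def copy_tile_to_local(gmem, matrix_cols, tile_row, tile_col):
    """Copy the tile at (tile_row, tile_col) of a row-major matrix into a flat local tile."""
    local = [0] * (TILE * TILE)
    elems_per_thread = (TILE * TILE) // THREADS  # 1024 / 512 = 2 elements per thread
    # On a GPU the threads run in parallel; here they are simulated one after another.
    for thread_id in range(THREADS):
        for k in range(elems_per_thread):
            local_index = thread_id + k * THREADS
            r, c = divmod(local_index, TILE)
            global_index = (tile_row * TILE + r) * matrix_cols + (tile_col * TILE + c)
            local[local_index] = gmem[global_index]
    return local


# A 64x64 matrix whose element value equals its own row-major index.
gmem = list(range(64 * 64))
local_tile = copy_tile_to_local(gmem, 64, 1, 1)
```

Each simulated thread handles two elements, so the 512 threads together cover all 1024 elements of the tile exactly once.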

	Regarding claim 17, Yamamoto discloses everything claimed as applied above (see claim 16), and Yamamoto further discloses wherein the one or more circuits (Paragraph [0051]; FIG. 1 is a block diagram showing an example of the arrangement of a pattern detection apparatus, which comprises a hierarchical calculation processing circuit according to the first embodiment) are to perform one or more neural network operations (Paragraph [0051], a CNN processing unit 63 is a feature detection processing unit including a hierarchical calculation processing apparatus.  Details of the CNN processing unit 63 will be described later with reference to FIG. 2) on the one or more circuits (Paragraph [0056], a calculation unit 106) and data accessed using the one or more data addresses (Paragraph [0149], the memory access control unit 110 generates physical addresses based on the ring counter values and offset address value sent from the ring buffer management unit 103. Furthermore, the memory access control unit 110 calculates addresses required to read out calculation target pixel data required for the convolution calculations in the calculation unit 106).
However, Yamamoto does not specifically disclose perform one or more operations using one or more thread groups executing on the one or more circuits.
In the similar field of endeavor, Juffa discloses (Abstract, the present invention enables efficient matrix multiplication operations on parallel processing devices.  One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations.  Another embodiment is a second method for mapping CTAs to result tiles.  Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations …) perform one or more operations using one or more thread groups (Col 14, lines 44-62, FIG. 9 illustrates a flowchart of method steps for allocating work among the threads of a CTA when performing a non-transposed copy operation or a result tile computation, according to one embodiment of the invention.  For purposes of discussion only, it is assumed that one CTA executing on one of the streaming multiprocessors of the graphics processing unit 200 is either copying the elements of a 32x32 source tile 1000 stored in the GMEM 202, as illustrated in FIG. 10A, to local memory to create a 32x32 local memory tile 1002, as illustrated in FIG. 10B … the tile elements are copied to the corresponding tile element positions in local memory tile 1002 using the GMEM-to-LMEM address mapping shown in FIGS. 10A and 10B … for the result tile computation, the positions of the tile elements of result tile 1000 are first determined, and then the tile elements are computed according to the method of FIG. 7) executing on the one or more circuits (FIGS. 1 and 2; Col 7, lines 26-33, although the graphics adapter 102 may contain additional elements, such as circuitry to generate an analog or digital video signal for display on a video display device, such additional elements were omitted for the sake of clarity.  
The following sets forth how work is distributed among the different threads running on the GPU 200 when matrix multiplication operations, copy and transpose operations, or copy operations are performed by the GPU 200).
Yamamoto and Juffa are analogous art because both pertain to the use of an image data processing apparatus. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, applying the matrix multiplication operations on a parallel processing device taught by Juffa to compute the result matrix on a tile-by-tile basis in a tiled matrix multiplication operation, as this could be used to reduce the number of times the memory is accessed by the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of Juffa to obtain the invention as specified in the claim.

	Regarding claim 18, the combination of Yamamoto in view of Juffa discloses everything claimed as applied above (see claim 17), and Yamamoto further disclose (Paragraph [0024], designation means for designating a processing node, which is to execute calculation processing, of the plurality of processing nodes) wherein a first node is configured to load at least a subset of data representing one or more input images using one or more of the one or more data addresses (FIGS. 1 and 2; paragraphs [0080]-[0086], the address calculation parameter storage table 107 held by the network composition management unit 102 holds the following pieces of information for each processing node, as shown in FIGS. 8A, 8B and 8C; read counter value: This counter value is that having line-storing areas as units with reference to the start position of the ring buffer assigned to the memory 104; the first processing node includes read counter value of “first processing node calculation read counter value (WRA0_1) of zeroth processing node ring buffer” ), and a second node is configured to perform at least one neural network operation of the one or more neural network operations on the at least the subset of the data (As shown in FIG. 8B; the fourth processing node includes read counter value of “fourth processing node calculation read counter value (MRA1_4) in first processing node assigned ring buffer”, “fourth processing node calculation read counter value (MRA2_4) in second processing node assigned ring buffer” and “fourth processing node calculation read counter value (MRA3_4) in third processing node assigned ring buffer”).
However, Yamamoto does not specifically disclose a first thread group and a second thread group.
In the similar field of endeavor, Juffa discloses (Abstract, the present invention enables efficient matrix multiplication operations on parallel processing devices.  One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations.  Another embodiment is a second method for mapping CTAs to result tiles.  Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations … Col 8, lines 49-67, FIG. 4A illustrates a flowchart of method steps for allocating work among a plurality of CTAs executing within a GPU when performing a matrix multiplication operation, according to one embodiment of the invention.  Although the method steps are described with respect to a plurality of CTAs executing on a plurality of the streaming multiprocessors of the graphics processing unit 200 of FIG. 2 … The method for allocating work among a plurality of CTAs begins in step 400, where a software process, such as a software driver, defines the size of the tiles into which the result matrix will be divided) a first thread group (Col 9, lines 40-48, in step 410, the software process determines, for each CTA, that the CTA has not exhausted its respective set of tile positions, then the method proceeds to step 412; in step 412, for each CTA, a software process selects a tile position from the set of tile positions generated for the CTA.  In step 414, each CTA processes the result tile associated with the tile position selected for the CTA) and a second thread group (Col 9, lines 40-48, for each CTA, the method returns to step 410 after the CTA processes its respective result tile. Thus, second CTA is selected).
Yamamoto and Juffa are analogous art because both pertain to the use of an image data processing apparatus. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, applying the matrix multiplication operations on a parallel processing device taught by Juffa to compute the result matrix on a tile-by-tile basis in a tiled matrix multiplication operation, as this could be used to reduce the number of times the memory is accessed by the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of Juffa to obtain the invention as specified in the claim.
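For illustration only, the loop cited from Juffa for steps 408 through 414 (each CTA generates a set of tile positions, then repeatedly selects and processes a position until its set is exhausted) can be sketched as below. The strided assignment of tile positions to CTAs and the function names are assumptions for this sketch; the cited passages do not specify how each CTA's set of positions is generated.

```python
# Hypothetical sketch of distributing result tiles over a grid of CTAs (Juffa, steps 408-414).
# The strided assignment of tile positions is an assumption, not taken from the reference.

def tile_positions_for_cta(cta_row, cta_col, grid_dim, tiles_down, tiles_across):
    """Step 408 (assumed strided scheme): tile positions this CTA will traverse."""
    return [
        (r, c)
        for r in range(cta_row, tiles_down, grid_dim)
        for c in range(cta_col, tiles_across, grid_dim)
    ]


def run_ctas(grid_dim, tiles_down, tiles_across):
    """Simulate every CTA in a grid_dim x grid_dim grid; return {tile position: CTA}."""
    processed = {}
    for cta_row in range(grid_dim):
        for cta_col in range(grid_dim):
            positions = tile_positions_for_cta(cta_row, cta_col, grid_dim,
                                               tiles_down, tiles_across)
            while positions:                         # step 410: set not yet exhausted
                position = positions.pop(0)          # step 412: select a tile position
                processed[position] = (cta_row, cta_col)  # step 414: process the tile
    return processed


# A four-by-four grid of sixteen CTAs covering an 8x8 arrangement of result tiles.
processed = run_ctas(4, 8, 8)
```

Under this assumed scheme every tile is handled by exactly one CTA, and different CTAs (thread groups) handle different tiles.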

	Regarding claim 20, Yamamoto discloses everything claimed as applied above (see claim 16), and Yamamoto further disclose wherein the one or more circuits (Paragraph [0051]; FIG. 1 is a block diagram showing an example of the arrangement of a pattern detection apparatus, which comprises a hierarchical calculation processing circuit according to the first embodiment) are to perform one or more neural network operations (Paragraph [0051], a CNN processing unit 63 is a feature detection processing unit including a hierarchical calculation processing apparatus.  Details of the CNN processing unit 63 will be described later with reference to FIG. 2) on data representing one or more input images (Paragraph [0051], the pattern detection apparatus has a function of detecting a specific object (image pattern) in image data …A CNN processing unit 63 is a feature detection processing unit including a hierarchical calculation processing apparatus.  Details of the CNN processing unit 63 will be described later with reference to FIG. 2), wherein the data is accessed from memory using the one or more data addresses and the one or more neural network operations (Paragraph [0149], the memory access control unit 110 generates physical addresses based on the ring counter values and offset address value sent from the ring buffer management unit 103. Furthermore, the memory access control unit 110 calculates addresses required to read out calculation target pixel data required for the convolution calculations in the calculation unit 106). 
	However, Yamamoto does not specifically disclose wherein the data is accessed from memory using the one or more data addresses and the one or more operations based, at least in part, on the data and one or more filter tiles. 
	In the similar field of endeavor, Juffa discloses (Abstract, the present invention enables efficient matrix multiplication operations on parallel processing devices.  One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations.  Another embodiment is a second method for mapping CTAs to result tiles.  Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations …) wherein the data is accessed from memory using the one or more data addresses and the one or more operations (Col 5, lines 4-21, FIG. 2 illustrates the graphics adapter 102 of FIG. 1, according to one embodiment of the invention.  As shown, the graphics adapter 102 includes a graphics processing unit ("GPU") 200 and a global memory ("GMEM") 202; Col 6, lines 13-44, a local memory ("LMEM") that may be included within each streaming multiprocessor; Col 14, lines 44-62, FIG. 9 illustrates a flowchart of method steps for allocating work among the threads of a CTA when performing a non-transposed copy operation or a result tile computation, according to one embodiment of the invention.  For purposes of discussion only, it is assumed that one CTA executing on one of the streaming multiprocessors of the graphics processing unit 200 is either copying the elements of a 32x32 source tile 1000 stored in the GMEM 202, as illustrated in FIG. 10A, to local memory to create a 32x32 local memory tile 1002, as illustrated in FIG. 10B … the tile elements are copied to the corresponding tile element positions in local memory tile 1002 using the GMEM-to-LMEM address mapping shown in FIGS. 10A and 10B) based, at least in part, on the data and one or more filter tiles (Col 14, lines 59-62, for the result tile computation, the positions of the tile elements of result tile 1000 are first determined, and then the tile elements are computed according to the method of FIG. 7; Col 8, lines 49-67 to Col 9, lines 1-6, FIG. 4A illustrates a flowchart of method steps for allocating work among a plurality of CTAs “cooperative thread arrays” executing within a GPU when performing a matrix multiplication operation, according to one embodiment of the invention …in step 402, the result matrix is divided into tiles (also referred to herein as "result tiles"). Thus, "result tiles" can be interpreted as filter tiles; the graphics processing unit copies the source data based on the "result tiles" for performing a result tile computation).
	Yamamoto and Juffa are analogous art because both pertain to the use of an image data processing apparatus. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, and to apply the matrix multiplication operations on a parallel processing device taught by Juffa so that the result matrix is computed on a tile-by-tile basis in a tiled matrix multiplication operation, as this could be used to reduce the number of times the memory is accessed by the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of Juffa to obtain the invention as specified in the claim.

	Regarding claim 22, Yamamoto discloses everything claimed as applied above (see claim 16), and Yamamoto discloses wherein the one or more data offset values include one or more memory offset values (FIG. 2; paragraph [0072], the memory 104 is divided into partial areas assigned to respective processing nodes, and each partial area is used as a ring buffer.  FIG. 5 illustrates a state in which the memory 104 is divided into the partial areas upon execution of the hierarchical calculations shown in FIG. 3. FIG. 5 shows offset addresses), and the one or more circuits are to generate one or more start addresses based (Paragraph [0080], the address calculation parameter storage table 107 held by the network composition management unit 102 holds the following pieces of information for each processing node, as shown in FIGS. 8A, 8B and 8C; paragraph [0085], read counter value: This counter value is that having line-storing areas as units with reference to the start position of the ring buffer assigned to the memory 104), at least in part, on a base address of image data to be used as input (FIG. 1; paragraph [0051], the pattern detection apparatus has a function of detecting a specific object (image pattern) in image data) by the one or more neural networks (Paragraph [0051], a CNN processing unit 63 is a feature detection processing unit including a hierarchical calculation processing apparatus.  Details of the CNN processing unit 63 will be described later with reference to FIG. 2), and the one or more circuits are to generate the one or more data addresses based, at least in part, on the one or more start addresses and one or more of the one or more memory offset values (Paragraph [0151], the memory access control unit 110 calculates the start addresses of respective line-storing areas of the ring buffer based on the ring counter value and offset address value.  Note that the horizontal width of a calculation target image is set in advance.  Furthermore, the memory access control unit 110 calculates addresses required to read out pixels required for the convolution calculations from each line-storing area using the start address of that line-storing area).
	However, Yamamoto does not specifically disclose the one or more circuits are to generate one or more start addresses based, at least in part, on a base address of 
	In the similar field of endeavor, Juffa discloses (Abstract, the present invention enables efficient matrix multiplication operations on parallel processing devices.  One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations.  Another embodiment is a second method for mapping CTAs to result tiles.  Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations …) the one or more circuits are to generate one or more start addresses (Col 14, lines 44-62, FIG. 9 illustrates a flowchart of method steps for allocating work among the threads of a CTA when performing a non-transposed copy operation or a result tile computation, according to one embodiment of the invention.  For purposes of discussion only, it is assumed that one CTA executing on one of the streaming multiprocessors of the graphics processing unit 200 is either copying the elements of a 32x32 source tile 1000 stored in the GMEM 202, as illustrated in FIG. 10A, to local memory to create a 32x32 local memory tile 1002, as illustrated in FIG. 10B … the tile elements are copied to the corresponding tile element positions in local memory tile 1002 using the GMEM-to-LMEM address mapping shown in FIGS. 10A and 10B … for the result tile computation, the positions of the tile elements of result tile 1000 are first determined, and then the tile elements are computed according to the method of FIG. 7) based, at least in part, on a base address (FIG. 4B illustrates a flowchart of method steps for generating a set of tile positions within the result matrix; Col 10, lines 5-14, position 504 represents the x-position of CTA 501 within the CTA grid 502 when the upper left corner of the CTA grid 502 is aligned with the upper left corner of the result matrix 500. Thus, position 504 represents a base address corresponding to the input image in order to generate the GMEM-to-LMEM address mapping) and one or more of the one or more column offset values (Col 10, lines 14-22, each of positions 506 and 514 is offset from position 504 in the x-dimension by an integer multiple of the x-dimension step size. Thus, the x-dimension step size is a column offset value; Col 10, lines 41-44, in step 428, a software process generates a set of tile positions within the result matrix that the CTA will traverse by combining each of the x-coordinates computed in step 424 with each of the y-coordinates computed in step 426).
	Yamamoto and Juffa are analogous art because both pertain to the use of an image data processing apparatus. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, and to apply the matrix multiplication operations on a parallel processing device taught by Juffa so that the result matrix is computed on a tile-by-tile basis in a tiled matrix multiplication operation, as this could be used to reduce the number of times the memory is accessed by the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of Juffa to obtain the invention as specified in the claim.
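
For illustration only, the address arithmetic discussed in the citations above (a start address derived from an offset address and a ring counter value, and data addresses derived from a start address plus offsets that are integer multiples of a step size) can be sketched as follows. This is a simplified sketch and is not code from Yamamoto or Juffa: the function names, the row-major layout, and the parameters offset_address, ring_counter, line_width and x_step are assumptions chosen for the example.

```python
# Illustrative sketch only; names and layout are assumptions, not taken from the references.

def line_start_address(offset_address, ring_counter, line_width):
    """Start address of one line-storing area, assuming lines are stored back to back."""
    return offset_address + ring_counter * line_width


def column_offsets(x_step, count):
    """Offset sequence formed from integer multiples of a step size."""
    return [i * x_step for i in range(count)]


def data_addresses(start_address, offsets):
    """Data addresses obtained by adding each offset to the start address."""
    return [start_address + off for off in offsets]


if __name__ == "__main__":
    start = line_start_address(offset_address=1000, ring_counter=2, line_width=64)
    print(data_addresses(start, column_offsets(x_step=4, count=3)))
```

The sketch keeps the start-address step and the offset step separate so that each corresponds to one of the two cited calculations.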

Regarding claim 23, Yamamoto discloses everything claimed as applied above (see claim 16).
However, Yamamoto does not specifically disclose wherein the one or more circuits are to copy image data stored at the one or more data addresses to an image tile, and perform a matrix multiplication operation using the image tile as an operand.  
	In the similar field of endeavor, Juffa discloses (Abstract, the present invention enables efficient matrix multiplication operations on parallel processing devices.  One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations.  Another embodiment is a second method for mapping CTAs to result tiles.  Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations …) wherein the one or more circuits are to copy image data stored at the one or more data addresses to an image tile (Col 5, lines 4-21, FIG. 2 illustrates the graphics adapter 102 of FIG. 1, according to one embodiment of the invention.  As shown, the graphics adapter 102 includes a graphics processing unit ("GPU") 200 and a global memory ("GMEM") 202; Col 6, lines 13-44, a local memory ("LMEM") that may be included within each streaming multiprocessor; Col 8, lines 49-67 to Col 9, lines 1-6, FIG. 4A illustrates a flowchart of method steps for allocating work among a plurality of CTAs “cooperative thread arrays” executing within a GPU when performing a matrix multiplication operation, according to one embodiment of the invention …in step 402, the result matrix is divided into tiles (also referred to herein as "result tiles"); Col 14, lines 44-62, FIG. 9 illustrates a flowchart of method steps for allocating work among the threads of a CTA when performing a non-transposed copy operation or a result tile computation, according to one embodiment of the invention.  For purposes of discussion only, it is assumed that one CTA executing on one of the streaming multiprocessors of the graphics processing unit 200 is either copying the elements of a 32x32 source tile 1000 stored in the GMEM 202, as illustrated in FIG. 10A, to local memory to create a 32x32 local memory tile 1002, as illustrated in FIG. 10B … the tile elements are copied to the corresponding tile element positions in local memory tile 1002 using the GMEM-to-LMEM address mapping shown in FIGS. 10A and 10B), and perform a matrix multiplication operation using the image tile as an operand (FIGS. 1 and 2; Col 7, lines 26-33, although the graphics adapter 102 may contain additional elements, such as circuitry to generate an analog or digital video signal for display on a video display device, such additional elements were omitted for the sake of clarity.  The following sets forth how work is distributed among the different threads running on the GPU 200 when matrix multiplication operations, copy and transpose operations, or copy operations are performed by the GPU 200; Col 9, lines 7-20, in step 404, a software process determines the size of the CTAs.  As previously described herein, the CTA size is generally determined by the amount of hardware resources within the streaming multiprocessors available to the CTAs as well as by the size of the result tiles. For example, in the embodiment of FIG. 3B where the result tiles consist of 32x32 elements, additional processing efficiencies are achieved when the CTAs include 512 threads.  The software process also defines the dimensions of the CTA grid.  In the exemplary embodiment, where there are sixteen streaming multiprocessors in the GPU 200 and one CTA executing on each streaming multiprocessor, the CTA grid is defined as a four-by-four array of sixteen CTAs; Col 14, lines 59-62, for the result tile computation, the positions of the tile elements of result tile 1000 are first determined, and then the tile elements are computed according to the method of FIG. 7).
	Yamamoto and Juffa are analogous art because both pertain to the use of an image data processing apparatus. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, and to apply the matrix multiplication operations on a parallel processing device taught by Juffa so that the result matrix is computed on a tile-by-tile basis in a tiled matrix multiplication operation, as this could be used to reduce the number of times the memory is accessed by the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of Juffa to obtain the invention as specified in the claim.
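
For illustration only, the general technique of copying source data into a tile and then using that tile as an operand of a matrix multiplication can be sketched as follows. This is a minimal single-threaded sketch under stated assumptions, not the method of Juffa (which distributes the work across CTAs and threads on a GPU): the function names copy_tile and tiled_matmul, the list-of-rows representation, and the zero-padding at matrix edges are all assumptions made for the example.

```python
# Illustrative sketch only; a sequential stand-in for a GPU tile copy and tile computation.

def copy_tile(src, row0, col0, size):
    """Copy a size x size tile out of src (a list of rows), zero-padding past the edges."""
    rows, cols = len(src), len(src[0])
    return [
        [src[r][c] if r < rows and c < cols else 0 for c in range(col0, col0 + size)]
        for r in range(row0, row0 + size)
    ]


def tiled_matmul(a, b, size):
    """Multiply a (n x k) by b (k x m), accumulating each result tile from copied source tiles."""
    n, k, m = len(a), len(b), len(b[0])
    result = [[0] * m for _ in range(n)]
    for i0 in range(0, n, size):          # result tile row
        for j0 in range(0, m, size):      # result tile column
            for k0 in range(0, k, size):  # walk the shared dimension tile by tile
                a_tile = copy_tile(a, i0, k0, size)
                b_tile = copy_tile(b, k0, j0, size)
                for i in range(min(size, n - i0)):
                    for j in range(min(size, m - j0)):
                        result[i0 + i][j0 + j] += sum(
                            a_tile[i][t] * b_tile[t][j] for t in range(size)
                        )
    return result


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(tiled_matmul(a, b, 2))
```

Each source element is read from the full matrix once per tile copy and then reused from the smaller tile, which is the memory-access reduction referred to in the rationale above.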

	Regarding claim 24, Yamamoto discloses everything claimed as applied above (see claim 16), and Yamamoto further discloses wherein the one or more data offset values include one or more memory offset values (FIG. 2; paragraph [0072], the memory 104 is divided into partial areas assigned to respective processing nodes, and each partial area is used as a ring buffer.  FIG. 5 illustrates a state in which the memory 104 is divided into the partial areas upon execution of the hierarchical calculations shown in FIG. 3. FIG. 5 shows memory offset addresses).
However, Yamamoto does not specifically disclose one or more memory offset values included in an offset sequence based, at least in part, on one or more filter parameters and one or more image parameters.
	In the similar field of endeavor, Juffa discloses (Abstract, the present invention enables efficient matrix multiplication operations on parallel processing devices.  One embodiment is a method for mapping CTAs to result matrix tiles for matrix multiplication operations.  Another embodiment is a second method for mapping CTAs to result tiles.  Yet other embodiments are methods for mapping the individual threads of a CTA to the elements of a tile for result tile computations, source tile copy operations, and source tile copy and transpose operations …) one or more memory offset values included in an offset sequence based (FIG. 4B illustrates a flowchart of method steps for generating a set of tile positions within the result matrix; Col 10, lines 5-22, in step 424, a software process determines a set of x-coordinates that the CTA will traverse within the result matrix based on the CTA's x-position within the CTA grid, the x-dimension step size and the width of the result matrix.  Referring again to FIG. 5A, a partial set of x-coordinates that CTA 501 will traverse within the result matrix 500 is shown as positions 504, 506 and 508 within the result matrix 500.  Position 504 represents the x-position of CTA 501 within the CTA grid 502 when the upper left corner of the CTA grid 502 is aligned with the upper left corner of the result matrix 500.  Each of positions 506 and 514 is offset from position 504 in the x-dimension by an integer multiple of the x-dimension step size. Thus, the memory offset sequence is generated by an integer multiple of the x-dimension step size), at least in part, on one or more filter parameters (Col 9, lines 55-67, in the exemplary embodiment described above in FIG. 4A, the supertile is configured as a four-by-four array of tiles.  In step 422, a software process determines an x-dimension step size and a y-dimension step size for the CTAs based on the supertile size. Thus, the four-by-four array of tiles defines a filter parameter for an area of the matrix) and one or more image parameters (Col 14, lines 44-59, FIG. 9 illustrates a flowchart of method steps for allocating work among the threads of a CTA when performing a non-transposed copy operation or a result tile computation … copying the elements of a 32x32 source tile 1000 stored in the GMEM 202, as illustrated in FIG. 10A, to local memory to create a 32x32 local memory tile 1002, as illustrated in FIG. 10B. Thus, the 32x32 source tile provides the input image parameter to create a 32x32 local memory tile).
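	Yamamoto and Juffa are analogous art because both pertain to the use of an image data processing apparatus. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, and to apply the matrix multiplication operations on a parallel processing device taught by Juffa so that the result matrix is computed on a tile-by-tile basis in a tiled matrix multiplication operation, as this could be used to reduce the number of times the memory is accessed by the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of Juffa to obtain the invention as specified in the claim.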

Claims 4 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Yamamoto et al (U.S. Patent Application Publication 2010/0215253 A1) in view of Juffa et al (U.S. Patent No. 7,912,889 B1), and further in view of LOEWENSTEIN (U.S. Patent Application Publication 2014/0279894 A1).

	Regarding claim 4, the combination of Yamamoto in view of Juffa discloses everything claimed as applied above (see claim 2, Juffa discloses “Col 14, lines 44-62, the GMEM-to-LMEM address mapping shown in FIGS. 10A and 10B”).
	However, the combination of Yamamoto in view of Juffa does not specifically disclose wherein the image tile includes at least two destination addresses that correspond to a first physical address that is one of the one or more data addresses.
	In the similar field of endeavor, LOEWENSTEIN discloses wherein the image tile includes at least two destination addresses that correspond to a first physical address that is one of the one or more data addresses (FIG. 1; paragraph [0024], system 100 comprises node 1 102(1), node 2 102(2), and node 3 102(3); paragraph [0026], each node 102 may comprise one or more processors 106, a main memory 108, and a storage 112; paragraph [0027], the main memory 108 of a node 102 comprises a plurality of memory locations.  For purposes of the present invention, a memory location may be of any desired size.  For example, a memory location may be as small as a single data word or as large as a page or larger.  A memory location may be accessed using a physical address.  This physical address may be mapped to one or more virtual addresses by way of an address translation table).
	It is noted that LOEWENSTEIN does not specifically disclose “the image tile” corresponding to the virtual address and the physical address. However, the teachings of LOEWENSTEIN describe the memory mapping between the virtual addresses and the physical addresses by using an address translation table. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the combination of Yamamoto in view of Juffa by adding the address translation table taught by LOEWENSTEIN to generate the memory address mapping and map one physical address to one or more virtual addresses for the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of LOEWENSTEIN to obtain the invention as specified in the claim.

	Regarding claim 11, the combination of Yamamoto in view of Juffa discloses everything claimed as applied above (see claim 10).

In the similar field of endeavor, Juffa discloses wherein the set of instructions, which if performed by the one or more processors, further cause the one or more processors to copy data stored at the one or more data addresses to a set of destination addresses (Col 5, lines 4-21, FIG. 2 illustrates the graphics adapter 102 of FIG. 1, according to one embodiment of the invention.  As shown, the graphics adapter 102 includes a graphics processing unit ("GPU") 200 and a global memory ("GMEM") 202; Col 6, lines 13-44, a local memory ("LMEM") that may be included within each streaming multiprocessor; Col 14, lines 44-62, FIG. 9 illustrates a flowchart of method steps for allocating work among the threads of a CTA when performing a non-transposed copy operation or a result tile computation, according to one embodiment of the invention.  For purposes of discussion only, it is assumed that one CTA executing on one of the streaming multiprocessors of the graphics processing unit 200 is either copying the elements of a 32x32 source tile 1000 stored in the GMEM 202, as illustrated in FIG. 10A, to local memory to create a 32x32 local memory tile 1002, as illustrated in FIG. 10B … the tile elements are copied to the corresponding tile element positions in local memory tile 1002 using the GMEM-to-LMEM address mapping shown in FIGS. 10A and 10B).

Yamamoto and Juffa are analogous art because both pertain to the use of an image data processing apparatus. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the calculation processing apparatus taught by Yamamoto to incorporate the teachings of Juffa, and to apply the matrix multiplication operations on a parallel processing device taught by Juffa so that the result matrix is computed on a tile-by-tile basis in a tiled matrix multiplication operation, as this could be used to reduce the number of times the memory is accessed by the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of Juffa to obtain the invention as specified in the claim.
However, the combination of Yamamoto in view of Juffa does not specifically disclose wherein at least two of the set of destination addresses correspond to a first data address of the one or more data addresses.
	In the similar field of endeavor, LOEWENSTEIN discloses wherein at least two of the set of destination addresses correspond to a first data address of the one or more data addresses (FIG. 1; paragraph [0024], system 100 comprises node 1 102(1), node 2 102(2), and node 3 102(3); paragraph [0026], each node 102 may comprise one or more processors 106, a main memory 108, and a storage 112; paragraph [0027], the main memory 108 of a node 102 comprises a plurality of memory locations.  For purposes of the present invention, a memory location may be of any desired size.  For example, a memory location may be as small as a single data word or as large as a page or larger.  A memory location may be accessed using a physical address.  This physical address may be mapped to one or more virtual addresses by way of an address translation table).
	It is noted that LOEWENSTEIN does not specifically disclose “the image data” corresponding to the virtual address and the physical address. However, the teachings of LOEWENSTEIN describe the memory mapping between the virtual addresses and the physical addresses by using an address translation table. It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the combination of Yamamoto in view of Juffa by adding the address translation table taught by LOEWENSTEIN to generate the memory address mapping and map one physical address to one or more virtual addresses for the convolution calculations. Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify Yamamoto according to the relied-upon teachings of LOEWENSTEIN to obtain the invention as specified in the claim.
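
For illustration only, a mapping in which two or more virtual (destination) addresses correspond to one physical address by way of a translation table can be sketched as follows. This is a simplified sketch and is not code from LOEWENSTEIN: the dictionary representation, the function names, and the example address values are assumptions chosen for the example.

```python
# Illustrative sketch only; the table layout and address values are hypothetical.

def build_translation_table(mappings):
    """Build a virtual-to-physical table from (virtual, physical) pairs."""
    table = {}
    for virtual, physical in mappings:
        table[virtual] = physical
    return table


def virtual_addresses_for(table, physical):
    """Return, in ascending order, every virtual address that maps to one physical address."""
    return sorted(v for v, p in table.items() if p == physical)


if __name__ == "__main__":
    table = build_translation_table([(0x10, 0x100), (0x20, 0x100), (0x30, 0x200)])
    # Two virtual addresses resolve to the same physical address 0x100.
    print([hex(v) for v in virtual_addresses_for(table, 0x100)])
```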

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Xilin Guo whose telephone number is (571)272-5786. The examiner can normally be reached Monday - Friday 9:00 AM-5:30 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/XILIN GUO/Primary Examiner, Art Unit 2616