DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Applicant’s amendment filed on February 22, 2022 has been considered and entered. 
Accordingly, claims 1-20 are pending in this application. Claims 1-3, 8-10 and 15-17 are currently amended; claims 4-6, 11-13, 18, and 20 are previously presented; claims 7, 14, and 19 are original.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-7, 9-10, and 15-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.
Claim 1 recites “the at least two compute units” in line 16. There is insufficient antecedent basis for this limitation in the claim. It is unclear whether the at least two compute units is to be interpreted as the plurality of compute units or at least two compute units of the plurality of compute units. For purposes of examination the quoted limitation is interpreted as at least two compute units of the plurality of compute units. Claims 2-7 inherit the same deficiency as claim 1 by reason of dependence.
Claim 2 recites “the plurality of lanes” in lines 3-4. There is insufficient antecedent basis for this limitation in the claim. Claim 3 recites a similar limitation in line 2 and is rejected for the same reason. Claims 9, 16, and 17 recite a similar limitation in lines 4, 4, and 2 respectively and are rejected for the same reason. For purposes of examination, this is interpreted as each of the two or more lanes of the given SIMD unit recited in claim 1. Further, claim 2 recites “the plurality of matrix elements”. It is unclear whether this refers to the plurality of matrix elements of the source matrix or the plurality of matrix elements of the plurality of tiles. For purposes of examination, this is interpreted as the plurality of matrix elements of the plurality of tiles.
Claim 5 recites “each matrix operations unit” in line 3. It is unclear how the matrix operations unit fits in with the system of claim 1. The matrix operations unit was deleted in amended claim 1; therefore, it is unclear whether the system further comprises matrix operations units. Further, it is unclear whether the recited limitation further limits the loading of the plurality of matrix elements or recites an additional step.
Claim 10 recites “the plurality of compute units” in line 3. There is insufficient antecedent basis for this limitation in the claim. Claim 8 recites a compute unit and not a plurality of compute units.
Claim 15 recites “a compute unit” in lines 15-16. It is unclear whether this is the same as or different from the compute unit recited in line 14. For purposes of examination this is interpreted as the same compute unit. Further, claim 15 recites “the cache” in line 17. There is insufficient antecedent basis for this limitation in the claim. For purposes of examination, this is interpreted as a cache. Claims 16-20 inherit the same deficiency as claim 15 by reason of dependence.
Claim 16 recites “further comprising a cache” in lines 1-2. It is unclear whether this is to be interpreted as a second cache or whether this refers to the cache recited in claim 15. For purposes of examination, this is interpreted as the cache recited in claim 15.

The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 3 and 17 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Claims 3 and 17 recite “wherein the number of the plurality of lanes is equal to M”. This limitation lacks written description support because, with respect to claims 1 and 17, M is the number of columns of the source matrix in the linear format. Figures 4-5 and paragraphs [0036-0038] describe the data layout for source matrices A and B for matrix multiplication. However, as recited in claim 1, the SIMD unit operates on tiles, and the source matrix is divided into multiple tiles where each tile includes fewer elements than the source matrix. Therefore, the data layouts shown in Figs. 4-5 are individual tiles and not the source matrix. Accordingly, the disclosure fails to describe that the number of the plurality of lanes of a SIMD unit is equal to the number of columns of a source matrix in the linear format. For these reasons, claims 3 and 17 contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor had possession of the claimed invention.
The Examiner notes that [0022] of the instant Specification gives support for the M number of columns of a tile of a first source matrix being equal to a number of lanes of a matrix operation unit. The Examiner recommends the Applicant amend the claim to align with this disclosure of the Specification. 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 5, 8-10, 12, 15-16, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over the AMD Graphics Core Next Architecture, Generation 3, hereinafter the R1200 ISA Book, in view of Haugen (US-PGPUB 20140201450 A1).
Regarding claim 1, the R1200 ISA Book teaches a system comprising:
a memory […] (Fig 1.1, memory – System memory and device memory);
a cache comprising a plurality of channels (Figs. 1.1 and 2.1, cache – L2 R/W cache, L1 R/W cache; page 2-4 section 2.4 lines 1-4 “the device consists of multiple channels of L2 read-only cache that provides data to an L1 cache for each compute unit”);
a processor […] (Fig. 1.1, processor - command processors, host CPU);
and a plurality of compute units, wherein each compute unit of the plurality of compute units comprises one or more single instruction, multiple data (SIMD) units (Fig. 1.1 and 2.1, plurality of compute units – compute units; one or more single instruction, multiple data (SIMD) units – SIMD0-SIMD3 per compute unit);
wherein during execution of a same kernel, each of two or more lanes of a given SIMD unit of the at least two compute units is configured to load from the cache, via at least two channels of the cache in parallel (Fig 2.1 shows each SIMD0 unit includes multiple lanes for example, VALU 0-15; page 1-2 last three paragraphs “The array is organized as a set of compute unit pipelines, each independent from the others, that operate in parallel on streams of floating-point or integer data. The compute unit pipelines can process data or, through the memory controller, transfer data to, or from, memory”; “Host commands request a compute unit pipeline to execute a kernel by passing it: an identifier pair (x, y), a conditional value, and the location in memory of the kernel code. When it receives a request, the compute unit pipeline loads instructions and data from memory, begins execution, and continues until the end of the kernel. As kernels are running, the GCN hardware automatically fetches instructions and data from memory into on-chip caches” page 10-2 first two paragraphs):
The R1200 ISA Book does not explicitly teach the memory storing a source matrix, comprising plurality of matrix elements, in a linear format, wherein the source matrix has M columns and N rows, wherein M and N are positive integers; the processor configured to convert the source matrix from the linear format to a tiling format, wherein in tiling format, the matrix elements of the source matrix are stored as a plurality of tiles, each of the plurality of tiles having fewer elements than the source matrix; and wherein during execution of a same kernel, each of two or more lanes of a given SIMD unit of the at least two compute units is configured to: load from the cache, via at least two channels of the cache in parallel, a given plurality of the matrix elements of the plurality of tiles; and perform a matrix operation on the given plurality of matrix elements to generate a result in the tiling format.
However, in the same field of endeavor, Haugen discloses several matrix storage formats in memory including row-major format, column-major format, and tile format. Further, Haugen discloses that the row-major and column-major storage formats yield a high number of unnecessary data transfers between hardware registers and cache memory when applied to various calculations, and that the data transfer issue may be resolved by dividing the matrix into smaller submatrices called tiles. Further, Haugen discloses an example where a 4x4 source matrix is divided into four 2x2 submatrices in the tiled format (Haugen Fig. 3 and paragraph [0052]). Further, Haugen discloses storing the tiles in the cache (Haugen paragraph [0047]). Further, Haugen discloses performing matrix operations such as matrix-matrix multiplication and matrix-vector multiplication on the matrices organized in the tile format (Haugen paragraph [0069]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify the R1200 ISA Book using Haugen, and configure the memory to store the matrix data in a linear format such as the row-major or column-major storage format, then configure the processor to include additional functionalities to convert the matrix data from the linear storage format to a tile storage format comprising multiple submatrices such as the 2x2 tiles disclosed by Haugen when transferring the source matrix from the memory to the cache for matrix operations by the compute units. Further, it would have been obvious to perform the matrix operations using the tiles, such as by using different compute units to process the tiles in parallel to speed up the matrix operations as disclosed by Haugen in paragraph [0077], and generate a result in the tile format.
The motivation to do so is because the data transfer issue may be resolved by dividing the matrix into smaller submatrices called tiles, and the data transfer between registers and cache may be minimized, and as a result, a larger fraction of the cycles may be spent on computations (Haugen paragraph [0052]). 
Therefore, the combination of the R1200 ISA Book as modified in view of Haugen teaches storing a source matrix, comprising plurality of matrix elements, in a linear format, wherein the source matrix has M columns and N rows, wherein M and N are positive integers; the processor configured to convert the source matrix from the linear format to a tiling format, wherein in tiling format, the matrix elements of the source matrix are stored as a plurality of tiles, each of the plurality of tiles having fewer elements than the source matrix; and wherein during execution of a same kernel, each of two or more lanes of a given SIMD unit of the at least two compute units is configured to: load from the cache, via at least two channels of the cache in parallel, a given plurality of the matrix elements of the plurality of tiles; and perform a matrix operation on the given plurality of matrix elements to generate a result in the tiling format.
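For illustration only, the linear-to-tile conversion relied upon above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Haugen or the R1200 ISA Book: the function name to_tiled, the element labels A through P, and the choice to store tiles in row-major order (with each tile itself stored row-major) are all assumed for the example.

```python
def to_tiled(linear, n_rows, n_cols, tile_h, tile_w):
    """Reorder a row-major buffer so that each tile_h x tile_w tile is contiguous.

    Assumption: tiles are emitted left to right, top to bottom, and the
    elements inside each tile are emitted in row-major order.
    """
    if n_rows % tile_h or n_cols % tile_w:
        raise ValueError("matrix dimensions must be multiples of the tile dimensions")
    tiled = []
    for tile_row in range(0, n_rows, tile_h):          # top row of each tile
        for tile_col in range(0, n_cols, tile_w):      # left column of each tile
            for r in range(tile_row, tile_row + tile_h):
                start = r * n_cols + tile_col          # row-major offset of this tile row
                tiled.extend(linear[start:start + tile_w])
    return tiled


# Hypothetical 4x4 source matrix, labeled A..P in row-major order.
linear = list("ABCDEFGHIJKLMNOP")
tiled = to_tiled(linear, n_rows=4, n_cols=4, tile_h=2, tile_w=2)
print("".join(tiled))  # ABEFCDGHIJMNKLOP
```

Under these assumptions the 4x4 matrix becomes four contiguous 2x2 tiles, each having fewer elements than the source matrix.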

Regarding claim 2, the R1200 ISA Book as modified in view of Haugen teaches all the limitations of claim 1 as stated above. Further, the R1200 ISA Book as modified in view of Haugen teaches wherein the plurality of matrix elements are loaded by the plurality of lanes in a single clock cycle (Fig. 2.1, page 2-4 section 2.4 lines 2-3, and section 10-1 page 10-2 first paragraph).

Regarding claim 5, the R1200 ISA Book as modified in view of Haugen teaches all the limitations of claim 1 as stated above. Further, the R1200 ISA Book as modified in view of Haugen teaches wherein the matrix operation is a matrix multiplication operation (Haugen paragraph [0069]), and wherein the given plurality of matrix elements are conveyed on a plurality of lanes of each matrix operations unit (R1200 ISA Book Fig. 2.1 shows that each of the matrix operations units (vGPR and vALU) reads data from the L1 cache through a respective channel different from a channel used by the other matrix operations units). The motivation to combine is the same as claim 1.

Regarding claims 8-9 and 12, they are directed to a method practiced by the apparatus of claims 1-2 and 5 respectively. All steps performed by the method of claims 8-9 and 12 would be practiced by the apparatus of claims 1-2 and 5. The analysis of claims 1-2 and 5 applies equally to claims 8-9 and 12.

Regarding claim 10, the R1200 ISA Book as modified in view of Haugen teaches all the limitations of claim 1 as stated above. Further, the R1200 ISA Book as modified in view of Haugen teaches further comprising: executing, by each compute unit in parallel, a kernel which is equivalent to kernels executed by others of the plurality of compute units (page 1-1 to 1-2 text and Fig. 1.1; “The array is organized as a set of compute unit pipelines, each independent from the others, that operate in parallel on streams of floating-point or integer data”; “the compute unit pipeline loads instructions and data from memory, begins execution, and continues until the end of the kernel”).

Regarding claim 15, the R1200 ISA Book teaches an apparatus comprising:
a memory […] (Fig 1.1, memory – System memory and device memory);
and a plurality of compute units, each comprising one or more single instruction, multiple data (SIMD) units, configured to (Figs. 1.1 and 2.1, compute units – compute units; one or more single instruction, multiple data (SIMD) units – SIMD0-SIMD3 per compute unit):
generate a request of the data (Fig. 1.1 shows the memory hierarchy of the system. Data to be processed by the compute units must be loaded from the memory to lower level cache, therefore, the compute units request the data to be operated);
while a compute unit of the plurality of compute units executes, in parallel, a same kernel, each of at least two lanes of a given SIMD unit of a compute unit of the compute units is configured to load from the cache, via at least two channels of the cache in parallel […] (Fig 2.1 shows each SIMD0 unit includes multiple lanes for example, VALU 0-15; page 1-2 last three paragraphs “The array is organized as a set of compute unit pipelines, each independent from the others, that operate in parallel on streams of floating-point or integer data. The compute unit pipelines can process data or, through the memory controller, transfer data to, or from, memory”; “Host commands request a compute unit pipeline to execute a kernel by passing it: an identifier pair (x, y), a conditional value, and the location in memory of the kernel code. When it receives a request, the compute unit pipeline loads instructions and data from memory, begins execution, and continues until the end of the kernel. As kernels are running, the GCN hardware automatically fetches instructions and data from memory into on-chip caches” page 10-2 first two paragraphs):
The R1200 ISA Book does not explicitly teach the memory storing a source matrix in a linear format, wherein the source matrix comprises a plurality of matrix elements, wherein the source matrix has M columns and N rows, and wherein M and N are positive integers; the plurality of compute units are configured to generate a request to convert the plurality of matrix elements from the linear format to a tiling format, wherein, in tiling format, the matrix elements of the source matrix are stored as a plurality of tiles, wherein each of the plurality of tiles has fewer elements than the source matrix; and load from the cache, via at least two channels of the cache in parallel, a given plurality of the matrix elements of the plurality of tiles; and perform a matrix operation on the plurality of matrix elements in the tiling format to generate a plurality of results in the tiling format.
However, in the same field of endeavor, Haugen discloses several matrix storage formats in memory including row-major format, column-major format, and tile format. Further, Haugen discloses that the row-major and column-major storage formats yield a high number of unnecessary data transfers between hardware registers and cache memory when applied to various calculations, and that the data transfer issue may be resolved by dividing the matrix into smaller submatrices called tiles. Further, Haugen discloses an example where a 4x4 source matrix is divided into four 2x2 submatrices in the tiled format (Haugen Fig. 3 and paragraph [0052]). Further, Haugen discloses performing matrix operations such as matrix-matrix multiplication and matrix-vector multiplication on the matrices organized in the tile format (Haugen paragraph [0069]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify the R1200 ISA Book using Haugen, and configure the memory to store the matrix data in a linear format such as a row-major or a column-major storage format, then configure the processor to include additional functionalities to convert the matrix data from the linear storage format to a tiled storage format comprising multiple submatrices, where each submatrix includes more than one element for each row and column such as the 2x2 tiles disclosed by Haugen, when transferring the source matrix from the memory to the cache for matrix operations by the compute units. Further, it would have been obvious to perform the matrix operation using the matrices in the tile format, and generate a result in the tiled format.
The motivation to do so is because the data transfer issue may be resolved by dividing the matrix into smaller submatrices called tiles, and the data transfer between registers and cache may be minimized, and as a result, a larger fraction of the cycles may be spent on computations (Haugen paragraph [0052]).
Therefore, the combination of the R1200 ISA Book as modified in view of Haugen teaches the memory storing a source matrix in a linear format, wherein the source matrix comprises a plurality of matrix elements, wherein the source matrix has M columns and N rows, and wherein M and N are positive integers; the plurality of compute units are configured to generate a request to convert the plurality of matrix elements from the linear format to a tiling format, wherein, in tiling format, the matrix elements of the source matrix are stored as a plurality of tiles, wherein each of the plurality of tiles has fewer elements than the source matrix; and load from the cache, via at least two channels of the cache in parallel, a given plurality of the matrix elements of the plurality of tiles; and perform a matrix operation on the plurality of matrix elements in the tiling format to generate a plurality of results in the tiling format.

Regarding claim 16, the R1200 ISA Book as modified in view of Haugen teaches all the limitations of claim 15 above. Further, the R1200 ISA Book as modified in view of Haugen teaches the apparatus further comprising a cache, wherein the plurality of matrix elements are loaded by the plurality of lanes in a single clock cycle (Figs. 1.1 and 2.1, cache – L2 R/W cache, L1 R/W cache; load data in parallel in a single clock cycle – page 2-4 section 2.4 lines 2-3, and section 10-1 page 10-2 first paragraph).

Regarding claim 19, the R1200 ISA Book as modified in view of Haugen teaches all the limitations of claim 15 above. Further, the R1200 ISA Book as modified in view of Haugen teaches wherein matrix elements are conveyed on a plurality of lanes of each compute unit of the plurality of compute units (R1200 ISA Book Fig. 2.1 shows that each of the vGPR and vALU units reads data from the L1 cache through multiple lanes of each compute unit), and wherein the matrix operation is a matrix multiplication operation (Haugen paragraph [0069]). The motivation to combine is the same as claim 15.

Claims 6, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over the R1200 ISA Book, in view of Haugen as applied to claims 1, 8 and 15 respectively, and further in view of Guo et al. (NPL – “A Survey of FPGA-Based Neural Network Inference Accelerator”), hereinafter Guo.
Regarding claims 6 and 20, the R1200 ISA Book as modified in view of Haugen teaches all the limitations of claims 1 and 15 respectively as stated above. 
The combination of the R1200 ISA Book and Haugen thus far does not explicitly teach wherein a classification of the first dataset is generated during execution of a machine learning engine application.
However, in the same field of endeavor, Guo teaches several neural network accelerator hardware designs implemented using a field-programmable gate array (FPGA) for improving speed and energy efficiency. Further, these implementations were used for classification of a first dataset using the result of matrix multiplication during execution of a machine learning application (Guo, Introduction and Pages 7-16).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify the R1200 ISA Book in view of Haugen using Guo and configure the system disclosed by the R1200 ISA Book to accelerate a neural network computation and use the result generated by the matrix operations units for classification of a first dataset such as in an image processing application. One of ordinary skill in the art would have been capable of applying this known technique to a known device that was ready for improvement and the results would have been predictable to one of ordinary skill in the art. See MPEP 2141.III.D.
Therefore, the combination of the R1200 ISA Book as modified in view of Haugen and Guo teaches wherein the classification of the first dataset is generated during execution of a machine learning engine application.

Regarding claim 13, it is directed to a method practiced by the apparatus of claim 6. All steps performed by the method of claim 13 would be practiced by the apparatus of claim 6. The analysis of claim 6 applies equally to claim 13.

Claims 4, 7, 11, 14 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over the R1200 ISA Book, in view of Haugen as applied to claims 1, 8 and 15, and further in view of Carlson et al. (US-PGPUB 20100241824 A1), hereinafter Carlson.
Regarding claim 4, the R1200 ISA Book as modified in view of Haugen teaches all the limitations of claim 1 as stated above. 
The R1200 ISA Book does not explicitly teach wherein responsive to receiving a request to convert a plurality of matrix elements of a first source matrix from the linear format to a tiling format, the processor is configured to: read values from sequential locations of a first buffer in the memory, wherein the first buffer stores matrix elements in the linear format; and step through a second buffer with a stride equal to a tile height while writing the values to the second buffer, wherein the second buffer stores matrix elements in the tiling format.
However, in the same field of endeavor, Carlson teaches a system and a method for converting a matrix stored in a row-major or column-major storage format into a format suitable for single instruction, multiple data (SIMD) architectures, i.e. SIMD format. The method includes reading the matrix elements from sequential locations of a first buffer in the memory in the linear format and stepping through a second buffer while writing the values to the second buffer, wherein the second buffer stores matrix elements in the SIMD format (Carlson Figs. 4-6 and paragraphs [0038-0039]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify the R1200 ISA Book in view of Haugen using Carlson and configure the processor to perform the linear storage format to tiled storage format conversion by sequentially reading the matrix elements stored in memory in the linear format and writing the matrix elements to a second buffer, such as a different block in memory, in the tiling format, and further to step through the second buffer according to the determined tile height. As shown in Fig. 3 of Haugen, the matrix elements C and D of the row-major matrix 304, which correspond to indices 2 and 3 in the row-major format, are written in the tile format at indices 4 and 5, which corresponds to an offset value equal to the tile height of the 2x2 submatrices. Both Haugen and Carlson disclose matrix storage formats that can be used by hardware that supports SIMD instructions. Therefore, one of ordinary skill in the art would have been capable of applying this known technique to a known device that was ready for improvement and the results would have been predictable to one of ordinary skill in the art. See MPEP 2141.III.D.
Therefore, the combination of the R1200 ISA Book as modified in view of Haugen and Carlson teaches wherein responsive to receiving a request to convert a plurality of matrix elements of a first source matrix from the linear format to a tiling format, the processor is configured to: read values from sequential locations of a first buffer in the memory, wherein the first buffer stores matrix elements in the linear format; and step through a second buffer with a stride equal to a tile height while writing the values to the second buffer, wherein the second buffer stores matrix elements in the tiling format.
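For illustration only, the conversion discussed above (sequential reads from a first buffer and non-sequential writes into a second buffer) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of Carlson, Haugen, or the instant application: the function name convert_linear_to_tiled, the destination-index formula, and the A through P labels are assumed, and the sketch does not attempt to resolve how the claimed "stride equal to a tile height" is to be construed.

```python
def convert_linear_to_tiled(first_buffer, n_rows, n_cols, tile_h, tile_w):
    """Read first_buffer sequentially and write each value into a tiled second buffer.

    Assumptions: first_buffer is row-major; tiles are stored contiguously in
    left-to-right, top-to-bottom order; each tile is stored row-major.
    """
    second_buffer = [None] * (n_rows * n_cols)
    tiles_per_row = n_cols // tile_w
    tile_size = tile_h * tile_w
    for src in range(len(first_buffer)):                 # sequential read locations
        row, col = divmod(src, n_cols)
        tile_id = (row // tile_h) * tiles_per_row + (col // tile_w)
        # Position inside the tile, then offset by the start of that tile.
        dst = tile_id * tile_size + (row % tile_h) * tile_w + (col % tile_w)
        second_buffer[dst] = first_buffer[src]           # stepped write location
    return second_buffer


# Hypothetical 4x4 matrix A..P in row-major order, converted to 2x2 tiles.
second = convert_linear_to_tiled(list("ABCDEFGHIJKLMNOP"), 4, 4, 2, 2)
print(second.index("C"), second.index("D"))  # 4 5
```

Under these assumptions, the elements read from indices 2 and 3 of the first buffer are written at indices 4 and 5 of the second buffer, matching the C and D example given above for Fig. 3 of Haugen.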

Regarding claim 7, the R1200 ISA Book as modified in view of Haugen and Carlson teaches all the limitations of claim 4 as stated above. Further, the R1200 ISA Book as modified in view of Haugen and Carlson teaches wherein elements of a first source matrix that are stored in consecutive memory locations in the linear format are stored in memory locations which are separated by a tile height in the tiling format (Haugen Figs. 3-4 and paragraph [0053]). The reason to combine is the same as claim 4.

Regarding claims 11 and 14, they are directed to a method practiced by the apparatus of claims 4 and 7 respectively. All steps performed by the method of claims 11 and 14 would be practiced by the apparatus of claims 4 and 7. The analysis of claims 4 and 7 applies equally to claims 11 and 14.

Regarding claim 18, the R1200 ISA Book as modified in view of Haugen teaches all the limitations of claim 15 above. Further, the R1200 ISA Book teaches the apparatus further comprising a command processor (Fig. 1.1 command processor – command processors).
The R1200 ISA Book does not explicitly teach wherein responsive to receiving a request to convert a plurality of matrix elements of a first source matrix from the linear format to a tiling format, the command processor is configured to: read values from sequential locations of a first buffer in the memory, wherein the first buffer stores matrix elements in the linear format; and step through a second buffer with a stride equal to a tile height while writing the values to the second buffer, wherein the second buffer stores matrix elements in the tiling format.
However, in the same field of endeavor, Carlson teaches a system and a method for converting a matrix stored in a row-major or column-major storage format into a format suitable for single instruction, multiple data (SIMD) architectures, i.e. SIMD format. The method includes reading the matrix elements from sequential locations of a first buffer in the memory in the linear format and stepping through a second buffer while writing the values to the second buffer, wherein the second buffer stores matrix elements in the SIMD format (Carlson Figs. 4-6 and paragraphs [0038-0039]).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to modify the R1200 ISA Book in view of Haugen using Carlson and configure the command processor to perform the linear storage format to tiled storage format conversion by sequentially reading the matrix elements stored in memory in the linear format and writing the matrix elements to a second buffer, such as a different block in memory, in the tiled format, and further to step through the second buffer according to the determined tile height. As shown in Fig. 3 of Haugen, the matrix elements C and D of the row-major matrix 304, which correspond to indices 2 and 3 in the row-major format, are written in the tile format at indices 4 and 5, which corresponds to an offset value equal to the tile height of the 2x2 submatrices. Both Haugen and Carlson disclose matrix storage formats that can be used by hardware that supports SIMD instructions. Therefore, one of ordinary skill in the art would have been capable of applying this known technique to a known device that was ready for improvement and the results would have been predictable to one of ordinary skill in the art. See MPEP 2141.III.D.
Therefore, the combination of the R1200 ISA Book as modified in view of Haugen and Carlson teaches wherein responsive to receiving a request to convert a plurality of matrix elements of a first source matrix from the linear format to a tiling format, the command processor is configured to: read values from sequential locations of a first buffer in the memory, wherein the first buffer stores matrix elements in the linear format; and step through a second buffer with a stride equal to a tile height while writing the values to the second buffer, wherein the second buffer stores matrix elements in the tiling format.
Response to Arguments
Applicant’s arguments, see remarks pages 1-4, filed 02/22/2022, with respect to the rejections of claims 1-20 under 35 U.S.C. 103 have been fully considered and are not persuasive.  
In response to applicant’s arguments with respect to the 35 U.S.C. 103 rejection of claim 1, applicant argued that the R1200 ISA Book and Haugen do not disclose or suggest, individually or in combination, the feature wherein each of the plurality of lanes of a given SIMD unit loads in parallel from a cache, as recited in amended claim 1. 
Examiner respectfully disagrees. As shown in Fig. 2.1 of the R1200 ISA Book, each SIMD unit includes multiple lanes, and each SIMD unit within each compute unit reads data from the cache. Furthermore, SIMD units are by definition parallel execution units that execute the same instruction. See page 1-2 of the R1200 ISA Book, which discloses that “The array is organized as a set of compute unit pipelines, each independent from the others, that operate in parallel on streams of floating-point or integer data”. Therefore, the above features fairly suggest the claimed “wherein during execution of a same kernel, each of two or more lanes of a given SIMD unit of the at least two compute units is configured to: load from the cache, via at least two channels of the cache in parallel” as recited in claim 1. Furthermore, the claim does not recite that each of the plurality of lanes of a given SIMD unit loads in parallel from a cache, as argued by applicant. The claim only requires two or more lanes to load in parallel from the cache. See also Haugen paragraphs [0060, 0063, 0068], which discuss SIMD architectures and disclose that execution units within a SIMD unit execute the same instruction in parallel.
In response to applicant’s arguments with respect to the 35 U.S.C. 103 rejection of claim 7, applicant argued that Haugen does not disclose the claimed feature of “wherein elements of a first source matrix that are stored in consecutive memory locations in the linear format are stored in memory locations which are separated by a tile height in the tiling format” and pointed to Figs. 3-4 and paragraph [0052] of Haugen.
Examiner respectfully disagrees. For example, comparing 304 and 308 as shown in Fig. 3, the element C is at the third location in 304, while in the tiled format 308 the element C is at the fifth location, separated by a tile height of 2, the individual tile 310 being of size 2x2. The same holds for element D, which is at the fourth location in 304 and the sixth location in 308, likewise separated by 2.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Carlo Waje whose telephone number is (571)272-5767.  The examiner can normally be reached on 9:00-6:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta can be reached on (571) 270-3995.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/C.W./
Carlo Waje
Examiner, Art Unit 2182
571-272-5767


/MICHELLE T BECHTOLD/
Primary Examiner, Art Unit 2183