Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This Office Action is responsive to Applicants' Amendment filed on September 26, 2022, in which claims 1, 3-10, and 19 are amended.  Claims 1-20 are currently pending.

Response to Arguments
Applicant’s arguments with respect to the rejection of claims 1-20 under 35 U.S.C. § 101, based on the amendment, have been considered and are persuasive.  The rejections of claims 1-20 under 35 U.S.C. § 101 are hereby withdrawn, as necessitated by Applicant's amendments and remarks.

Applicant’s arguments with respect to the rejection of claims 1-20 under 35 U.S.C. § 102, based on the amendment, have been considered and are persuasive.  However, the arguments are moot in view of the new ground of rejection set forth below.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.


	Claims 1-3, 5, 7, 9-12, 14, 16, and 18-20 are rejected under 35 U.S.C. § 103 as being unpatentable over the combination of Woolley (US 2016/0162402 A1) and Santara (“Faster learning of deep stacked autoencoders on multi-core systems using synchronized layer-wise pre-training”, 2016).

	 Regarding claim 1, Woolley teaches A computing device, comprising: a memory configured to store instructions; and a processor coupled to the memory and configured to execute the instructions([¶0112] "the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM)...a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.")
	to cause the computing device to: operate a neural network to perform a plurality of operations, wherein the neural network comprises a plurality of neural network layers,([¶0007] "In this regard, a CNN typically includes multiple “convolution layers,” where each convolution layer performs convolution operations across multiple dimensions of a sample data batch and multiple dimensions of a filter stack")
	perform a first operation on a first matrix M times to obtain a second matrix, wherein the first operation is performed by the Kth neural network layer;(See FIG. 2 230 for cluster of general purpose computers. [¶0062] "In the context of FIG. 4, the streaming multiprocessor (SM) 310 is configured to perform a multi-convolution operation between the image batch 410 and the filter stack 440 to produce the output batch 470" See FIG. 3 streaming multiprocessor is on GPC. [¶0107] "The convolution engine divides the virtual image matrix into separate image tiles and then assigns the processing of each image tile to a different thread group." Image tile interpreted as synonymous with matrix. Output batch of first layer interpreted as synonymous with second matrix.)
	perform a second operation on the second matrix, wherein the second operation is performed by the (K+1)th neural network layer, and wherein K is a positive integer greater than or equal to 1;([¶0007] "a CNN typically includes multiple “convolution layers,” where each convolution layer performs convolution operations across multiple dimensions of a sample data batch and multiple dimensions of a filter stack." See FIG. 2 230 for cluster of general purpose computers. [¶0062] "In the context of FIG. 4, the streaming multiprocessor (SM) 310 is configured to perform a multi-convolution operation between the image batch 410 and the filter stack 440 to produce the output batch 470" See FIG. 3 streaming multiprocessor is on GPC. [¶0107] "The convolution engine divides the virtual image matrix into separate image tiles and then assigns the processing of each image tile to a different thread group." Image tile interpreted as synonymous with matrix.  Woolley teaches that the convolution operation of FIG. 4 is performed for every convolution layer and that a CNN typically includes a second convolution layer.)
	perform an ith first operation of the M first operations on the first matrix to obtain an ith data element of the second matrix, ([¶0015] "The method includes selecting a first start address based on a first destination address included in a first image tile that is stored in a first memory; identifying a first offset based on the first destination address...and after copying the data, performing one or more matrix multiplication operations between the first image tile and a first filter tile." [¶0040] " In particular, CPU 102 issues commands that control the operation of PPU 202" See also FIG. 1, FIG. 2.  CPU 102 interpreted as synonymous with control device)
	wherein the ith first operation comprises only a portion of a total number of the M first operations, and wherein 1<i<M; store the ith data element of the second matrix;([¶0061] "to exploit the optimized matrix multiplication routine without straining the PP memory 204, for each “image tile,” the convolution subsystem 180 generates the image tile as-needed, processes the image tile, and then discards the image tile...Advantageously, only a portion of the image matrix is stored in the shared memory 382 at any given time")
	and wherein either the first operation is a convolution operation and the second operation is a convolution operation or a pooling operation, or the first operation is a pooling operation and the second operation is a convolution operation.([¶0007] "a CNN typically includes multiple “convolution layers,” where each convolution layer performs convolution operations across multiple dimensions of a sample data batch and multiple dimensions of a filter stack." [¶0017] "One advantage of the disclosed techniques is that applications may perform multi-convolution operations via an optimized matrix multiplication routine while optimizing parallel processing memory usage" [¶0062] "In the context of FIG. 4, the streaming multiprocessor (SM) 310 is configured to perform a multi-convolution operation between the image batch 410 and the filter stack 440 to produce the output batch 470").
	While it would be obvious to one of ordinary skill in the art that two consecutive neural network layers may be structurally adjacent, Woolley does not explicitly teach wherein the plurality of neural network layers comprises a Kth neural network layer and a (K+1)th neural network layer, wherein the Kth neural network layer and the (K+1)th neural network layer are adjacent layers in the neural network,
	wherein the Kth neural network layer comprises first neurons, wherein the (K+1)th neural network layer comprises second neurons,
	wherein the first neurons are communicatively coupled to the second neurons, wherein the Kth neural network layer is configured to perform M first operations to produce data used by the (K+1)th neural network layer, and wherein M is a positive integer not less than 1;
	and before the M first operations of the first matrix are completed, trigger controlling, in response to the ith data element being stored, the (K+1)th neural network layer to perform the second operation one time when only a part of intermediate data that is between the Kth neural network layer and the (K+1)th neural network layer is stored and before all of the M first operations are performed, wherein the ith data element is sufficient for performing the second operation one time.

	Santara, in the same field of endeavor, teaches  wherein the plurality of neural network layers comprises a Kth neural network layer and a (K+1)th neural network layer, wherein the Kth neural network layer and the (K+1)th neural network layer are adjacent layers in the neural network, ([p. 6 §3.1] "Due to data dependency Tl waits until Tl−1 has completed one epoch of training. Once Tl starts executing, every time it completes one epoch, it transforms DTl and DVl using the current weights and biases Wl[n] and bl[n] and modifies DTl+1 and DVl+1 as: [Eqn. 6]" Layer Tl-1 interpreted as synonymous with the Kth layer and Layer Tl interpreted as synonymous with the (K+1)th layer.)
	wherein the Kth neural network layer comprises first neurons, wherein the (K+1)th neural network layer comprises second neurons, ([p. 3 §2.1] "A stacked autoencoder [2, 3] has multiple hidden layers of neurons between the input and output layers. A stacked autoencoder of depth n has an input layer, 2n − 3 hidden layers, and an output layer of neurons.")
	wherein the first neurons are communicatively coupled to the second neurons, wherein the Kth neural network layer is configured to perform M first operations to produce data used by the (K+1)th neural network layer, and wherein M is a positive integer not less than 1;([p. 6 §3.1] "Due to data dependency Tl waits until Tl−1 has completed one epoch of training. Once Tl starts executing, every time it completes one epoch, it transforms DTl and DVl using the current weights and biases Wl[n] and bl[n] and modifies DTl+1 and DVl+1 as: [Eqn. 6]" Transforming DTl and DVl interpreted as first operation.  FIG. 2 shows at least M=4 epochs for the first layer each of which result in a transformation and modification operation.)
	and before the M first operations of the first matrix are completed, trigger controlling, in response to the ith data element being stored, the (K+1)th neural network layer to perform the second operation one time when only a part of intermediate data that is between the Kth neural network layer and the (K+1)th neural network layer is stored and before all of the M first operations are performed, wherein the ith data element is sufficient for performing the second operation one time, ([p. 2 §1] "The second category of algorithms carry out calculations relative to the entire network but specific to subsets of data" [p. 6 §3.1] "Due to data dependency Tl waits until Tl−1 has completed one epoch of training. Once Tl starts executing, every time it completes one epoch, it transforms DTl and DVl using the current weights and biases Wl[n] and bl[n] and modifies DTl+1 and DVl+1 as: [Eqn. 6]" See also FIG. 2.  Santara teaches that the algorithm is carried out on a subset of the data (first epoch of the first layer interpreted as synonymous with a part of the intermediate data).  End of epoch transformation and modification for layer L2 interpreted as synonymous with a second operation performed before all of the M first operations are performed.  DT_l interpreted as synonymous with ith data element.).

	Both Woolley and Santara are directed towards parallelizing neural network operations.  Therefore, Woolley and Santara are analogous art in the same field of endeavor.  It would have been obvious before the effective filing date of the claimed invention to combine the teachings of Woolley with the teachings of Santara by combining the data parallel technique in Woolley with the model parallelization in Santara.  Santara provides as additional motivation for combination ([Abstract] "In this paper a synchronized parallel algorithm for pre-training deep networks on multi-core machines has been proposed. Different layers are trained by parallel threads running on different cores with regular synchronization. Thus the pre-training process becomes faster and chances of over-training are reduced. This is experimentally validated using a stacked autoencoder for dimensionality reduction of MNIST handwritten digit database. The proposed algorithm achieved 26% speed-up compared to greedy layer-wise pre-training for achieving the same reconstruction accuracy substantiating its potential as an alternative.").  This motivation for combination also applies to the remaining claims which depend on this combination.
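
	For illustration only, the triggering limitation of claim 1 (performing the second operation once the stored part of the intermediate data is sufficient, before all M first operations are completed) may be sketched as follows.  This is a minimal sketch under stated assumptions and is not taken from the claims, Woolley, or Santara: the function name run_pipelined, the window parameter, and the stand-in operations (doubling and max-pooling) are hypothetical.

```python
from collections import deque


def run_pipelined(first_matrix, first_op, second_op, window):
    """Apply first_op element by element and fire second_op as soon as
    `window` results are stored, without waiting for all M first operations."""
    m = len(first_matrix)
    stored = deque(maxlen=window)  # holds only part of the intermediate data
    second_results = []
    early_triggers = 0
    for i, element in enumerate(first_matrix, start=1):
        stored.append(first_op(element))  # i-th first operation, result stored
        if len(stored) == window:  # stored data suffices for one second operation
            second_results.append(second_op(list(stored)))
            if i < m:  # fired before all M first operations completed
                early_triggers += 1
    return second_results, early_triggers


# Hypothetical stand-ins: doubling as the first operation, max-pooling as the second.
results, early = run_pipelined([1, 2, 3, 4, 5], lambda x: 2 * x, max, window=3)
print(results, early)
```

	With these hypothetical inputs (M = 5), the second operation runs three times, two of which occur before the fifth first operation has been performed.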

	 Regarding claim 2, the combination of Woolley, and Santara teaches The computing device of claim 1, wherein the ith data element is stored in a first storage unit, wherein the first storage unit comprises a first line buffer, wherein the first line buffer comprises N registers, wherein the N registers in the first line buffer sequentially store elements of a third matrix in row-major order or column-major order,([¶0009] "the convolution engine converts the image batch into a column-major image matrix and expresses the filter stack as a filter matrix" [¶0045] "Memory interface 214 includes a set of D of partition units 215, where D≧1. Each partition unit 215 is coupled to one or more dynamic random access memories (DRAMs) 220 residing within PPM memory 204. In one embodiment, the number of partition units 215 equals the number of DRAMs 220... In operation, various render targets, such as texture maps and frame buffers, may be stored across DRAMs 220, allowing partition units 215 to write portions of each render target in parallel to efficiently use the available bandwidth of PP memory" [¶0098] "In general, components included in the computer system 100 may store any of the image batch 410, the filter stack 440, the offset sequence 640, and/or the output matrix 860 in any type of memory structure included in the PP memory... any number, including zero, of the image batch 410, the filter stack 440, the offset sequence 640, and/or the output matrix 860 may be included in a frame buffer.".)
	 the third matrix is a matrix that is obtained after zero adding is performed on the second matrix while performing the second operation on the second matrix, wherein N=(h−1)×(W+p)+w, wherein h represents a quantity of rows of a kernel corresponding to the second operation, w represents a quantity of columns of the kernel corresponding to the second operation, W represents a quantity of columns of the second matrix, p represents a quantity of rows or a quantity of columns of elements 0 that are to be added to the second matrix to perform the second operation on the second matrix, and wherein h, w, p, W, and N are all positive integers not less than 1. (Woolley [¶0066] " For example, in some embodiments, the parameters 465 may include a padding height and a padding width. The padding height and the padding width append, respectively, rows of zeros and columns of zeros to output images" zero padding interpreted as synonymous with zero adding. [¶0010] "suppose that the image width were W, the image height were H, the number of color planes per image were C, and the number of images in the image batch were N. Further, suppose that the dimensions of each of the output images were (P×Q). In such a scenario, the dimensions of the image matrix would be (N×P×Q)×(C×R×S)." With respect to Woolley NxPxQ=(h-1) and (CxRxS)=W. Woolley further explicitly teaches that a padding height and padding width are appended and that the operations are applied to a buffer in column-major form).
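
	For illustration only, the register count recited in claim 2, N=(h−1)×(W+p)+w, may be evaluated as in the sketch below.  The function name line_buffer_registers and the example sizes are hypothetical; only the formula and the requirement that h, w, W, and p be positive integers come from the claim language.

```python
def line_buffer_registers(h, w, W, p):
    """Return N = (h - 1) * (W + p) + w for positive integers h, w, W, p."""
    for name, value in (("h", h), ("w", w), ("W", W), ("p", p)):
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return (h - 1) * (W + p) + w


# Hypothetical sizes: 3x3 kernel, second matrix with 5 columns, padding of 1.
n_registers = line_buffer_registers(h=3, w=3, W=5, p=1)
print(n_registers)
```

	For these hypothetical sizes the formula gives (3−1)×(5+1)+3 = 15 registers, i.e. two full padded rows plus one kernel width.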
	
	 Regarding claim 3, the combination of Woolley, and Santara teaches The computing device of claim 2, further comprising a crossbar, wherein X target registers of the N registers are directly connected to X rows of the crossbar respectively, wherein the X target registers are a [1+k×(W+p)]th register to a [w+k×(W+p)]th register of the N registers, wherein a value of k is a positive integer ranging from 0 to h−1, wherein X=h×w, and  instructions further cause the computing device to:(Woolley See FIG. 2 210.  [¶0046] "A given GPCs 208 may process data to be written to any of the DRAMs 220 within PP memory 204. Crossbar unit 210 is configured to route the output of each GPC 208 to the input of any partition unit 215 or to any other GPC 208 for further processing." FIG. 3 384 L1 cache interpreted as synonymous with register. The crossbar routing to any other GPC (which is shown to contain registers) is interpreted as synonymous with the crossbar directly connected to the X registers.)
	store the ith data element of the second matrix into the first line buffer; and(Woolley [¶0015] "computing a first source address included in an image batch that is stored in a second memory based on the first start address and the first offset; copying data from the first source address to the first destination address" See also FIG. 2 202, 204.)
	control the crossbar to operate and perform the second operation on data elements stored in the X target registers in response to the data elements currently stored in the X target registers being sufficient for performing the second operation.(Woolley [¶0050] "Operation of GPC 208 is controlled via a pipeline manager 305 that distributes processing tasks received from a work distribution unit (not shown) within task/work unit 207 to one or more streaming multiprocessors (SMs) 310. Pipeline manager 305 may also be configured to control a work distribution crossbar 330 by specifying destinations for processed data output by SMs 310." FIG. 3 384 L1 cache interpreted as synonymous with register.).
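
	For illustration only, the X target registers recited in claim 3 (the [1+k×(W+p)]th through [w+k×(W+p)]th registers for k from 0 to h−1) may be enumerated as in the sketch below.  The function name target_registers and the example sizes are hypothetical; the index ranges follow the claim language, with 1-based register numbering assumed.

```python
def target_registers(h, w, W, p):
    """List the 1-based register numbers [1 + k(W+p)] .. [w + k(W+p)], k = 0..h-1."""
    stride = W + p  # registers per padded row of the third matrix
    return [
        register
        for k in range(h)
        for register in range(1 + k * stride, w + k * stride + 1)
    ]


# Hypothetical sizes: 3x3 kernel, second matrix with 5 columns, padding of 1.
taps = target_registers(h=3, w=3, W=5, p=1)
print(taps)
```

	For these hypothetical sizes the list contains X = h×w = 9 registers (1-3, 7-9, and 13-15), and the highest-numbered target register coincides with N = 15 from the formula of claim 2.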
	
	 Regarding claim 5, the combination of Woolley, and Santara teaches The computing device of claim 2, wherein the instructions further cause the computing device to: perform, in an nth clock cycle, the ith first operation on the first matrix to obtain the ith data element of the second matrix, (Woolley [¶0015] "The method includes selecting a first start address based on a first destination address included in a first image tile that is stored in a first memory; identifying a first offset based on the first destination address...and after copying the data, performing one or more matrix multiplication operations between the first image tile and a first filter tile." [¶0040] " In particular, CPU 102 issues commands that control the operation of PPU 202" See also FIG. 1, FIG. 2.  CPU 102 interpreted as synonymous with control device)
	wherein the ith data element of the second matrix is located in a last row of the second matrix, and wherein an (i+1)th data element of the second matrix is located at a starting location of a column next to a column in which the ith data element is located; and(Woolley [¶0009] "the convolution engine converts the image batch into a column-major image matrix and expresses the filter stack as a filter matrix." Starting location of a next column interpreted as synonymous with first row of a matrix.  Limitation interpreted as synonymous with reading matrix in column-major order.)
	perform, in an (n+t)th clock cycle, an (i+1)th first operation of the M first operations on the first matrix, wherein t is a positive integer greater than 1; and(Woolley [¶0007] "a CNN typically includes multiple “convolution layers,” where each convolution layer performs convolution operations across multiple dimensions of a sample data batch and multiple dimensions of a filter stack" [¶0009] "the convolution engine converts the image batch into a column-major image matrix and expresses the filter stack as a filter matrix." [¶0012] "One drawback of tile-based convolution engines, however, is that calculating the address sequence needed to load the image data in the correct order to expand a tile of the expanded image matrix involves performing a sequence of dependent integer operations. This sequence of integer operations typically requires a relatively large number of clock cycles to execute. Oftentimes, the number of clock cycles required to perform the integer operations can exceed the number of clock cycles required to perform the matrix multiplication operations" n is not bounded by the limitations.  Variable t is interpreted as necessarily being an integer value since a fractional clock cycle would not be understood by one of ordinary skill in the art.  Therefore the limitation is interpreted as simply being performed after the first operation for the second matrix.)
	store, in at least one clock cycle of an (n+1)th clock cycle and the (n+t)th clock cycle, an element 0 in the first line buffer.(Woolley [¶0081] "As outlined in conjunction with FIG. 5, while the serpentine pattern of each column is offset from the serpentine pattern of the other columns, the serpentine pattern represents a uniform sequence of offsets for every row of the virtual image matrix 510" [¶0082] "For example, the first column of the virtual image matrix 510 is associated with the source address sequence 0, 4, 12, 16, 26, 40, 48, 52, 72, 76, 84, and 88" [¶0098] " For example, any number, including zero, of the image batch 410, the filter stack 440, the offset sequence 640, and/or the output matrix 860 may be included in a frame buffer." Woolley explicitly teaches placing 0 into virtual image matrix offset and further that the offset may be included in the frame buffer.  Frame buffer is interpreted as synonymous with line buffer.).
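
	For illustration only, the ordering recited in claim 5 (an ith data element in the last row, an element 0 stored in an intervening clock cycle, and an (i+1)th data element at the start of the next column) may be sketched as a column-major stream with zeros inserted after each column.  The function name and the choice of writing exactly p zeros after each column are assumptions made for the sketch and are not taken from the claims or from Woolley.

```python
def stream_column_major_with_zeros(matrix, p):
    """Yield matrix elements column by column, then p zeros after each column."""
    rows, cols = len(matrix), len(matrix[0])
    for c in range(cols):
        for r in range(rows):
            yield matrix[r][c]  # data elements of the current column
        for _ in range(p):
            yield 0  # zero stored before the next column begins


# Hypothetical 2x2 second matrix with one zero written after each column.
padded_stream = list(stream_column_major_with_zeros([[1, 2], [3, 4]], p=1))
print(padded_stream)
```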
	
	 Regarding claim 7, the combination of Woolley, and Santara teaches The computing device of claim 3, wherein the instructions further cause the computing device to: perform, in an nth clock cycle, the ith first operation on the first matrix to obtain the ith data element of the second matrix, (Woolley [¶0015] "The method includes selecting a first start address based on a first destination address included in a first image tile that is stored in a first memory; identifying a first offset based on the first destination address...and after copying the data, performing one or more matrix multiplication operations between the first image tile and a first filter tile." [¶0040] " In particular, CPU 102 issues commands that control the operation of PPU 202" See also FIG. 1, FIG. 2.  CPU 102 interpreted as synonymous with control device)
	wherein the ith data element of the second matrix is located in a last row of the second matrix, and wherein an (i+1)th data element of the second matrix is located at a starting location of a column next to a column in which the ith data element is located; and(Woolley [¶0009] "the convolution engine converts the image batch into a column-major image matrix and expresses the filter stack as a filter matrix." Starting location of a next column interpreted as synonymous with first row of a matrix.  Limitation interpreted as synonymous with reading matrix in column-major order.)
	perform, in an (n+t)th clock cycle, an (i+1)th first operation of the M first operations on the first matrix, wherein t is a positive integer greater than 1; and(Woolley [¶0007] "a CNN typically includes multiple “convolution layers,” where each convolution layer performs convolution operations across multiple dimensions of a sample data batch and multiple dimensions of a filter stack" [¶0009] "the convolution engine converts the image batch into a column-major image matrix and expresses the filter stack as a filter matrix." [¶0012] "One drawback of tile-based convolution engines, however, is that calculating the address sequence needed to load the image data in the correct order to expand a tile of the expanded image matrix involves performing a sequence of dependent integer operations. This sequence of integer operations typically requires a relatively large number of clock cycles to execute. Oftentimes, the number of clock cycles required to perform the integer operations can exceed the number of clock cycles required to perform the matrix multiplication operations" n is not bounded by the limitations.  Variable t is interpreted as necessarily being an integer value since a fractional clock cycle would not be understood by one of ordinary skill in the art.  Therefore the limitation is interpreted as simply being performed after the first operation for the second matrix.)
	store, in at least one clock cycle of an (n+1)th clock cycle and the (n+t)th clock cycle, an element 0 in the first line buffer.(Woolley [¶0081] "As outlined in conjunction with FIG. 5, while the serpentine pattern of each column is offset from the serpentine pattern of the other columns, the serpentine pattern represents a uniform sequence of offsets for every row of the virtual image matrix 510" [¶0082] "For example, the first column of the virtual image matrix 510 is associated with the source address sequence 0, 4, 12, 16, 26, 40, 48, 52, 72, 76, 84, and 88" [¶0098] " For example, any number, including zero, of the image batch 410, the filter stack 440, the offset sequence 640, and/or the output matrix 860 may be included in a frame buffer." Woolley explicitly teaches placing 0 into virtual image matrix offset and further that the offset may be included in the frame buffer.  Frame buffer is interpreted as synonymous with line buffer.).
	
	 Regarding claim 9, the combination of Woolley, and Santara teaches The computing device of claim 1, wherein the computing device comprises a cross-bar based computing device.(Woolley [¶0046] "GPCs 208 communicate with memory interface 214 via crossbar unit 210 to read from or write to various DRAMs 220" See also FIG. 2).
	
Claims 10-12, 14, 16, and 18 are substantially similar to claims 1-3, 5, 7, and 9, respectively.  Therefore, the rejections applied to claims 1-3, 5, 7, and 9 also apply to claims 10-12, 14, 16, and 18.

Claims 19-20 are substantially similar to claims 1-2.  Therefore, the rejections applied to claims 1-2 also apply to claims 19-20.  

	Claims 4, 6, 8, 13, 15, and 17 are rejected under 35 U.S.C. § 103 as being unpatentable over the combination of Woolley and Santara and Clemons (US 2017/0004089 A1).

	 Regarding claim 4, the combination of Woolley, and Santara teaches The computing device of claim 2, wherein the instructions further cause the computing device to: perform, in an nth clock cycle, the ith first operation on the first matrix to obtain the ith data element of the second matrix, (Woolley [¶0015] "The method includes selecting a first start address based on a first destination address included in a first image tile that is stored in a first memory; identifying a first offset based on the first destination address...and after copying the data, performing one or more matrix multiplication operations between the first image tile and a first filter tile." [¶0040] " In particular, CPU 102 issues commands that control the operation of PPU 202" See also FIG. 1, FIG. 2.  CPU 102 interpreted as synonymous with control device)
	perform, in an (n+t)th clock cycle, an (i+1)th first operation of the M first operations on the first matrix, wherein t is a positive integer greater than 1; and(Woolley [¶0012] "One drawback of tile-based convolution engines, however, is that calculating the address sequence needed to load the image data in the correct order to expand a tile of the expanded image matrix involves performing a sequence of dependent integer operations. This sequence of integer operations typically requires a relatively large number of clock cycles to execute. Oftentimes, the number of clock cycles required to perform the integer operations can exceed the number of clock cycles required to perform the matrix multiplication operations" n is not bounded by the limitations.  Variable t is interpreted as necessarily being an integer value since a fractional clock cycle would not be understood by one of ordinary skill in the art.  Therefore the limitation is interpreted as simply being performed after the first operation for the second matrix.)
	store, in at least one clock cycle of an (n+1)th clock cycle and the (n+t)th clock cycle, an element 0 in the first line buffer.(Woolley [¶0007] "a CNN typically includes multiple “convolution layers,” where each convolution layer performs convolution operations across multiple dimensions of a sample data batch and multiple dimensions of a filter stack" [¶0009] "the convolution engine converts the image batch into a column-major image matrix and expresses the filter stack as a filter matrix." [¶0012] "One drawback of tile-based convolution engines, however, is that calculating the address sequence needed to load the image data in the correct order to expand a tile of the expanded image matrix involves performing a sequence of dependent integer operations. This sequence of integer operations typically requires a relatively large number of clock cycles to execute. Oftentimes, the number of clock cycles required to perform the integer operations can exceed the number of clock cycles required to perform the matrix multiplication operations" Woolley explicitly teaches placing 0 into virtual image matrix offset and further that the offset may be included in the frame buffer.  Frame buffer is interpreted as synonymous with line buffer.).
	However, the combination of Woolley, and Santara does not explicitly teach wherein the ith data element of the second matrix is located in a last column of the second matrix, and wherein an (i+1)th data element of the second matrix is located at a starting location of a row next to a row in which the ith data element is located.

	Clemons, in the same field of endeavor, teaches wherein the ith data element of the second matrix is located in a last column of the second matrix, and wherein an (i+1)th data element of the second matrix is located at a starting location of a row next to a row in which the ith data element is located;([¶0047] "A patch may be specified by a data structure that identifies the patch relative to an origin of the digital image 300. The pixel data of the digital image 300 may be stored in row-major order in a contiguous group of memory addresses, either physical addresses or virtual addresses, and the patch data structure may include a first field that specifies an origin of the patch as a location of a particular pixel in the digital image 300" Starting location of a next row interpreted as synonymous with first column of a matrix.  Limitation interpreted as synonymous with reading matrix in row-major order.).

	The combination of Woolley, and Santara as well as Clemons are directed towards accessing segmented multi-dimensional matrices in a distributed system.  Therefore, the combination of Woolley, and Santara as well as Clemons are analogous art in the same field of endeavor.  It would have been obvious before the effective filing date of the claimed invention to substitute the column-major CNN matrices in the combination of Woolley and Santara with the row-major matrix representations in Clemons. The substitution would have been obvious because Woolley teaches that column-major order is an appropriate storage type for the convolution operations ([¶0009] “the convolution engine converts the image batch into a column-major image matrix and expresses the filter stack as a filter matrix.”).  Furthermore, Clemons teaches ([¶0003] “In conventional systems, a digital image is often stored in memory in row-major or col-major order.”) Therefore the substitution would be obvious to one of ordinary skill in the art.  This motivation for combination also applies to the remaining claims depending on this combination. 
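
	For illustration only, the difference between the row-major ordering relied upon from Clemons and the column-major ordering of Woolley may be sketched as follows.  The function name next_position and the 0-based (row, column) convention are hypothetical and are not taken from either reference.

```python
def next_position(row, col, rows, cols, order):
    """Return the 0-based (row, col) of the element that follows (row, col),
    or None after the final element."""
    if order == "row-major":
        if col + 1 < cols:
            return (row, col + 1)
        return (row + 1, 0) if row + 1 < rows else None  # wrap to start of next row
    if order == "column-major":
        if row + 1 < rows:
            return (row + 1, col)
        return (0, col + 1) if col + 1 < cols else None  # wrap to start of next column
    raise ValueError(f"unknown order: {order!r}")


# Element in the last column of row 0: the next element starts row 1.
after_last_column = next_position(0, 2, rows=3, cols=3, order="row-major")
# Element in the last row of column 0: the next element starts column 1.
after_last_row = next_position(2, 0, rows=3, cols=3, order="column-major")
print(after_last_column, after_last_row)
```

	Under row-major order the element following the last column is at the starting location of the next row (as in claim 4); under column-major order the element following the last row is at the starting location of the next column (as in claim 5).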

	Regarding claim 6, the combination of Woolley and Santara teaches The computing device of claim 3, wherein the instructions further cause the computing device to: perform, in an nth clock cycle, the ith first operation on the first matrix to obtain the ith data element of the second matrix; (Woolley [¶0015] "The method includes selecting a first start address based on a first destination address included in a first image tile that is stored in a first memory; identifying a first offset based on the first destination address...and after copying the data, performing one or more matrix multiplication operations between the first image tile and a first filter tile." [¶0040] "In particular, CPU 102 issues commands that control the operation of PPU 202". See also FIG. 1, FIG. 2.  CPU 102 is interpreted as synonymous with the control device.)
	perform, in an (n+t)th clock cycle, an (i+1)th first operation of the M first operations on the first matrix, wherein t is a positive integer greater than 1; and (Woolley [¶0012] "One drawback of tile-based convolution engines, however, is that calculating the address sequence needed to load the image data in the correct order to expand a tile of the expanded image matrix involves performing a sequence of dependent integer operations. This sequence of integer operations typically requires a relatively large number of clock cycles to execute. Oftentimes, the number of clock cycles required to perform the integer operations can exceed the number of clock cycles required to perform the matrix multiplication operations". The variable n is not bounded by the limitations.  The variable t is interpreted as necessarily being an integer value, since a fractional clock cycle would not be understood by one of ordinary skill in the art.  Therefore, the limitation is interpreted as simply being performed after the first operation for the second matrix.)
	store, in at least one clock cycle of an (n+1)th clock cycle and the (n+t)th clock cycle, an element 0 in the first line buffer. (Woolley [¶0007] "a CNN typically includes multiple “convolution layers,” where each convolution layer performs convolution operations across multiple dimensions of a sample data batch and multiple dimensions of a filter stack" [¶0009] "the convolution engine converts the image batch into a column-major image matrix and expresses the filter stack as a filter matrix." [¶0012] "One drawback of tile-based convolution engines, however, is that calculating the address sequence needed to load the image data in the correct order to expand a tile of the expanded image matrix involves performing a sequence of dependent integer operations. This sequence of integer operations typically requires a relatively large number of clock cycles to execute. Oftentimes, the number of clock cycles required to perform the integer operations can exceed the number of clock cycles required to perform the matrix multiplication operations". Woolley explicitly teaches placing 0 into the virtual image matrix offset and further that the offset may be included in the frame buffer.  The frame buffer is interpreted as synonymous with the line buffer.).
	However, the combination of Woolley and Santara does not explicitly teach wherein the ith data element of the second matrix is located in a last column of the second matrix, and wherein an (i+1)th data element of the second matrix is located at a starting location of a row next to a row in which the ith data element is located.

	Clemons, in the same field of endeavor, teaches wherein the ith data element of the second matrix is located in a last column of the second matrix, and wherein an (i+1)th data element of the second matrix is located at a starting location of a row next to a row in which the ith data element is located ([¶0047] "A patch may be specified by a data structure that identifies the patch relative to an origin of the digital image 300. The pixel data of the digital image 300 may be stored in row-major order in a contiguous group of memory addresses, either physical addresses or virtual addresses, and the patch data structure may include a first field that specifies an origin of the patch as a location of a particular pixel in the digital image 300". The starting location of a next row is interpreted as synonymous with the first column of a matrix.  The limitation is interpreted as synonymous with reading a matrix in row-major order.).

	The combination of Woolley and Santara, as well as Clemons, are directed towards accessing segmented multi-dimensional matrices in a distributed system.  Therefore, the combination of Woolley and Santara, as well as Clemons, are analogous art in the same field of endeavor.  It would have been obvious before the effective filing date of the claimed invention to substitute the column-major CNN matrices in the combination of Woolley and Santara with the row-major matrix representations in Clemons. The substitution would have been obvious because Woolley teaches that column-major order is an appropriate storage type for the convolution operations ([¶0009] “the convolution engine converts the image batch into a column-major image matrix and expresses the filter stack as a filter matrix.”).  Furthermore, Clemons teaches that row-major and column-major order are both conventional storage orders ([¶0003] “In conventional systems, a digital image is often stored in memory in row-major or col-major order.”).  Therefore, the substitution would have been obvious to one of ordinary skill in the art.  This motivation for combination also applies to the remaining claims depending on this combination.

	Regarding claim 8, the combination of Woolley, Santara, and Clemons teaches The computing device of claim 4, wherein t=(s−1)×(W+p)+(w−1), wherein the control device is configured to control, in the (n+1)th clock cycle to the (n+t)th clock cycle, the first line buffer to sequentially store (s−1)×(W+p)+(w−1) elements 0, and wherein s represents a sliding step of the first operation. (Woolley [¶0066] "the parameters 465 may include a padding height and a padding width. The padding height and the padding width append, respectively, rows of zeros and columns of zeros to output images". Any integer value of t is interpreted as conforming to the given equation.  Appending rows in row-major order or appending columns in column-major order for zero padding is interpreted as synonymous with sequentially storing 0 in a line buffer.).
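
	For illustration only, the arithmetic of the recited equation can be sketched as follows. The function name and the sample values are hypothetical. Only s is defined in the limitation quoted above (the sliding step); W, p, and w are carried through as the symbols recited in the claim, without assuming their meaning.

```python
def zero_element_count(s, W, p, w):
    """Evaluate t = (s - 1) * (W + p) + (w - 1) as recited in claim 8."""
    return (s - 1) * (W + p) + (w - 1)


# Hypothetical sample values, chosen only to exercise the formula.
t = zero_element_count(s=2, W=5, p=1, w=3)

# One element 0 stored per clock cycle from the (n+1)th to the (n+t)th cycle.
zeros_stored = [0] * t

print(t, len(zeros_stored))
```

	With these sample values, (2−1)×(5+1)+(3−1) yields t=8, so eight elements 0 would be stored sequentially.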

	Claims 13, 15, and 18 are substantially similar to claims 4, 6, and 8, respectively.  Therefore, the rejections applied to claims 4, 6, and 8 also apply to claims 13, 15, and 18.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Coates (“Deep learning with COTS HPC systems”, 2013) is directed towards model parallelism in artificial neural networks.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720.  The examiner can normally be reached on M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on (571)270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/SB/Examiner, Art Unit 2124



/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124