DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1 and 13 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Ross et al USPN 9,805,303.
Regarding claims 1 and 13
Ross et al teaches 
accessing, from a buffer, a flattened input stream that includes a set of parallel vectors, each vector representing a set of input values of a unique kernel-sized tile of an input tensor that is to be convolved with a kernel to generate an output activation (column 2, line 1, in general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of computing a layer output for a convolutional neural network layer from a layer input for the convolutional neural network layer using a two-dimensional systolic array, the convolutional neural network layer having a plurality of kernels, each kernel having a respective matrix structure of weights, the method comprising: receiving a plurality of activation inputs, the plurality of activation inputs represented as a multi-dimensional matrix; forming a plurality of vector inputs from the plurality of activation inputs, each vector input comprising values from a distinct region within the multi-dimensional matrix; sending the plurality of vector inputs to one or more cells along a first dimension of the systolic array; generating a plurality of rotated kernel structures from each of the plurality of kernels, where generating a particular rotated kernel structure comprises shifting elements in the respective matrix structure for the kernel along one dimension; sending each kernel structure and each rotated kernel structure to one or more cells along a second dimension of the systolic array; causing the systolic array to generate an accumulated output based on the plurality of value inputs and the plurality of kernels; and generating the layer output from the accumulated output);

receiving an expanded kernel generated by permuting values from the kernel, the expanded kernel having vectors that each correspond to an output value position of a kernel-sized tile of the output activation (column 10, line 28, in some implementations, a neural network model has a stride parameter greater than one. The processor can perform computations with the stride parameter by converting matrix structures of activation input and weight inputs to respective permuted matrix structures having a larger feature dimension and smaller spatial dimensions. In some implementations, when processing images, the processor permutes, i.e., remaps, the activation matrix structure to have the following size: CEIL (X/X_stride)×CEIL (Y/Y_stride)×(Sizeof(RGB)*X_stride*Y_stride), where X and Y are the size of the matrix structure dimensions, X_stride and Y_stride are the stride parameters, and Sizeof(RGB) is three. The kernel matrix structure can also be permuted using the same formula. For example, if the stride parameter is 2×2, the activation matrix structure is originally 170×170×3 and the kernel matrix structure is 7×7×3, the permuted activation matrix structure can be 85×85×12 and the permuted kernel matrix structure can be 4×4×12. The coordinates of the activation and kernel matrix structures can be mapped to permuted coordinates using the following formula: [CEIL (X/2), CEIL (Y/2), Z+3*(X % 2)+6*(Y % 2)], where X, Y, and Z represent a coordinate in the respective matrix structure. Other formulas can include [CEIL (X/2), CEIL (Y/2), Z+3*(Y % 2)+6*(X % 2)] or [CEIL (X/2), CEIL (Y/2), 2*Z+(X % 2)+6*(Y % 2)]);
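For illustration only, the dimensions quoted from Ross et al above can be checked with a short Python sketch. The function names permuted_shape and permuted_coord are hypothetical and do not appear in the reference. The coordinate function assumes one-based X and Y and zero-based Z, which is one reading under which the quoted formula maps every coordinate to a distinct permuted position.

```python
import math

def permuted_shape(x, y, channels, x_stride, y_stride):
    """Size formula quoted above:
    CEIL(X/X_stride) x CEIL(Y/Y_stride) x (channels * X_stride * Y_stride)."""
    return (
        math.ceil(x / x_stride),
        math.ceil(y / y_stride),
        channels * x_stride * y_stride,
    )

def permuted_coord(x, y, z):
    """First coordinate formula quoted above, for a 2x2 stride and 3 channels.
    Assumption: x and y are one-based, z is zero-based."""
    return (math.ceil(x / 2), math.ceil(y / 2), z + 3 * (x % 2) + 6 * (y % 2))

activation_shape = permuted_shape(170, 170, 3, 2, 2)
kernel_shape = permuted_shape(7, 7, 3, 2, 2)
print(activation_shape)  # (85, 85, 12)
print(kernel_shape)      # (4, 4, 12)
```

Under these assumptions the sketch reproduces the 85×85×12 and 4×4×12 sizes stated in the quoted passage.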

receiving a control pattern that includes a set of vectors, each vector corresponding to the output value position for the kernel-sized tile of the output activation, each vector including delay values that indicate a parallel vector of the flattened input stream to access input values for the convolution (column 7, line 65, In order to effectively perform convolution calculations using the systolic array, the neural network processor parallelizes matrix multiplications having large dimensional spaces, which are generally required for convolution calculations. In particular, the neural network processor can “flatten” matrices. By way of illustration, the neural network process can flatten a set of activation inputs. For example, the set of activation inputs can be represented as a 3D matrix. The 3D matrix can be visualized as a stack of 2D matrices. Each 2D matrix can then be sent to a row of the systolic array. Kernels can then be sent to columns of the systolic array, and the systolic array can then use the kernels to perform numerous calculations on each 2D matrix at once, thereby parallelizing a convolution computation. This will be described further below in reference to FIGS. 6-8. FIG. 6 shows an example matrix structure 600 having spatial dimensions and a feature dimension. The matrix structure 600 can represent either a set of activation inputs or a set of weight inputs. A matrix structure for a set of activation inputs will be referred to in this specification as an activation matrix structure, and a matrix structure for a set of weight inputs will be referred to in this specification as a kernel matrix structure. The matrix structure 600 has three dimensions: two spatial dimensions and one feature dimension);

generating, using a hardware accelerated processor, for each output value position of each kernel-sized tile of the output activation, a dot product between a first vector that includes values of the flattened input stream as selected by the delay values of the corresponding vector of the control pattern, and a second vector corresponding to a vector in the expanded kernel corresponding to the output value position (column 2, line 25, implementations can include one or more of the following features. The first dimension of the systolic array corresponds to rows of the systolic array, and where the second dimension of the systolic array corresponds to columns of the systolic array. Sending the plurality of vector inputs to one or more cells comprises: sending, for a particular row of the systolic array, a respective element from each vector input to the particular row; and selecting, at each cell in the particular row, one of the respective elements for storage in a register in the cell based on a multiplexor control signal. Sending the plurality of vector inputs to one or more cells along a first dimension of the systolic array comprises: sending each vector input to a distinct series of shift registers, each shift register shifting an element of the vector input to a subsequent shift register on a subsequent clock cycle, each shift register corresponding to a respective row in the systolic array; and selecting, for each row, an output from the corresponding shift registers for use in the row. Forming a plurality of vector inputs from the plurality of activation inputs is based on a size of a particular kernel structure, further comprising: overlapping the particular kernel structure with the matrix representation of the plurality of activation inputs to form a first vector input from elements in the matrix representation; forming one or more other vector inputs from other elements that surround the overlapped particular kernel structure. 
Generating the layer output from the accumulated output comprises normalizing and pooling the accumulated output to generate the layer output. Sending the plurality of vector inputs to one or more cells along a first dimension of the systolic array comprises: at a particular clock cycle, storing a first vector input in the plurality of vector inputs in a first cell of the systolic array; and at a subsequent clock cycle, shifting the first vector input in the first cell to a second cell that is adjacent to the first cell and storing a second vector input in the plurality of vector inputs in the first cell) and  (column 8, line 66, the neural network processor “flattens” the matrix structure 600 before sending portions of the structure 600 to rows of the systolic array, as described above. That is, the neural network processor can split up the depth layers 702 of the matrix structure 600, e.g., depth layers 602, 604, and 606 of FIG. 6, and send each depth layer to a distinct cell. In some implementations, each depth layer is sent to a cell on a different row of the systolic array 706. For example, the processor can send the activation inputs from a first depth layer, e.g., a matrix of nine ‘1’ activation inputs, to a left-most cell at a first row of the systolic array 706, a second depth layer, e.g., a matrix of nine ‘2’ activation inputs, to a left-most cell at a second row, a third depth layer, e.g., a matrix of nine ‘3’ activation inputs, to a left-most cell at a third row, and so on. The given layer can have multiple kernels, e.g., Kernels A-D 710. Kernels A-D 710 can have matrix structures of dimension 3×3×10. The processor can send each kernel matrix structure to a cell at a distinct column of the systolic array 706. For example, Kernel A can be sent to a top cell in a first column, Kernel B can be sent to a top cell in a second column, and so on).
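For illustration only, the general idea of computing each convolution output value as a dot product between a flattened kernel-sized window of the input and a flattened kernel can be sketched as follows. This is the conventional window-by-window formulation, assuming stride one and no kernel flip. It is not the control-pattern and delay-value mechanism recited in the claim, nor the systolic array of Ross et al. The function name conv2d_as_dot_products is hypothetical.

```python
def conv2d_as_dot_products(matrix, kernel):
    """Compute a stride-1, unpadded 2D convolution (no kernel flip) where every
    output value is the dot product of a flattened window and the flattened kernel."""
    kh, kw = len(kernel), len(kernel[0])
    # Flatten the kernel once, in row-major order.
    flat_kernel = [kernel[i][j] for i in range(kh) for j in range(kw)]
    output = []
    for r in range(len(matrix) - kh + 1):
        out_row = []
        for c in range(len(matrix[0]) - kw + 1):
            # Flatten the kernel-sized window in the same row-major order.
            window = [matrix[r + i][c + j] for i in range(kh) for j in range(kw)]
            out_row.append(sum(a * b for a, b in zip(window, flat_kernel)))
        output.append(out_row)
    return output

result = conv2d_as_dot_products(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0], [0, 1]],
)
print(result)  # [[6, 8], [12, 14]]
```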

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2-7, 9-10, 13-18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ross et al USPN 9,805,303 in view of Baskaran et al USPN 10,936,569.

Regarding claims 2 and 14
Ross et al teaches wherein the kernel has a plurality of filters, and wherein each channel of the input tensor is convolved with one or more filters of the kernel to generate an output activation with a plurality of output features, but does not explicitly teach that the input tensor has a plurality of channels. However, Baskaran et al teaches this limitation (column 3, line 58, the present invention presents various embodiments for addressing the aforementioned major challenges in computations involving sparse tensor or multi-dimensional array data. In one embodiment, new sparse tensor storage formats that provide memory storage benefits and performance benefits while executing sparse tensor computations are presented. One format is called “mode-generic” sparse tensor format that is a generic representation of tensor to conveniently store sparse and semi-sparse tensors. Another format is called “mode-specific” sparse format that is a special form of the generic representation that is suitable for performing computations along a specific mode or dimension of the tensor. These new sparse tensor storage formats may not only store the tensor data using less memory than the conventional techniques, but also arrange the data in the memory in such a manner that the data elements likely to be accessed frequently during a certain time period, or while computing a portion of a large computation, are stored relatively close to each other in the memory, so as to improve data locality in sparse tensor computations and to reduce unnecessary memory storage in the process of large data computations). Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a tensor as the data input.
The modification would have been obvious because one of ordinary skill in the art would have been motivated to apply this teaching to deep learning, which operates on numerical data, so that the information can be used within the system to perform highly parallel computation.

Regarding claims 3 and 15
Baskaran et al teaches 
padding the input tensor with padding values such that positions of values of the output of the convolution of the input tensor have matching positions to all positions of values in the input tensor, and such that the size of the output of the convolution matches the size of the input tensor (column 5, line 43, in another aspect, various embodiments feature an article of manufacture that includes instructions to configure a processor, a method, and/or a system that facilitate data reuse in a tensor transform. The tensor has at least three modes, and the tensor transform includes a number of iterations. The system includes a memory and a processor in electronic communication with the memory. Each of the memory and the processor include various types of storage devices and computing devices, respectively, as described above. The processor included in the system, or configured by the instructions on the article of manufacture and/or the method, is configured to perform a first iteration that includes a first operation (e.g., an n-Mode matrix product) on the tensor to obtain a first intermediate result. The first intermediate result includes a first intermediate-tensor. The first intermediate result is stored in the memory, and the processor is configured to perform a second iteration that includes a second operation on the first intermediate result accessed from the memory. Because the second operation is performed on the intermediate result accessed from the memory, a third operation is avoided. For a required computation involving the tensor, the third operation is required if the first intermediate result is not accessed from the memory).
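For illustration only, the size relationship recited in the limitation above (output size equal to input size) can be sketched in Python. The sketch assumes a stride of one and splits the padding between the leading and trailing edges; neither assumption is drawn from the cited references. The function names same_padding and conv_output_size are hypothetical.

```python
def same_padding(kernel_size):
    """Padding (before, after) along one dimension so that a stride-1
    convolution leaves that dimension's size unchanged."""
    total = kernel_size - 1
    before = total // 2
    return before, total - before

def conv_output_size(input_size, kernel_size, pad_before, pad_after):
    """Output size of a stride-1 convolution along one dimension."""
    return input_size + pad_before + pad_after - kernel_size + 1

before, after = same_padding(7)
print(before, after)                             # 3 3
print(conv_output_size(170, 7, before, after))   # 170
```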

Regarding claims 4 and 16
Ross et al teaches 
 padding the input tensor with padding values such that the size of each dimension of the input tensor is a whole number multiple of the corresponding dimension of the kernel (column 2, line 25, Implementations can include one or more of the following features. The first dimension of the systolic array corresponds to rows of the systolic array, and where the second dimension of the systolic array corresponds to columns of the systolic array. Sending the plurality of vector inputs to one or more cells comprises: sending, for a particular row of the systolic array, a respective element from each vector input to the particular row; and selecting, at each cell in the particular row, one of the respective elements for storage in a register in the cell based on a multiplexor control signal. Sending the plurality of vector inputs to one or more cells along a first dimension of the systolic array comprises: sending each vector input to a distinct series of shift registers, each shift register shifting an element of the vector input to a subsequent shift register on a subsequent clock cycle, each shift register corresponding to a respective row in the systolic array; and selecting, for each row, an output from the corresponding shift registers for use in the row. Forming a plurality of vector inputs from the plurality of activation inputs is based on a size of a particular kernel structure, further comprising: overlapping the particular kernel structure with the matrix representation of the plurality of activation inputs to form a first vector input from elements in the matrix representation; forming one or more other vector inputs from other elements that surround the overlapped particular kernel structure. Generating the layer output from the accumulated output comprises normalizing and pooling the accumulated output to generate the layer output. 
Sending the plurality of vector inputs to one or more cells along a first dimension of the systolic array comprises: at a particular clock cycle, storing a first vector input in the plurality of vector inputs in a first cell of the systolic array; and at a subsequent clock cycle, shifting the first vector input in the first cell to a second cell that is adjacent to the first cell and storing a second vector input in the plurality of vector inputs in the first cell).
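For illustration only, padding each dimension up to a whole-number multiple of the corresponding kernel dimension, as recited in claims 4 and 16, can be sketched in Python. The sketch assumes a two-dimensional input, zero fill values, and padding applied at the trailing edges; these choices and the function names padding_needed and pad_2d are hypothetical and are not drawn from Ross et al.

```python
def padding_needed(size, kernel_size):
    """Number of padding values that rounds `size` up to the next
    multiple of `kernel_size` (zero if it is already a multiple)."""
    return (-size) % kernel_size

def pad_2d(matrix, kernel_h, kernel_w, fill=0):
    """Pad the trailing edge of each dimension of a 2D list so that both
    dimensions are whole-number multiples of the kernel dimensions."""
    rows, cols = len(matrix), len(matrix[0])
    pad_rows = padding_needed(rows, kernel_h)
    pad_cols = padding_needed(cols, kernel_w)
    padded = [list(row) + [fill] * pad_cols for row in matrix]
    padded += [[fill] * (cols + pad_cols) for _ in range(pad_rows)]
    return padded

padded = pad_2d([[1] * 5 for _ in range(5)], 3, 3)
print(len(padded), len(padded[0]))  # 6 6
print(padding_needed(170, 7))       # 5
```

With the 170×170 activation and 7×7 kernel sizes mentioned in Ross et al, this rule would add five values per spatial dimension, giving 175, which is 25 times 7.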

Regarding claims 5 and 17
Ross et al teaches 
 padding a trailing edge of each dimension of the input tensor with padding values having a width equal to the size of the kernel in the corresponding dimension (see also claim 12 and (column 2, line 26 implementations can include one or more of the following features. The first dimension of the systolic array corresponds to rows of the systolic array, and where the second dimension of the systolic array corresponds to columns of the systolic array. Sending the plurality of vector inputs to one or more cells comprises: sending, for a particular row of the systolic array, a respective element from each vector input to the particular row; and selecting, at each cell in the particular row, one of the respective elements for storage in a register in the cell based on a multiplexor control signal. Sending the plurality of vector inputs to one or more cells along a first dimension of the systolic array comprises: sending each vector input to a distinct series of shift registers, each shift register shifting an element of the vector input to a subsequent shift register on a subsequent clock cycle, each shift register corresponding to a respective row in the systolic array; and selecting, for each row, an output from the corresponding shift registers for use in the row. Forming a plurality of vector inputs from the plurality of activation inputs is based on a size of a particular kernel structure, further comprising: overlapping the particular kernel structure with the matrix representation of the plurality of activation inputs to form a first vector input from elements in the matrix representation; forming one or more other vector inputs from other elements that surround the overlapped particular kernel structure. Generating the layer output from the accumulated output comprises normalizing and pooling the accumulated output to generate the layer output. 
Sending the plurality of vector inputs to one or more cells along a first dimension of the systolic array comprises: at a particular clock cycle, storing a first vector input in the plurality of vector inputs in a first cell of the systolic array; and at a subsequent clock cycle, shifting the first vector input in the first cell to a second cell that is adjacent to the first cell and storing a second vector input in the plurality of vector inputs in the first cell)).

Regarding claims 6 and 18
Ross et al teaches 
the flattened input stream is generated by, for each of the one or more kernel-sized tiles of the input tensor, accessing the values of the tile in a defined order, arranging the values in a vector according to the defined order, and arranging the one or more vectors corresponding to each of the one or more tiles in a parallel arrangement to generate the parallel vectors of the flattened input stream (see FIGS. 8-11 and column 15, line 30, Returning to the description of FIG. 10, the system generates a layer output from the accumulated outputs (step 1014). The accumulated outputs can be sent to a vector computation unit, e.g., the vector computation unit described in reference to FIG. 3. The vector computation unit can process the accumulated outputs and generate the layer output, e.g., as described above in reference to FIG. 2. The layer output can be sent and stored at the unified buffer. The systolic array continues the convolution calculations over the entire set of activation inputs, i.e., the entire 170×170 image. In some implementations, the convolution calculations are performed in a pseudo-rasterized order. That is, because convolution calculations are performed in parallel, performing convolution calculations in a normal raster order can cause convolution calculations to be repeated, which would be inefficient. Instead, the neural network processor can proceed in an order from left to right, top to down order that skips convolution calculations that have already been performed in previous parallel convolution calculations. In effect, the processor can output chunks at a time, as opposed to single outputs in a normal raster order). Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the arranging of the tiles or matrices.
The modification would have been obvious because one of ordinary skill in the art would have been motivated to transform the matrices and arrange them as needed for parallel processing, achieving efficiency by flattening the input stream of the data to be processed.
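For illustration only, the arrangement recited in claims 6 and 18 can be sketched in Python: the input is split into kernel-sized tiles, each tile is read in a defined order into a vector, and the vectors are collected side by side. The sketch assumes a two-dimensional input whose dimensions are already multiples of the kernel dimensions, non-overlapping tiles, and row-major order within each tile. The function name flatten_tiles is hypothetical and is not taken from Ross et al or the application.

```python
def flatten_tiles(matrix, kernel_h, kernel_w):
    """Split a 2D list into non-overlapping kernel-sized tiles and flatten
    each tile in row-major order, returning one vector per tile."""
    rows, cols = len(matrix), len(matrix[0])
    if rows % kernel_h or cols % kernel_w:
        raise ValueError("input dimensions must be multiples of the kernel dimensions")
    vectors = []
    for r in range(0, rows, kernel_h):
        for c in range(0, cols, kernel_w):
            vectors.append(
                [matrix[r + i][c + j] for i in range(kernel_h) for j in range(kernel_w)]
            )
    return vectors

# 4x4 input holding the values 0..15, split into four 2x2 tiles.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
print(flatten_tiles(grid, 2, 2))
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```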

Regarding claim 7
Ross et al teaches 
wherein the defined order is at least one of a row-major order, a column-major order, and an aisle-major order, wherein the aisle-major order accesses elements in a three-dimensional (3D) tile first along an axis corresponding to the depth of the 3D tile and subsequently along axes corresponding to the width and height of the 3D tile (column 8, line 13, FIG. 6 shows an example matrix structure 600 having spatial dimensions and a feature dimension. The matrix structure 600 can represent either a set of activation inputs or a set of weight inputs. A matrix structure for a set of activation inputs will be referred to in this specification as an activation matrix structure, and a matrix structure for a set of weight inputs will be referred to in this specification as a kernel matrix structure. The matrix structure 600 has three dimensions: two spatial dimensions and one feature dimension).
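For illustration only, the three access orders named in claim 7 can be sketched in Python for a small 3D tile indexed as tile[height][width][depth]. The claim defines only the aisle-major order (depth first, then width and height). The readings of row-major and column-major used here, applied one depth layer at a time, are assumptions, and the function names are hypothetical.

```python
def row_major(tile):
    """Assumed reading: per depth layer, walk each row left to right."""
    h, w, d = len(tile), len(tile[0]), len(tile[0][0])
    return [tile[i][j][k] for k in range(d) for i in range(h) for j in range(w)]

def column_major(tile):
    """Assumed reading: per depth layer, walk each column top to bottom."""
    h, w, d = len(tile), len(tile[0]), len(tile[0][0])
    return [tile[i][j][k] for k in range(d) for j in range(w) for i in range(h)]

def aisle_major(tile):
    """Depth axis varies fastest, then width, then height."""
    h, w, d = len(tile), len(tile[0]), len(tile[0][0])
    return [tile[i][j][k] for i in range(h) for j in range(w) for k in range(d)]

# 2x2x2 tile whose values encode their own position as 100*h + 10*w + d.
tile = [[[100 * i + 10 * j + k for k in range(2)] for j in range(2)] for i in range(2)]
print(row_major(tile))     # [0, 10, 100, 110, 1, 11, 101, 111]
print(column_major(tile))  # [0, 100, 10, 110, 1, 101, 11, 111]
print(aisle_major(tile))   # [0, 1, 10, 11, 100, 101, 110, 111]
```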

Regarding claim 9
Ross et al teaches 
wherein each single dimensional vector is a unique vector that is at least one of a row of the kernel, a column of the kernel, a diagonal of the kernel, and an aisle of the kernel, wherein an aisle of a kernel is a vector of the kernel aligned along an axis corresponding to a depth (a third dimension) of the kernel (column 8, line 29, the feature dimension corresponds to features from an activation input. Each feature dimension can have depth levels; for example, the matrix structure 600 has depth levels 602, 604, and 606. By way of illustration, if matrix structure 600 represents a 3×3×3 image sent as a set of activation inputs to a first layer, the X and Y dimensions of the image (3×3) can be the spatial dimensions, and the Z dimension (3) can be the feature dimension corresponding to R, G, and B values. That is, depth level 602 can correspond to a feature of nine ‘1’ activation inputs, e.g., red values, depth level 604 can correspond to a feature of nine ‘2’ activation inputs, e.g., green values, and depth level 606 can correspond to a feature of nine ‘3’ activation inputs, e.g., blue values).

Regarding claims 10 and 20
Ross et al teaches 
the control pattern includes a plurality of vectors, the number of vectors of the plurality of vectors corresponding to a number of output value positions in a kernel-sized tile of the output activation (column 1, line 65, in general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of computing a layer output for a convolutional neural network layer from a layer input for the convolutional neural network layer using a two-dimensional systolic array, the convolutional neural network layer having a plurality of kernels, each kernel having a respective matrix structure of weights, the method comprising: receiving a plurality of activation inputs, the plurality of activation inputs represented as a multi-dimensional matrix; forming a plurality of vector inputs from the plurality of activation inputs, each vector input comprising values from a distinct region within the multi-dimensional matrix; sending the plurality of vector inputs to one or more cells along a first dimension of the systolic array; generating a plurality of rotated kernel structures from each of the plurality of kernels, where generating a particular rotated kernel structure comprises shifting elements in the respective matrix structure for the kernel along one dimension; sending each kernel structure and each rotated kernel structure to one or more cells along a second dimension of the systolic array; causing the systolic array to generate an accumulated output based on the plurality of value inputs and the plurality of kernels; and generating the layer output from the accumulated output).

Allowable Subject Matter
Claims 8, 11-12 and 19 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Relevant Prior Art
US 10776110 B2 Pearce et al teaches Apparatus And Method For Adaptable And Efficient Lane-wise Tensor Processing
US 5179702 A Spix et al teaches System And Method For Controlling A Highly Parallel Multiprocessor Using An Anarchy Based Scheduler For Parallel Execution Thread Scheduling
US 5146543 A Vassiliadis et al teaches Scalable Neural Array Processor

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Anil Khatri whose telephone number is (571)272-3725. The examiner can normally be reached M-F 8:30-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, W Zhen, can be reached at 571-272-3708. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ANIL KHATRI/
Primary Examiner, Art Unit 2191