DETAILED ACTION

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1 - 20 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by DEISHER et al (US 2018/0121796).
As to claim 1, DEISHER et al teaches a method for classifying information using a fully-connected layer of a convolutional neural network (paragraph [0044]...spoken utterance classification may be based on deep neural networks (DNNs) as described herein. Such neural networks may be, or may have layers of, convolutional neural networks (CNNs)), the method comprising, at a computing device: 
receiving (paragraph [0046]...NN Accelerator (NNA) 202)  a two-dimensional input matrix that includes a plurality of elements (paragraph [0066]...the input elements in the input vector), wherein each row of the two-dimensional input matrix (paragraph [0053]... a two dimensional matrix with the input per iteration being a column in the matrix and in sequential order in memory, and the rows being one element per input, and changed to a structure arranged so that a set of inputs of the same iteration can be executed at once which is practically using a column of the matrix) corresponds to a batch of elements (paragraph [0163]... neural network layer input can be viewed as a 2D matrix. One of the dimensions is the input vector length and the other dimension is the grouping factor (i.e., batch size) where each group forms a different output of a layer); 
a two-dimensional weight matrix (paragraph [0034]...a weight matrix) corresponding to the two-dimensional input matrix, the two-dimensional weight matrix including a plurality of weight values (paragraph [0048]...external memory 248 also may have one or more pre-allocated NN buffers (or application buffers) 256 including buffers for a matrix of input values, weights, scale factors, bias values, and other constants. These NN buffers 256 initially hold the data for the neural network before running the neural network or at least before a layer associated with the data is being processed. Eventually, the data in the NN buffers 256 are read by the NNA 202 to be placed into the internal buffers 238 to be used to compute NN outputs as explained below. The data for each layer in the NN buffers 256, such as the input values, scale factors, weights, and other data, also may be pre-ordered in the NN buffers 256, such as in pre-ordered single or two dimensional arrays);
identifying a first block of elements (paragraph [0044]...the components (logic elements) of individual or each logic block are arranged to give a programmer the option to use weights with different bit lengths and a scale factor may be applied to the weights depending on the bit length of the weights as explained herein) of the two-dimensional input matrix; 
loading a first weight block of the two-dimensional weight matrix (paragraph [0061]...the input array may be provided in a de-interleaved form or an interleaved form. In most cases, the input array will be provided in an interleaved form. When a neural network has an RNN layer, in this case, the de-interleaved form may be provided. In the de-interleaved form, and when the memory uses row-major storage, the input elements are divided into groups along rows, and as shown in FIG. 16, where input array 1600 is shown in de-interleaved form. In this case, the memory stores the groups group after group. Thus, when the input array is uploaded from external memory to the input buffer at internal memory 314, the data of a first group is loaded, or at least as much as will fit in the input buffer, and then the next group, and so on. Again, this may be used only in the case of an RNN layer where the order of the processing of the layers in the neural network is important, by one example); 
calculating a first partial output (paragraph [0032]...the NNA also may provide partial (subset) output computation-supporting active state lists processing such that a selected portion of a layer that provides outputs for less than all of the nodes on a neural network layer may be processed when processing of the entire layer is not desired) for the first block of elements by performing a first dot product operation (paragraph [0080]...an MAC 401 is shown to determine a dot product (or sum output) of the weights and input values for a single node or output of a layer of a neural network. The MAC 401 may have mathematically and/or logically parallel logic blocks 402-0 to 402-N (or generally referred to as logic blocks 402), and by one example, 48 logic blocks are provided but more or less may be provided. The logic blocks 402 are fixed function hardware logic blocks formed of well understood transistors or other semiconductor components. Fixed function here refers to the use of an MAC 401 with particular logic components or elements in an arrangement that does not change) using a first row of elements of the first block of elements and the first weight block (paragraph [0082]... neural network propagation, the input to the MAC 401 may include an input set or feature vector from the input buffer and from the input array as explained above, a weight vector from the weight buffer, and a scale factor. These are all used to compute a single output (for a single node)), wherein the first row of elements of the first block of elements corresponds to a first batch of elements (paragraph [0162]...one of the dimensions is the input vector length and the other dimension is the grouping factor (i.e., batch size) where each group forms a different output of a layer. Thus, the transpose layer groups input data from multiple groups into a single array so that this array can be fetched from memory together thereby reducing the number of memory transactions);
storing the first partial output (paragraph [0067]...the intermediate sum stored in the sum buffer 326 is saved to memory as an intermediate sum to allow handling of additional outputs); 
and generating a first output element using the first partial output for the first block of elements and at least one other partial output corresponding to the first batch of elements (paragraph [0076]...the weighted inputs are computed in parallel, and are provided to an accumulator section of the MAC 319 to provide a single weighted input sum (also referred to as a sum output (or more likely a partial sum when more than one input set is included in an input vector (group) or more than one iteration is provided))).
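For illustration only, the steps recited in claim 1 (identifying a block of input elements, loading the matching weight block, computing one partial output per row by a dot product, storing it, and combining partial outputs per batch) can be sketched as below. This is a minimal sketch and is taken neither from DEISHER et al nor from the application; the function name `fully_connected_blocked`, the parameter `block_width`, and the list-of-lists layout are assumptions made for the example.

```python
def fully_connected_blocked(inputs, weights, block_width):
    """Blocked fully-connected layer (hypothetical helper, for illustration).

    inputs:  list of rows, one row per batch, each of length n_inputs
    weights: list of n_inputs rows, each of length n_outputs
    """
    n_batches = len(inputs)
    n_inputs = len(weights)
    n_outputs = len(weights[0])

    # Stored partial outputs: one list per (batch, output element).
    partials = [[[] for _ in range(n_outputs)] for _ in range(n_batches)]

    for start in range(0, n_inputs, block_width):
        stop = min(start + block_width, n_inputs)
        # "Load" the weight block; its row count equals the block's column count.
        weight_block = weights[start:stop]
        for b in range(n_batches):
            # One row of the input block corresponds to one batch.
            block_row = inputs[b][start:stop]
            for j in range(n_outputs):
                # Dot product of the block row with column j of the weight block.
                partial = sum(x * w_row[j] for x, w_row in zip(block_row, weight_block))
                partials[b][j].append(partial)

    # Each output element is the sum of the partial outputs for its batch.
    return [[sum(p) for p in row] for row in partials]


inputs = [[1, 2, 3, 4], [5, 6, 7, 8]]
weights = [[1, 0], [0, 1], [1, 1], [2, 0]]
print(fully_connected_blocked(inputs, weights, 2))  # [[12, 5], [28, 13]]
```

With `block_width` set to 2, each output element in this sketch is formed from two stored partial outputs, one per block.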

As to claim 2, DEISHER et al teaches the method, wherein a number of rows in the first weight block corresponds to a number of columns in the first block of elements (paragraph [0070]...the weight matrix may be a 2-D matrix in this case with one row for each filter and one column for each filter element. By another format).

As to claim 3, DEISHER et al teaches the method, wherein the two-dimensional weight matrix is arranged in column-major order (paragraph [0069]...the weight matrix held at the weight buffer 320 may represent slightly different values depending on the type of layer that is being processed. For affine and recurrent layers, the weight matrix may have one row for each input to a layer and column for each output (or node) of the layer that is to be obtained. This assumes row major organization of memory. It is possible to use the transverse with a column major organization instead. This may be kept consistent for any of the arrays that provide a row or column major option).
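For illustration only, the difference between row-major and column-major storage referred to above can be sketched as follows. This is a minimal sketch under assumed names (`flat_index`, `to_column_major`) and does not reproduce any code from DEISHER et al or the application.

```python
def flat_index(row, col, n_rows, n_cols, column_major):
    """Offset of element (row, col) in a flat buffer (hypothetical helper)."""
    if column_major:
        # Elements of one column are contiguous in memory.
        return col * n_rows + row
    # Row-major: elements of one row are contiguous in memory.
    return row * n_cols + col


def to_column_major(matrix):
    """Flatten a list-of-rows matrix into a column-major buffer."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(n_cols) for r in range(n_rows)]


matrix = [[1, 2, 3], [4, 5, 6]]
buffer = to_column_major(matrix)
print(buffer)  # [1, 4, 2, 5, 3, 6]
print(buffer[flat_index(1, 2, 2, 3, column_major=True)])  # 6
```

In the column-major buffer, all weights of one column sit next to each other, so a single column can be read as one contiguous run.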

As to claim 4, DEISHER et al teaches the method, wherein the method is implemented by at least one processor (paragraph [0259]...a processor, may provide the functionality described) of the computing device, and the at least one processor includes a vector processing unit (paragraph [0037]...processing operations such as weight functions, feature vector stacking and transformations).

As to claim 5, DEISHER et al teaches the method, further comprising, in response to storing the first partial output (paragraph [0067]...the intermediate sum stored in the sum buffer 326 is saved to memory as an intermediate sum to allow handling of additional outputs) for the first block of elements reloading the first weight block (paragraph [0145]...fully connected layers are operating on an interleaved array, where multiple groups of data (each group from a different output) are interleaved to improve efficiency of memory bandwidth via re-use of the weight matrix read for all groups).

As to claim 6, DEISHER et al teaches the method, further comprising, calculating a second partial output (paragraph [0032]...the NNA also may provide partial (subset) output computation-supporting active state lists processing such that a selected portion of a layer that provides outputs for less than all of the nodes on a neural network layer may be processed when processing of the entire layer is not desired) for the first block of elements (paragraph [0044]...the components (logic elements) of individual or each logic block are arranged to give a programmer the option to use weights with different bit lengths and a scale factor may be applied to the weights depending on the bit length of the weights as explained herein) by performing a second dot product operation (paragraph [0080]...an MAC 401 is shown to determine a dot product (or sum output) of the weights and input values for a single node or output of a layer of a neural network. The MAC 401 may have mathematically and/or logically parallel logic blocks 402-0 to 402-N (or generally referred to as logic blocks 402), and by one example, 48 logic blocks are provided but more or less may be provided. The logic blocks 402 are fixed function hardware logic blocks formed of well understood transistors or other semiconductor components. Fixed function here refers to the use of an MAC 401 with particular logic components or elements in an arrangement that does not change) using a second row of elements of the first block of elements and the first weight block, wherein the second row of elements of the first block of elements corresponds to a second batch of elements (paragraph [0223]... the process 1020 may include "accumulate weighted inputs with accumulator circuit to obtain sum for an output" 1038, and as explained above, by a tree structure of adders to obtain a single sum (or dot-product) of weighted inputs for a single output referred to as a sum output herein. When the input vector has a number of input values that is the same or less than the number of parallel logic blocks, the sum output is a final sum output).

As to claim 7, DEISHER et al teaches the method, further comprising, generating a second output element using the second partial output for the first block of elements and at least one other partial output corresponding to the second batch of elements (paragraph [0076]...the weighted inputs are computed in parallel, and are provided to an accumulator section of the MAC 319 to provide a single weighted input sum (also referred to as a sum output (or more likely a partial sum when more than one input set is included in an input vector (group) or more than one iteration is provided))).

Claim 9 has similar limitations as claim 1. Therefore, the claim is rejected for the same reasons as above. 

Claim 10 has similar limitations as claim 2. Therefore, the claim is rejected for the same reasons as above. 

Claim 11 has similar limitations as claim 3. Therefore, the claim is rejected for the same reasons as above. 

Claim 12 has similar limitations as claim 4. Therefore, the claim is rejected for the same reasons as above. 


Claim 13 has similar limitations as claim 5. Therefore, the claim is rejected for the same reasons as above. 

Claim 14 has similar limitations as claim 6. Therefore, the claim is rejected for the same reasons as above. 

Claim 15 has similar limitations as claim 7. Therefore, the claim is rejected for the same reasons as above. 

As to claim 17, DEISHER et al teaches a computing device configured to classify information using a fully-connected layer of a convolutional neural network (paragraph [0044]...spoken utterance classification may be based on deep neural networks (DNNs) as described herein. Such neural networks may be, or may have layers of, convolutional neural networks (CNNs)), the computing device comprising: 
at least one memory (paragraph [0067]... memory 248), storing:
a two-dimensional input matrix that includes a plurality of elements (paragraph [0066]...the input elements in the input vector), wherein each row of the two-dimensional input matrix (paragraph [0053]... a two dimensional matrix with the input per iteration being a column in the matrix and in sequential order in memory, and the rows being one element per input, and changed to a structure arranged so that a set of inputs of the same iteration can be executed at once which is practically using a column of the matrix) corresponds to a batch of elements (paragraph [0163]... neural network layer input can be viewed as a 2D matrix. One of the dimensions is the input vector length and the other dimension is the grouping factor (i.e., batch size) where each group forms a different output of a layer), and
a two-dimensional weight matrix (paragraph [0034]...a weight matrix)  corresponding to the two-dimensional input matrix, the two-dimensional weight matrix including a plurality of weight values (paragraph [0048]...external memory 248 also may have one or more pre-allocated NN buffers (or application buffers) 256 including buffers for a matrix of input values, weights, scale factors, bias values, and other constants. These NN buffers 256 initially hold the data for the neural network before running the neural network or at least before a layer associated with the data is being processed. Eventually, the data in the NN buffers 256 are read by the NNA 202 to be placed into the internal buffers 238 to be used to compute NN outputs as explained below. The data for each layer in the NN buffers 256, such as the input values, scale factors, weights, and other data, also may be pre-ordered in the NN buffers 256, such as in pre-ordered single or two dimensional arrays), and
a vector processor (paragraph [0037]...processing operations such as weight functions, feature vector stacking and transformations) coupled to the at least one memory and configured to cause the computing device to:
identify a first block of elements (paragraph [0044]...the components (logic elements) of individual or each logic block are arranged to give a programmer the option to use weights with different bit lengths and a scale factor may be applied to the weights depending on the bit length of the weights as explained herein) of the two-dimensional input matrix,
load a first weight block of the two-dimensional weight matrix (paragraph [0061]...the input array may be provided in a de-interleaved form or an interleaved form. In most cases, the input array will be provided in an interleaved form. When a neural network has an RNN layer, in this case, the de-interleaved form may be provided. In the de-interleaved form, and when the memory uses row-major storage, the input elements are divided into groups along rows, and as shown in FIG. 16, where input array 1600 is shown in de-interleaved form. In this case, the memory stores the groups group after group. Thus, when the input array is uploaded from external memory to the input buffer at internal memory 314, the data of a first group is loaded, or at least as much as will fit in the input buffer, and then the next group, and so on. Again, this may be used only in the case of an RNN layer where the order of the processing of the layers in the neural network is important, by one example),
calculate a first partial output (paragraph [0032]...the NNA also may provide partial (subset) output computation-supporting active state lists processing such that a selected portion of a layer that provides outputs for less than all of the nodes on a neural network layer may be processed when processing of the entire layer is not desired) for the first block of elements by performing a dot product operation (paragraph [0080]...an MAC 401 is shown to determine a dot product (or sum output) of the weights and input values for a single node or output of a layer of a neural network. The MAC 401 may have mathematically and/or logically parallel logic blocks 402-0 to 402-N (or generally referred to as logic blocks 402), and by one example, 48 logic blocks are provided but more or less may be provided. The logic blocks 402 are fixed function hardware logic blocks formed of well understood transistors or other semiconductor components. Fixed function here refers to the use of an MAC 401 with particular logic components or elements in an arrangement that does not change) using a first row of elements of the first block of elements and the first weight block (paragraph [0082]... neural network propagation, the input to the MAC 401 may include an input set or feature vector from the input buffer and from the input array as explained above, a weight vector from the weight buffer, and a scale factor. These are all used to compute a single output (for a single node)), wherein the first row of elements of the first block of elements corresponds to a first batch of elements (paragraph [0162]...one of the dimensions is the input vector length and the other dimension is the grouping factor (i.e., batch size) where each group forms a different output of a layer. Thus, the transpose layer groups input data from multiple groups into a single array so that this array can be fetched from memory together thereby reducing the number of memory transactions),
store the first partial output (paragraph [0067]...the intermediate sum stored in the sum buffer 326 is saved to memory as an intermediate sum to allow handling of additional outputs), and
generate a first output element using the first partial output for the first block of elements and at least one other partial output corresponding to the first batch of elements (paragraph [0076]...the weighted inputs are computed in parallel, and are provided to an accumulator section of the MAC 319 to provide a single weighted input sum (also referred to as a sum output (or more likely a partial sum when more than one input set is included in an input vector (group) or more than one iteration is provided))).

As to claim 18, DEISHER et al teaches a computing device, wherein a number of rows in the first weight block corresponds to a number of columns in the first block of elements (paragraph [0070]...the weight matrix may be a 2-D matrix in this case with one row for each filter and one column for each filter element. By another format).

As to claim 19, DEISHER et al teaches a computing device, wherein the two-dimensional weight matrix is arranged in column-major order (paragraph [0069]...the weight matrix held at the weight buffer 320 may represent slightly different values depending on the type of layer that is being processed. For affine and recurrent layers, the weight matrix may have one row for each input to a layer and column for each output (or node) of the layer that is to be obtained. This assumes row major organization of memory. It is possible to use the transverse with a column major organization instead. This may be kept consistent for any of the arrays that provide a row or column major option).

As to claim 20, DEISHER et al teaches a computing device, wherein the vector processor (paragraph [0037]...processing operations such as weight functions, feature vector stacking and transformations) is further configured to cause the computing device to, in response to storing the first partial output (paragraph [0067]...the intermediate sum stored in the sum buffer 326 is saved to memory as an intermediate sum to allow handling of additional outputs) for the first block of elements: reload the first weight block (paragraph [0145]...fully connected layers are operating on an interleaved array, where multiple groups of data (each group from a different output) are interleaved to improve efficiency of memory bandwidth via re-use of the weight matrix read for all groups).


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRANDON S COLE whose telephone number is (571)270-5075. The examiner can normally be reached Mon - Fri 7:30am - 5pm EST (Alternate Fridays Off).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Omar Fernandez, can be reached at 571-272-2589. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, 

/BRANDON S COLE/           Primary Examiner, Art Unit 2128