DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of Claims
	This Office Action is in response to the communication filed on 01/03/2020.
	Claims 1-22 are being considered on the merits.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 01/03/2020 has been considered. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, initialed and dated copies of Applicant's IDS forms 1449 filed 01/03/2020 are attached to the instant Office action.

Drawings
	The drawings filed on 01/03/2020 are accepted. 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wiedemann et al. (WO 2019/086104 A1, hereinafter “Wiedemann”) in view of Diamant et al. (US 2019/0180183, hereinafter “Diamant”) and further in view of Kumar et al. (US 2019/0303750 A1, hereinafter “Kumar”).

Regarding claim 1, Wiedemann teaches a neural inference chip, comprising: 
wherein the neural inference chip is adapted to store in the global weight memory a compressed weight block comprising at least one compressed weight matrix… (Wiedemann, pg. 50, 9th para: “Here we convert the weight matrices into the desired compressed format” “Input: Compressed matrix formats + activation lookup tables + extra-info. Output: Compressed domain representation of neural network”) 
the at least one core is adapted to decode the at least one compressed weight matrix into a decoded weight matrix and (Wiedemann, pg. 12, 3rd para; pg. 18, 3rd para; pg. 42, 3rd para; and pg. 118, 2nd  para: “The compressed sparse row format also represents the sparse matrix by using three arrays. It stores the non-zero values and column indices in row major order and adds a pointer, which indicates where a new row starts” “/ can store the positions either using the lexographical index or any of the sparse-like formats” “an encoding-decoding scheme, where we explain all possible scenarios of how to convert the neural network into it's compressed representation (encoder), and subsequently interpret it (decoder)” “In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein.”)
the at least one neural core is adapted to apply the decoded weight matrix to a plurality of input activations to produce a plurality of output activations. (Wiedemann, pg. 15, 1st para and pg. 51, 4th para: “Matrix A controls the amplification or strength at which activations of neurons 16 are forwarded downstream to the respective neuron 12. Each row of matrix A is assigned to a certain neuron 12. In Fig. 1 , for instance, the first row of weights is assigned to reference sign 22. The weights 24 of matrix 10 within this row 22 control the amplification of the activations along connections 20 as depicted in Fig. 2.” “The decoder should be able to reconstruct 350 the neural network into it's uncompressed format. That is, he should know methods of converting compressed matrices back into dense format, replace tables with their corresponding activation functions (consequently dequantizing the activation values) and reconvert the low precision numerical representations back into high precision formats.” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded) 
Wiedemann does not explicitly disclose:
a global weight memory
at least one neural core, the at least one neural core comprising a local weight memory, the local weight memory comprising a plurality of memory banks, each of the plurality of memory banks being uniquely addressable by at least one index 
store the decoded weight matrix in its local weight memory 
However, Diamant teaches:
a global weight memory (Diamant, para 0058: “In some cases, particularly for small neural networks, it may be possible for all of the weight values for the neural network to be stored in on-chip memory”)
at least one neural core, the at least one neural core comprising a local weight memory, the local weight memory comprising a plurality of memory banks, each of the plurality of memory banks being uniquely addressable by at least one index (Diamant, para. 0059 and 0076: “The neural network processing engine can further include a set of memory banks local to the array of processing engines, where local can mean physically close to and/or directly accessible by the array of processing engines” “In various implementations, the memory subsystem 504 can include control logic. The control logic can, for example, keep track of the address spaces of each of the memory banks 514, identify memory banks 514 to read from or write to, and/or move data between memory banks 514, if needed”)
store the decoded weight matrix in its local weight memory (Diamant, para. 0059 and 0076: “The neural network processing engine can further include a set of memory banks local to the array of processing engines, where local can mean physically close to and/or directly accessible by the array of processing engines” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded) 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Diamant into Wiedemann. Wiedemann teaches efficient neural network representations using compression, sparsity, and matrices; Diamant teaches use of multiple processors and memory banks for more efficient neural network computations. One of ordinary skill would have been motivated to combine the teachings of Diamant into Wiedemann in order to have memory banks that are individually accessible and able to be read in parallel (Diamant, para. 0058).

Neither Wiedemann nor Diamant explicitly discloses:
a network connecting the global weight memory to the at least one neural core 
…the neural inference chip is adapted to transmit the compressed weight block from the global weight memory to the at least one neural core via the network 
However, Kumar teaches:
a network connecting the global weight memory to the at least one neural core (Kumar, paras. 0017, 0021, and 0054: “Accordingly, for a DNN, a portion or entirety of a weight matrix for one or more hidden layers can be compressed into a smaller set of weights.” “Memory 202 can store real weight values for use in generating one or more virtual weights in a matrix during training or inference. Storage of weight values on-chip can refer to storage of the weights values in a memory device on the same motherboard, die, or socket as that of the central processing unit (CPU) 204, graphics processing unit (GPU) 206, or accelerator 208 that is to access the weights and perform computation on input values using the weights.” “Network interface 950 can transmit data to a remote device, which can include sending data stored in memory”). 
…the neural inference chip is adapted to transmit the compressed weight block from the global weight memory to the at least one neural core via the network (Kumar, paras. 0017, 0021 and 0054 : “Accordingly, for a DNN, a portion or entirety of a weight matrix for one or more hidden layers can be compressed into a smaller set of weights.” “FIG. 2 depicts a system in which embodiments can be used….An off-chip memory 220 or storage device 222 can be accessed via a bus or interface (e.g., PCIe) and the off-chip memory or storage device is mounted on a separate motherboard, die, or socket as that of the processor, accelerator, GPU, CPU, or core that is to access the weights and perform computation on input values using the weights.” “In one example, system 900 includes interface 914, which can be coupled to interface 912. In one example, interface 914 represents an interface circuit, which can include standalone components and integrated circuitry. Network interface 950 can transmit data to a remote device, which can include sending data stored in memory. Network interface 950 can receive data from a remote device, which can include storing received data into memory…Various embodiments can be used in connection with network interface 950, processor 910, and memory subsystem 920.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Kumar into Wiedemann and Diamant. Wiedemann teaches efficient neural network representations using compression, sparsity, and matrices; Diamant teaches use of multiple processors and memory banks for more efficient neural network computations; Kumar teaches compressing a weight matrix into a set of stored weights that is smaller than the number of weights in the network. One of ordinary skill would have been motivated to combine the teachings of Kumar into Wiedemann and Diamant in order to receive memory compression benefits during both training and inference stages (Kumar, para. 0017).

Regarding claim 2, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 1 (above). Wiedemann further teaches: 
the at least one compressed weight matrix comprises a plurality of column indices and associated values, the plurality of column indices corresponding to each position within the decoded weight matrix containing a non-zero value. (Wiedemann, pg. 11, 5th para: “The coordinate format or COO in short, stores the information of a sparse matrix within three arrays: the weights-, rowIndex-, colIndex-array. The weights-array contains the values of all non zero elements in the matrix, and rowIndex- and colIndex-array their respective row/column positions”. Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded)
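For illustration only, the coordinate (COO) layout quoted above can be sketched as follows. This is a minimal sketch and not code from Wiedemann; the function name `to_coo` and the example matrix are assumptions introduced here.

```python
def to_coo(dense):
    """Encode a dense matrix (a list of rows) as three parallel COO arrays.

    Hypothetical helper: returns the non-zero values together with the
    row and column position of each value.
    """
    weights, row_index, col_index = [], [], []
    for r, row in enumerate(dense):
        for c, value in enumerate(row):
            if value != 0:  # only non-zero entries are stored
                weights.append(value)
                row_index.append(r)
                col_index.append(c)
    return weights, row_index, col_index


# Example matrix chosen for illustration (not taken from the reference).
dense = [[0, 4, 0],
         [3, 0, 0],
         [0, 0, 2]]
weights, row_index, col_index = to_coo(dense)
```

Each stored value is paired with a column index that marks a non-zero position of the unencoded matrix, which is the correspondence the claim language recites.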

Regarding claim 3, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 1 (above). Diamant further teaches: 
each of the plurality of memory banks is adapted to selectively store elements of the decoded weight matrix according to its at least one index (Diamant, para. 0076: “In some implementations, the memory subsystem 504 can include multiplexors for selecting which memory bank to output to a particular client and/or to receive input from a particular client…For example, a set of memory banks 514 can be hardwired to provide weights 506 and state 508 to the rows of the processing engine array 510.” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Diamant into Wiedemann as set forth above with respect to claim 1.

Regarding claim 4, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 3 (above). Wiedemann further teaches: 
…by comparing the plurality of column indices to a column index associated with each memory bank (Wiedemann, pg. 19, 3rd para: “The data field 40 associated with data field 30 indicates the positions of the weights assuming any of the discrete weight values 36 by way of column indices 60 which merely indicate the column index of each search position, wherein the association of the column index values within data set 40 and the columns of matrix 10 are indicated in Fig. 5 at 62…During the first traversal, the column index of each weight 24 is entered into the list 64 of column indices whenever a weight 24 is encountered during this traversal or scan which corresponds to the first discrete weight value 36 indicated in the list 66 of discrete weight values 36 of data field 30, which is 4 in the present case.” Examiner notes that the broadest reasonable interpretation of comparing column indices to a column index includes searching a plurality of column indices for a particular column index during which search, a comparison is necessarily made to indicate if a particular column index matches the search criteria.)
Wiedemann does not explicitly disclose:
each memory bank is adapted to selectively store elements of the decoded weight matrix….
However, Diamant teaches:
each memory bank is adapted to selectively store elements of the decoded weight matrix…. (Diamant, para. 0076: “In some implementations, the memory subsystem 504 can include multiplexors for selecting which memory bank to output to a particular client and/or to receive input from a particular client…For example, a set of memory banks 514 can be hardwired to provide weights 506 and state 508 to the rows of the processing engine array 510.” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Diamant into Wiedemann as set forth above with respect to claim 1.
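For illustration only, the claimed selective storage by index comparison can be sketched as follows. This is a hedged sketch of the claim language rather than of any circuit disclosed in Wiedemann or Diamant; the function `route_to_banks`, the one-column-per-bank assignment, and the sample data are assumptions introduced here.

```python
def route_to_banks(col_indices, values, n_banks):
    """Store each value in the bank whose column index matches it.

    Assumption: bank b is associated with column index b. Each incoming
    column index is compared against every bank's index, and only the
    matching bank keeps the value.
    """
    banks = {b: [] for b in range(n_banks)}
    for col, value in zip(col_indices, values):
        for bank_col in banks:
            if col == bank_col:  # the comparison step
                banks[bank_col].append(value)
    return banks


# Illustrative compressed entries: (column index, value) pairs.
banks = route_to_banks([1, 3, 0, 1], [4, 1, 3, 7], n_banks=4)
```

In hardware the comparisons could run in parallel across banks; the nested loop here only models the logical behavior.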

Regarding claim 5, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 1 (above). Wiedemann further teaches: 
the at least one compressed weight matrix comprises a plurality of rows, each of the plurality of rows comprising a column index and associated value for each position within that row of the decoded weight matrix containing a non-zero value (Wiedemann, pg. 11, 6th para; pg. 12, 3rd para; and pg. 12, 6th para: “The weights-array contains the values of all non zero elements in the matrix, and rowIndex- and colIndex-array their respective row/column positions” “The compressed sparse row format also represents the sparse matrix by using three arrays. It stores the non-zero values and column indices in row major order and adds a pointer, which indicates where a new row starts.” “This format stores the non zero values and their corresponding column indices, but fixes the number of non zeros per row and pads with an additional symbol * empty spaces. Subsequently, it transposes the entries in order to allow for coalesced memory access” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded). 
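For illustration only, the compressed sparse row (CSR) layout quoted above can be sketched as follows. This is a minimal sketch, not code from Wiedemann; the function name `to_csr`, the example matrix, and the conventional choice of a row-pointer array with one extra trailing entry are assumptions introduced here.

```python
def to_csr(dense):
    """Encode a dense matrix (a list of rows) in CSR form.

    Hypothetical helper: values and column indices are stored in
    row-major order, and row_ptr[r] marks where row r starts.
    """
    values, col_indices, row_ptr = [], [], [0]
    for row in dense:
        for c, value in enumerate(row):
            if value != 0:
                values.append(value)
                col_indices.append(c)
        # Close the current row by recording where the next one begins.
        row_ptr.append(len(values))
    return values, col_indices, row_ptr


# Example matrix chosen for illustration; the middle row is all zeros.
dense = [[0, 4, 0, 1],
         [0, 0, 0, 0],
         [3, 0, 2, 0]]
values, col_indices, row_ptr = to_csr(dense)
```

The entries between `row_ptr[r]` and `row_ptr[r + 1]` form one row of (column index, value) pairs, one pair per non-zero position in that row.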

Regarding claim 6, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 1 (above). Wiedemann further teaches: 
the decoded weight matrix is sparse (Wiedemann, pg. 11, 5th para: “The coordinate format or COO in short, stores the information of a sparse matrix within three arrays: the weights-, rowIndex-, colIndex-array. The weights-array contains the values of all non zero elements in the matrix, and rowIndex- and colIndex-array their respective row/column positions”. Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a matrix containing weight information that is not encoded). 

Regarding claim 7, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 1 (above). Wiedemann further teaches: 
the at least one compressed weight matrix contains fewer zero values than the decoded weight matrix (Wiedemann, pg. 11, 6th para; pg. 12, 3rd para; and pg. 12, 6th para: “The weights-array contains the values of all non zero elements in the matrix, and rowIndex- and colIndex-array their respective row/column positions” “The compressed sparse row format also represents the sparse matrix by using three arrays. It stores the non-zero values and column indices in row major order and adds a pointer, which indicates where a new row starts.” “This format stores the non zero values and their corresponding column indices, but fixes the number of non zeros per row and pads with an additional symbol * empty spaces. Subsequently, it transposes the entries in order to allow for coalesced memory access” “the decoder can fully predict the positions of the last weight. Hence, for all coming formats we can choose to not send the positions of one specific weight i.e., weight value 34. Obviously, the right choice for the encoder will be to spare the information of the positions of the most frequent weight. Alternatively, zero could be chosen by default.” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded)
the decoded weight matrix comprises at least one zero value (Wiedemann pg. 49, last paragraph; “It also gives information about the format under which the matrix content is stored and hence, the decoder will be able to know how to correctly perform the instructed matrix operation. For example, if a sparse matrix-vector multiplication is instructed, then the CG will load weight matrix in sparse format from the LV instance and the dot product will be performed using this loaded matrix representation.” Examiner notes that the broadest reasonable interpretation of a matrix in sparse format means that the matrix includes at least one zero value; Examiner additionally notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded). 


Regarding claim 8, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 7 (above). Wiedemann further teaches: 
decoding the at least one compressed weight matrix comprises inserting each value of the at least one compressed weight matrix into a zero-filled matrix (Wiedemann: “If the map ϕ is known by the decoder, the values can be reordered with respect to ϕ and consequently we spare the storage requirements of saving the positions of the elements…We can leverage in the same way this property if we can define a map ϕ_k for each of the alphabet elements k. In this way, only the unknown parameters of each map need to be signalized instead of the actual index positions…Thus, we still would have to send the information of their positions. Nevertheless, we can still gain some storage savings by sparing the storage requirement of the positions of the most frequent alphabet value and subsequently reducing the bit-size overhead of the index positions, by storing the 0, a values instead of their actual values.”)
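For illustration only, the claimed decoding step (inserting each stored value into a zero-filled matrix) can be sketched as follows. This sketch illustrates the claim language with a row-pointer sparse layout; it is not the map-based scheme of the quoted Wiedemann passage, and the function name `decode_csr`, its parameters, and the sample data are assumptions introduced here.

```python
def decode_csr(values, col_indices, row_ptr, n_cols):
    """Rebuild a dense matrix from a row-pointer sparse encoding.

    Hypothetical helper: start from an all-zero matrix and write each
    stored value at the position given by its row and column index.
    """
    n_rows = len(row_ptr) - 1
    dense = [[0] * n_cols for _ in range(n_rows)]  # zero-filled target
    for r in range(n_rows):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            dense[r][col_indices[k]] = values[k]
    return dense


# Illustrative encoded data: four non-zero values in a 3 x 4 matrix.
decoded = decode_csr([4, 1, 3, 2], [1, 3, 0, 2], [0, 2, 2, 4], n_cols=4)
```

Positions that receive no stored value keep their initial zero, so the decoded matrix contains zeros that the compressed form never stored.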


Regarding claim 9, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 1 (above). Wiedemann further teaches: 
the compressed weight block comprises a plurality of compressed weight matrices (Wiedemann, pg. 50, 9th para: “Here we convert the weight matrices into the desired compressed format” “Input: Compressed matrix formats + activation lookup tables + extra-info. Output: Compressed domain representation of neural network” );
the at least one core is adapted to decode the compressed weight block into a plurality of decoded weight matrices… (Wiedemann pg. 51, 4th para: “The decoder should be able to reconstruct 350 the neural network into it's uncompressed format. That is, he should know methods of converting compressed matrices back into dense format, replace tables with their corresponding activation functions (consequently dequantizing the activation values) and reconvert the low precision numerical representations back into high precision formats.” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded)
…decoded weight matrices to… (Wiedemann pg. 51, 4th para: “The decoder should be able to reconstruct 350 the neural network into it's uncompressed format. That is, he should know methods of converting compressed matrices back into dense format, replace tables with their corresponding activation functions (consequently dequantizing the activation values) and reconvert the low precision numerical representations back into high precision formats.” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded) 
the at least one neural core is adapted to apply the plurality of decoded weight matrices to a plurality of input activations to produce a plurality of output activations (Wiedemann, pg. 15, 1st para and pg. 51, 4th para: “Matrix A controls the amplification or strength at which activations of neurons 16 are forwarded downstream to the respective neuron 12. Each row of matrix A is assigned to a certain neuron 12. In Fig. 1 , for instance, the first row of weights is assigned to reference sign 22. The weights 24 of matrix 10 within this row 22 control the amplification of the activations along connections 20 as depicted in Fig. 2.” Examiner notes that the broadest reasonable interpretation of “uncompressed weight matrix” is a weight matrix that is not compressed)
Wiedemann does not explicitly disclose:
…and store the plurality of decoded weight matrices in its local weight memory  
However, Diamant teaches:
and store the plurality of decoded weight matrices in its local weight memory (Diamant, para. 0059: “The neural network processing engine can further include a set of memory banks local to the array of processing engines, where local can mean physically close to and/or directly accessible by the array of processing engines” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded); 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Diamant into Wiedemann as set forth above with respect to claim 1.

Regarding claim 10, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 1 (above). Wiedemann further teaches: 
the compressed weight block comprises a matrix index associated with each of the plurality of compressed weight matrices (Wiedemann, pg. 34, 3rd and 4th paras: “We consider an array-of-structure-of-array-like (AoSoA) format which consists of an array containing the weight values and their respective row and col positions, stored in 2 further arrays respectively. This idea can be extended to any other efficient dynamic sparse format (e.g. list-of-lists, dictionary format, etc) for storing the indices of each matrix.”)
each of the compressed weight matrices comprises a plurality of column indices and associated values, (Wiedemann: “A possible format could be an array-of-arrays-of-list like format, where the first array 66 comprised by the data field 36 contains the alphabet values, the second array 140 their respective column index positions where they appear at least once, here ordered in n column scans where n is the cardinality of set 36, thereby subdividing list 140 into three subsequences 142 one for each value 36 and the list lists the corresponding values that appear per row. Or, if we represent it by a set of arrays, 2 additional pointed arrays 50 and 160 would be needed that indicate start-end parts of each sequence of the previous entities. Hence, the above example would be stored in the manner depicted in Fig. 11. That is, when represented using the representation 50 of Fig. 11 , a recipient of representation 50 may obtain the set of discrete weight values, i.e., 36, from list 66 comprised by the first data field 30, while list 140 comprised by second data field 40 indicates, for each discrete weight value 36 within list 66 by way of a separate subsequence 142, the column indices of those columns where at least one weight 24 within matrix 10 is positioned which assumes the respective discrete weight value 36”)
the plurality of column indices corresponding to each position within the associated decoded weight matrix containing a non-zero value (Wiedemann : “That is, when represented using the representation 50 of Fig. 11 , a recipient of representation 50 may obtain the set of discrete weight values, i.e., 36, from list 66 comprised by the first data field 30, while list 140 comprised by second data field 40 indicates, for each discrete weight value 36 within list 66 by way of a separate subsequence 142, the column indices of those columns where at least one weight 24 within matrix 10 is positioned which assumes the respective discrete weight value 36”. Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded). 

Regarding claim 11, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 1 (above). Wiedemann further teaches: 
…elements of the decoded weight matrix according to its associated matrix index and column index (Wiedemann, pg. 34, 3rd and 4th paras: “We consider an array-of-structure-of-array-like (AoSoA) format which consists of an array containing the weight values and their respective row and col positions, stored in 2 further arrays respectively. This idea can be extended to any other efficient dynamic sparse format (e.g. list-of-lists, dictionary format, etc) for storing the indices of each matrix.” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded)
Wiedemann does not explicitly disclose:
each of the plurality of memory banks is adapted to selectively store… 
However, Diamant teaches: 
each of the plurality of memory banks is adapted to selectively store… (Diamant, para. 0076: “In some implementations, the memory subsystem 504 can include multiplexors for selecting which memory bank to output to a particular client and/or to receive input from a particular client…For example, a set of memory banks 514 can be hardwired to provide weights 506 and state 508 to the rows of the processing engine array 510.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Diamant into Wiedemann as set forth above with respect to claim 1.

Regarding claim 12, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 1 (above). Wiedemann further teaches: 
the at least one neural core is adapted to apply the uncompressed weight matrix to a plurality of input activations to produce a plurality of output activations (Wiedemann, pg. 15, 1st para and pg. 51, 4th para: “Matrix A controls the amplification or strength at which activations of neurons 16 are forwarded downstream to the respective neuron 12. Each row of matrix A is assigned to a certain neuron 12. In Fig. 1 , for instance, the first row of weights is assigned to reference sign 22. The weights 24 of matrix 10 within this row 22 control the amplification of the activations along connections 20 as depicted in Fig. 2.” Examiner notes that the broadest reasonable interpretation of “uncompressed weight matrix” is a weight matrix that is not compressed) 
Wiedemann does not explicitly disclose:
the neural inference chip is adapted to store in the global weight memory an uncompressed weight matrix 
the neural inference chip is adapted to transmit the uncompressed weight matrix from the global weight memory to the at least one neural core via the network 
the at least one core is adapted to store the uncompressed weight matrix in its memory; 
However, Diamant teaches:
the neural inference chip is adapted to store in the global weight memory an uncompressed weight matrix (Diamant, para 0058 and 0042: “In some cases, particularly for small neural networks, it may be possible for all of the weight values for the neural network to be stored in on-chip memory” “Windowing and weight sharing in a neural network layer can be accomplished by structuring the computation executed at each node as a convolution. FIG. 3A illustrates an example of a model 310 of a 2-dimensional convolution as applied to image processing. In this example model, a filter plane 304 is a set of weights arranged in a matrix having a height R and a width S”)
the neural inference chip is adapted to transmit the uncompressed weight matrix from the global weight memory to the at least one neural core via the network (Diamant 0059: “particularly when the neural network is small, all of the weight values for the neural network can also be stored in the memory banks of the neural network processing engine. In these cases, it may be possible for the array of processing engines to sustain full utilization in every clock cycle.”)
the at least one core is adapted to store the uncompressed weight matrix in its memory; (Diamant, para 0058 and 0042: “In some cases, particularly for small neural networks, it may be possible for all of the weight values for the neural network to be stored in on-chip memory” “Windowing and weight sharing in a neural network layer can be accomplished by structuring the computation executed at each node as a convolution. FIG. 3A illustrates an example of a model 310 of a 2-dimensional convolution as applied to image processing. In this example model, a filter plane 304 is a set of weights arranged in a matrix having a height R and a width S”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Diamant into Wiedemann as set forth above with respect to claim 1.


Regarding claim 12, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 1 (above). Diamant further teaches: 
the neural inference chip is adapted to store in the global weight memory an uncompressed weight matrix  (Diamant 0058 and 0042: “In some cases, particularly for small neural networks, it may be possible for all of the weight values for the neural network to be stored in on-chip memory” “Windowing and weight sharing in a neural network layer can be accomplished by structuring the computation executed at each node as a convolution. FIG. 3A illustrates an example of a model 310 of a 2-dimensional convolution as applied to image processing. In this example model, a filter plane 304 is a set of weights arranged in a matrix having a height R and a width S”) 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Diamant into Wiedemann as set forth above with respect to claim 1.

Regarding claim 13, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 12 (above). Wiedemann further teaches: 
…when in compressed mode, the compressed weight block being transmitted (Wiedemann pg. 52, last para.: “Finally, given the desired changes (maybe from an outside source), the decoder should also provide a set of methods that allow him to modify 358 the neural network by any means. That is, it should be able to replace parts of a matrix with desired values (matrix stays in the same format), replace an entire matrix by another one (here the type of format may change)”; Examiner notes that the broadest reasonable interpretation of transmitted means removing and applying (i.e. replacing).)
when in uncompressed mode the uncompressed weight matrix being transmitted (Wiedemann pg. 52, last para.: “Finally, given the desired changes (maybe from an outside source), the decoder should also provide a set of methods that allow him to modify 358 the neural network by any means. That is, it should be able to replace parts of a matrix with desired values (matrix stays in the same format), replace an entire matrix by another one (here the type of format may change)”; Examiner notes that the broadest reasonable interpretation of transmitted means removing and applying (i.e. replacing).)
Wiedemann does not explicitly disclose:
the neural inference chip is operable to switch between a compressed and an uncompressed mode at runtime 
However, Kumar teaches: 
the neural inference chip is operable to switch between a compressed and an uncompressed mode at runtime (Kumar, para 0017: “The weight matrix can be constructed during runtime of training or inference through use of a hash table, exclusive or (XOR) of one or more entries of the hash table, followed by a weight lookup operation from the smaller set of weights based on the output from the XOR operation. Certain elements in a weight matrix can share the same value thereby reducing the amount of weights that are stored and used during training or inference. During a training phase, a compression scheme can be set for the weight matrix. The same compression scheme can be applied during an inference stage, thereby providing memory compression benefits across both training and inference stages.”) 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Kumar into Wiedemann and Diamant as set forth above with respect to claim 1.
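
For illustration only, the runtime construction of a weight matrix quoted above from Kumar (hash table entries, an XOR of those entries, then a lookup into a smaller set of shared weights) can be sketched as follows. This is a minimal sketch under stated assumptions, not Kumar's implementation: the function name, the use of one hash entry per row and per column, the modulo reduction, and all sample values are hypothetical.

```python
def build_weight_matrix(row_hash, col_hash, shared_weights):
    """Rebuild a full weight matrix from a small set of shared weights.

    Assumption (hypothetical): one hash-table entry per row and per column.
    The two entries are XORed and the result, reduced modulo the number of
    shared weights, selects the weight for that position.
    """
    n = len(shared_weights)
    return [
        [shared_weights[(rh ^ ch) % n] for ch in col_hash]
        for rh in row_hash
    ]


# Hypothetical sample data: a 2 x 3 matrix backed by only 4 stored weights.
row_hash = [3, 5]
col_hash = [1, 6, 2]
shared_weights = [0.1, -0.4, 0.7, 0.2]

matrix = build_weight_matrix(row_hash, col_hash, shared_weights)
for row in matrix:
    print(row)
```

In this sketch several positions of the reconstructed matrix share one stored value, which is the memory saving the quoted passage describes.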

Regarding claim 14, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 1 (above). Diamant further teaches: 
the network interconnecting each of the plurality of memory banks having a common row index. (Diamant 0061 and 0076: “In some implementations, a neural network processor can be constructed with multiple neural network processing engines, each having an independent array of processing engines and local memory banks. In these implementations, each neural network processing engine can execute a neural network, so that multiple neural networks can be run at the same time…When the designated neural network processing engine needs the weights that are stored with another neural network processing engine, the weights can be read from the memory banks of the other neural network processing and loaded into the memory banks of the designated neural network processing engine.” “For example, a set of memory banks 514 can be hardwired to provide weights 506 and state 508 to the rows of the processing engine array 510. In these examples, the control logic can move data between memory banks 514, for example, to move intermediate results from the memory banks 514 to which the intermediate results are written, to the memory banks 514 from which the intermediate results will be read for the next round of computation.” Examiner notes that the broadest reasonable interpretation of an “index” is a record such that memory banks are able to access rows of a shared record (i.e. to move data between them)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Diamant into Wiedemann as set forth above with respect to claim 1.

Regarding claim 15, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 1 (above). Diamant further teaches: 
the global weight memory is external to the at least one neural core (Diamant, para. 0060: “Thus, in some implementations, as a computation progresses and memory space becomes available, the neural network processing engine can load additional weights into the available space. In some cases, the weights can come from an off-chip memory”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Diamant into Wiedemann as set forth above with respect to claim 1.


Regarding claim 16, Wiedemann, Diamant, and Kumar teach the neural inference chip of claim 1 (above). Diamant further teaches: 
the global weight memory is distributed among the at least one neural core. (Diamant, para. 0060: “Thus, in some implementations, as a computation progresses and memory space becomes available, the neural network processing engine can load additional weights into the available space. In some cases, the weights can come from an off-chip memory”; Examiner notes that the broadest reasonable interpretation of being “distributed among the at least one neural core” includes a distribution (i.e., transmission) to that one such core.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Diamant into Wiedemann as set forth above with respect to claim 1.

Regarding claim 17, Wiedemann teaches a method comprising: 
storing a compressed weight block comprising at least one weight matrix in a global weight memory of… (Wiedemann, pg. 50, 9th para: “Here we convert the weight matrices into the desired compressed format” “Input: Compressed matrix formats + activation lookup tables + extra-info. Output: Compressed domain representation of neural network”) 
decoding the at least one compressed weight matrix into a decoded weight matrix; (Wiedemann, pg. 12, 3rd para; pg. 18, 3rd para; pg. 42, 3rd para; and pg. 118, 2nd  para: “The compressed sparse row format also represents the sparse matrix by using three arrays. It stores the non-zero values and column indices in row major order and adds a pointer, which indicates where a new row starts” “/ can store the positions either using the lexographical index or any of the sparse-like formats” “an encoding-decoding scheme, where we explain all possible scenarios of how to convert the neural network into it's compressed representation (encoder), and subsequently interpret it (decoder)” “In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein.” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded)
applying the decoded weight matrix to a plurality of input activations to produce a plurality of output activations. (Wiedemann, pg. 15, 1st para and pg. 51, 4th para: “Matrix A controls the amplification or strength at which activations of neurons 16 are forwarded downstream to the respective neuron 12. Each row of matrix A is assigned to a certain neuron 12. In Fig. 1 , for instance, the first row of weights is assigned to reference sign 22. The weights 24 of matrix 10 within this row 22 control the amplification of the activations along connections 20 as depicted in Fig. 2.” “The decoder should be able to reconstruct 350 the neural network into it's uncompressed format. That is, he should know methods of converting compressed matrices back into dense format, replace tables with their corresponding activation functions (consequently dequantizing the activation values) and reconvert the low precision numerical representations back into high precision formats.” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded) 
Wiedemann does not explicitly disclose: 
the at least one neural core comprising a local weight memory, the local weight memory comprising a plurality of memory banks, each of the plurality of memory banks being uniquely addressable by at least one index 
the network interconnecting each of the plurality of memory banks; 
storing the decoded weight matrix in a local weight memory of a neural core

However, Diamant teaches:
the at least one neural core comprising a local weight memory, the local weight memory comprising a plurality of memory banks, each of the plurality of memory banks being uniquely addressable by at least one index (Diamant, para. 0059 and 0076: “The neural network processing engine can further include a set of memory banks local to the array of processing engines, where local can mean physically close to and/or directly accessible by the array of processing engines” “In various implementations, the memory subsystem 504 can include control logic. The control logic can, for example, keep track of the address spaces of each of the memory banks 514, identify memory banks 514 to read from or write to, and/or move data between memory banks 514, if needed”)
the network interconnecting each of the plurality of memory banks; (Diamant 0061 and 0076: “In some implementations, a neural network processor can be constructed with multiple neural network processing engines, each having an independent array of processing engines and local memory banks. In these implementations, each neural network processing engine can execute a neural network, so that multiple neural networks can be run at the same time…When the designated neural network processing engine needs the weights that are stored with another neural network processing engine, the weights can be read from the memory banks of the other neural network processing and loaded into the memory banks of the designated neural network processing engine.” “For example, a set of memory banks 514 can be hardwired to provide weights 506 and state 508 to the rows of the processing engine array 510. In these examples, the control logic can move data between memory banks 514, for example, to move intermediate results from the memory banks 514 to which the intermediate results are written, to the memory banks 514 from which the intermediate results will be read for the next round of computation.”)
storing the decoded weight matrix in a local weight memory of a neural core; (Diamant, para. 0059 and 0076: “The neural network processing engine can further include a set of memory banks local to the array of processing engines, where local can mean physically close to and/or directly accessible by the array of processing engines” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Diamant into Wiedemann as set forth above with respect to claim 1.

Neither Wiedemann nor Diamant explicitly discloses:
transmitting the compressed weight block from the global weight memory to at least one neural core on the neural inference chip via a network 
…a neural inference chip. 
However, Kumar teaches: 
transmitting the compressed weight block from the global weight memory to at least one neural core on the neural inference chip via a network (Kumar, paras. 0017, 0021 and 0054: “Accordingly, for a DNN, a portion or entirety of a weight matrix for one or more hidden layers can be compressed into a smaller set of weights.” “FIG. 2 depicts a system in which embodiments can be used….An off-chip memory 220 or storage device 222 can be accessed via a bus or interface (e.g., PCIe) and the off-chip memory or storage device is mounted on a separate motherboard, die, or socket as that of the processor, accelerator, GPU, CPU, or core that is to access the weights and perform computation on input values using the weights.” “In one example, system 900 includes interface 914, which can be coupled to interface 912. In one example, interface 914 represents an interface circuit, which can include standalone components and integrated circuitry. Network interface 950 can transmit data to a remote device, which can include sending data stored in memory. Network interface 950 can receive data from a remote device, which can include storing received data into memory…Various embodiments can be used in connection with network interface 950, processor 910, and memory subsystem 920.”)
…a neural inference chip. (Kumar, para 0054: “In one example, system 900 includes interface 914, which can be coupled to interface 912. In one example, interface 914 represents an interface circuit, which can include standalone components and integrated circuitry. Network interface 950 can transmit data to a remote device, which can include sending data stored in memory. Network interface 950 can receive data from a remote device, which can include storing received data into memory…Various embodiments can be used in connection with network interface 950, processor 910, and memory subsystem 920.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Kumar into Wiedemann and Diamant as set forth above with respect to claim 1.

Regarding claim 18, Wiedemann, Diamant, and Kumar teach the method of claim 17 (above). Diamant further teaches: 
each of the plurality of memory banks selectively storing elements of the decoded weight matrix according to its associated column index. (Diamant, para. 0076: “In some implementations, the memory subsystem 504 can include multiplexors for selecting which memory bank to output to a particular client and/or to receive input from a particular client…For example, a set of memory banks 514 can be hardwired to provide weights 506 and state 508 to the rows of the processing engine array 510.” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Diamant into Wiedemann as set forth above with respect to claim 1.

Regarding claim 19, Wiedemann, Diamant, and Kumar teach the method of claim 17 (above). Wiedemann further teaches: 
the compressed weight matrix comprises a plurality of column indices and associated values, the plurality of column indices corresponding to each position within the decoded weight matrix containing a non-zero value. (Wiedemann, pg. 11, 6th para; pg. 12, 3rd para; and pg. 12, 6th para: “The weights-array contains the values of all non zero elements in the matrix, and rowIndex- and colIndex-array their respective row/column positions” “The compressed sparse row format also represents the sparse matrix by using three arrays. It stores the non-zero values and column indices in row major order and adds a pointer, which indicates where a new row starts.” “This format stores the non zero values and their corresponding column indices, but fixes the number of non zeros per row and pads with an additional symbol * empty spaces. Subsequently, it transposes the entries in order to allow for coalesced memory access”. Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded).
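
For illustration only, the compressed sparse row format quoted above (non-zero values, their column indices in row-major order, and a pointer marking where each row starts) can be sketched as follows. The function name, variable names, and sample values are hypothetical and do not appear in Wiedemann or in the claims.

```python
def csr_to_dense(values, col_indices, row_ptr, num_cols):
    """Expand a CSR-encoded matrix into a dense list of rows.

    values      : non-zero entries in row-major order
    col_indices : column index of each entry in `values`
    row_ptr     : row_ptr[r] is the offset in `values` where row r starts;
                  the final element equals len(values)
    """
    dense = []
    for r in range(len(row_ptr) - 1):
        row = [0] * num_cols
        # Entries row_ptr[r] .. row_ptr[r + 1] - 1 belong to row r.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            row[col_indices[k]] = values[k]
        dense.append(row)
    return dense


# Hypothetical 4 x 4 matrix with one non-zero value per row.
values = [5, 8, 3, 6]
col_indices = [0, 1, 2, 1]
row_ptr = [0, 1, 2, 3, 4]

dense = csr_to_dense(values, col_indices, row_ptr, num_cols=4)
for row in dense:
    print(row)
```

Each stored column index identifies a position in the dense matrix that holds a non-zero value; every other position is filled with zero.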

Regarding claim 20, Wiedemann, Diamant, and Kumar teach the method of claim 17 (above). Wiedemann further teaches: 
…by comparing the plurality of column indices to a column index associated with each memory bank (Wiedemann, pg. 19, 3rd para: “The data field 40 associated with data field 30 indicates the positions of the weights assuming any of the discrete weight values 36 by way of column indices 60 which merely indicate the column index of each search position, wherein the association of the column index values within data set 40 and the columns of matrix 10 are indicated in Fig. 5 at 62…During the first traversal, the column index of each weight 24 is entered into the list 64 of column indices whenever a weight 24 is encountered during this traversal or scan which corresponds to the first discrete weight value 36 indicated in the list 66 of discrete weight values 36 of data field 30, which is 4 in the present case.” Examiner notes that the broadest reasonable interpretation of comparing column indices to a column index includes searching a plurality of column indices for a particular column index during which search, a comparison is necessarily made to indicate if a particular column index matches the search criteria.)
Wiedemann does not explicitly disclose:
each memory bank selectively storing elements of the decoded weight matrix... 
However, Diamant teaches:
each memory bank selectively storing elements of the decoded weight matrix... (Diamant, para. 0076: “In some implementations, the memory subsystem 504 can include multiplexors for selecting which memory bank to output to a particular client and/or to receive input from a particular client…For example, a set of memory banks 514 can be hardwired to provide weights 506 and state 508 to the rows of the processing engine array 510.” Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a weight matrix that is not encoded.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Diamant into Wiedemann as set forth above with respect to claim 1.
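
For illustration only, the comparison of a plurality of column indices against a column index associated with each memory bank can be pictured as follows. This is a hypothetical sketch of the limitation as read above; it is not taken from Wiedemann or Diamant, and the function name, the one-bank-per-column arrangement, and the sample values are assumptions.

```python
def distribute_to_banks(col_indices, values, num_banks):
    """Route each (column index, value) pair to the bank with a matching index.

    Assumption (hypothetical): bank b is associated with column index b.
    Each bank compares every incoming column index with its own index and
    keeps only the values whose index matches.
    """
    banks = {b: [] for b in range(num_banks)}
    for col, val in zip(col_indices, values):
        for bank_index, stored in banks.items():
            if col == bank_index:  # the comparison step
                stored.append(val)
    return banks


# Hypothetical sample data: four non-zero values spread over three columns.
banks = distribute_to_banks([0, 2, 2, 1], [4, 7, 9, 1], num_banks=3)
print(banks)
```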

Regarding claim 21, Wiedemann, Diamant, and Kumar teach the method of claim 17 (above). Wiedemann further teaches: 
the decoded weight matrix is sparse. (Wiedemann, pg. 11, 5th para: “The coordinate format or COO in short, stores the information of a sparse matrix within three arrays: the weights-, rowIndex-, colIndex-array. The weights-array contains the values of all non zero elements in the matrix, and rowIndex- and colIndex-array their respective row/column positions”. Examiner notes that the broadest reasonable interpretation of “decoded weight matrix” is a matrix containing weight information that is not encoded).

Regarding claim 22, Wiedemann, Diamant, and Kumar teach the method of claim 17 (above). Wiedemann further teaches: 
the compressed weight matrix comprises a plurality of rows, each of the plurality of rows comprising a column index and associated value for each position within that row of the decoded weight matrix containing a non-zero value. (Wiedemann, pg. 11, 6th para; pg. 12, 3rd para; and pg. 12, 6th para: “The weights-array contains the values of all non zero elements in the matrix, and rowIndex- and colIndex-array their respective row/column positions” “The compressed sparse row format also represents the sparse matrix by using three arrays. It stores the non-zero values and column indices in row major order and adds a pointer, which indicates where a new row starts.” “This format stores the non zero values and their corresponding column indices, but fixes the number of non zeros per row and pads with an additional symbol * empty spaces. Subsequently, it transposes the entries in order to allow for coalesced memory access”).

Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Cassidy, et al. (WO 2019/207376 A1) teaches neural inference processors with memory and a global scheduler adapted to provide synaptic weights from the neural network model to each processor core.
Nandakumar, et al. (“Mixed-precision training of deep neural networks using computational memory”, 4 Dec. 2017, ArXiv: 1712.0119v1) teaches a mixed-precision architecture that combines a computational memory unit storing the synaptic weights with a digital processing unit and an additional memory unit.
Phan, et al. (US 20200293876 A1) teaches an optimization model to compress a DNN.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Sally T. Nguyen whose telephone number is (571) 272-3406. The examiner can normally be reached Monday - Thursday, 9:00am - 5:00pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov, can be reached on (571) 270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/STN/Examiner, Art Unit 2123

/ALEXEY SHMATOV/Supervisory Patent Examiner, Art Unit 2123