DETAILED ACTION
Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination
2.	A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 18 September 2020 [hereinafter Response] has been entered, where:
	Claims 1 and 8 have been amended.
	Claims 1-17 are pending.
	Claims 1-17 are rejected.
Examiner notes that the Response was placed on file prior to the mailing of the Advisory Action on 24 September 2020, and that the Response includes a Request for Continued Examination (RCE) Transmittal submitting the Applicant’s Amendment After Final of 11 August 2020 for consideration.
Examiner, in his discretion, submits this communication as non-final so that Applicant has an opportunity to respond to the Examiner’s response, with a view toward advancing the prosecution of the instant application.
Specification
3.	The Response submits that “[i]n response to Examiner’s suggestion, Applicant hereby submits a substitute specification with this paper.” (Response at p. 10). Examiner does not see such substitute specification in the file wrapper of this application. Examiner notes that Applicant submitted a new abstract of the disclosure on a separate sheet, apart from any other text, in accordance with 37 CFR §§ 1.52(b)(4) and 1.72(b), on 04 December 2019; however, it appears that the original abstract text remains embedded in the first page of Applicant’s disclosure.
	Examiner suggests Applicant submit a substitute specification that removes the original abstract text at paragraph [0002], and that also sequentially renumbers the paragraphs accordingly.
Claim Rejections - 35 USC § 103
4.	The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
5.	The factual inquiries for determining obviousness under 35 U.S.C. § 103 are summarized as follows:
1. 	Determining the scope and contents of the prior art.
2. 	Ascertaining the differences between the prior art and the claims at issue.
3. 	Resolving the level of ordinary skill in the pertinent art.
4. 	Considering objective evidence present in the application indicating obviousness or nonobviousness.
6.	This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. § 102(b)(2)(C) for any potential 35 U.S.C. § 102(a)(2) prior art against the later invention.
7.	Claims 1-4, 8-11, and 13-17 are rejected under 35 U.S.C. § 103 as being unpatentable over US Published Application 20180032860 to Tan et al. [hereinafter Tan], in view of Han et al., “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,” pp. 1-14 (Conference Paper ICLR 2016) [hereinafter Han], and further in view of Mathew et al., “Sparse, Quantized, Full Frame CNN for Low Power Embedded Devices,” pp. 11-19 (CVPR 2017) [hereinafter Mathew].
Regarding claim 1, Tan teaches [a]n integrated circuit chip device (Tan ¶ 0071 teaches processor 810 (integrated circuit chip device) . . . configured to interpret and/or to execute program instructions and/or to process data) for training a neural network that includes n layers and n being an integer greater than 1, comprising:
an external interface (Tan ¶¶ 0072-73 & FIG. 8 teaches processor 810 may interpret and/or execute program instructions and/or process data stored in storage device 820, memory 830, or storage device 820 and memory 830 (boundary between processor 810 and memory 840 construed as an external interface)) configured to receive one or more training instructions (Tan ¶¶ 0022-23 & FIG. 1 teaches neural networks (neural network), which may be referred to herein as a dictionary-based, self-adaptive networks (DSN). Various embodiments may relate to DSN architectures, training algorithms, and/or training schemes (training instructions for a neural network));
a processing circuit (Tan ¶ 0071 teaches a processor 810) configured to:
determine a first layer input data (Tan ¶ 0060 & FIG. 7 teaches at block 706 [of FIG. 7] an input may be prepared (first layer input data)) and a first layer weight group data (Tan ¶ 0019 teaches Mapping of neural networks may include two different types of dictionaries, one type of dictionary for intermediate activations, and another type of dictionary for weights (a first layer weight group data). . . . Weights dictionaries may be generated by offline K-means (e.g., Lloyd's algorithms); see also Tan ¶ 0021, which teaches a “clustering method” (e.g. k-means) is a method of receiving input data and outputting representative index and a corresponding collection of centroids. In the instant context, such resultant clustering output is a group of weights, or simply, a weight dictionary),
 * * *
query a first layer output data corresponding to the first layer quantized input data and the first layer quantized weight group data from a preset output result table (Tan ¶ 0018 teaches [a] dictionary (a preset output result table) may be queried by one of the entries (query a first layer output data) (e.g., the simple electronic message) and receive as an output the correlated entry. For example, a query by a simple electronic message may result in the output of a more complex data structure and a query by a more complex data structure may result in the output of a simple electronic message;
Examiner interprets “first layer output data” as necessarily, without more, “corresponding to the first layer . . . input data and the first layer . . . weight group data” in a forward progression of a neural network),
determine the first layer output data as a second layer input data, and input the second layer input data into n-1 layers to execute forward operations to obtain nth layer output data (Tan ¶ 0029 teaches [w]ith reference to FIG. 1, and equation (1) below, an output of a layer during a feed-forward phase may be computed (execute forward operations to obtain nth layer output data);
Examiner points out that a feed forward operation by definition would entail providing a previous layer output data as a second layer input data, which would be input to a second layer input data into n-1 layers (see, e.g., Tan Fig. 1)),
determine nth layer output data gradients of the nth layer output data (Tan ¶ 0021 teaches “stochastic gradient descent” (SGD) is an optimization method for machine learning; Tan ¶ 0033 teaches more specifically, operating matrix 250 may data receive from the current layer and/or other layers (e.g., previous layer, next layer, ect. [sic]), and the gradient (ΔW) may be computed (determine nth layer output data gradients of the nth layer output data)),
obtain nth layer back operations among the back operations of n layers of the training instructions (Tan ¶ 0063 & FIGs. 1 & 5 teaches gradients in each layer may be backward computed (obtain nth layer back operations among the back operations of the n layers of the training instructions). For example, gradients of a dictionary of each layer and a new operating matrix (e.g., new operating matrix 250/350) of each layer may be computed),
* * *
query nth layer input data gradients corresponding to the nth layer . . . output data gradients and a nth layer . . . input data (a first layer quantized input data) from the preset output result table (Tan ¶ 0018 teaches, as described above, that [a] dictionary (the preset output result table) may be queried by one of the entries (query nth layer input data gradients) . . . and receive as an output the correlated entry; as also described above, Tan ¶ 0019 teaches a type of dictionary for weights . . . [w]eights dictionaries may be generated by offline K-means (e.g., Lloyd’s algorithms)),
query nth layer weight group gradients corresponding to the nth layer . . . output data gradients and a nth layer . . . weight group data from the preset output result table (Tan ¶¶ 0018-19 teaches, as described above, that [a] dictionary (the preset output result table) may be queried by one of the entries (which one of the entries, in this instance, is nth layer weight group gradients) . . . and receive as an output the correlated entry; as also described above, Tan ¶ 0019 teaches a type of dictionary for weights . . . [w]eights dictionaries may be generated by offline K-means (e.g., Lloyd’s algorithms). Mapping of neural networks may include two different types of dictionaries, one type . . . for intermediate (that is, nth layer), and another type . . . for weights (that is, weight group gradients); Examiner points out that a dictionary of Tan may include multiple types, which teaches the preset output result table; with respect to weight group data, Tan ¶ 0024 teaches a gradient of each centroid in a dictionary in a particular layer may be computed by summarizing (that is, grouping) the gradients from indexing positions which map to the same specific centroid (nth layer . . . weight group data)),
update a weight group data of n layers according to the nth layer weight group gradients (Tan ¶ 0064 teaches a mapping dictionary and an index matrix for each layer may be updated (update a weight group data of n layers according to the nth layer weight group gradients); as noted above, Tan ¶ 0021 teaches a “clustering method” (e.g. k-means) is a method of receiving input data and outputting representative index and a corresponding collection of centroids. In the instant context, such resultant clustering output is a group of weights, or simply, a weight dictionary),
determine the nth input data gradients as (n-1)th output data gradients, input the nth input data gradients into n-1 layers to execute back operations to obtain n-1 weight group data gradients (Tan ¶ 0027 & FIG. 1 teaches a portion of a [neural] network 100, which may also be referred to herein as “system 100.” System 100 includes an activation array al 102 in layer l, a 2-D indexing matrix Ll in layer l, and a mapping dictionary (1-D array) Dl 105 in layer l. System 100 further includes activation array al-2 112, and activation array al-1 122, an activation array al+1 132, and indexing matrix Ll-1 114, an indexing matrix Ll+1 124, a mapping dictionary Dl-1 116, and a mapping dictionary Dl+1 126; Tan ¶ 0048 teaches gradients in each layer may be backward computed (determine the nth input data gradients as (n-1)th output data gradients); Tan ¶ 0032 & FIG. 1 teach that the backward gradients are determined by an input of a layer L1 data gradient (input the nth input data gradients) into layer Ll-1 for backward gradient of dictionary computing (into n-1 layers to execute back operations) . . . wherein Loss denotes the loss function for a neural network, δl denotes a propagating error from the final layer to layer l, and equation (5) may be used for updating dictionary D (to obtain n-1 weight group data gradients)), and
update n-1 weight group data corresponding to the n-1 weight group data gradients of the n-1 weight group data gradients, wherein the weight group data of each layer comprises at least two weights (Tan ¶ 0064 teaches a mapping dictionary and an index matrix for each layer may be updated (update a weight group data of n layers of the nth layer weight group gradients); as noted above, Tan ¶ 0021 teaches a “clustering method” (e.g. k-means) is a method of receiving input data and outputting representative index and a corresponding collection of centroids. In the instant context, such resultant clustering output is a group of weights, or simply, a weight dictionary (for clustering to apply, centroids are based on weight group data of each layer comprises at least two weights)).
Though Tan teaches the feature of gradient generation for neural network layers, Tan does not explicitly teach quantizing such weights, in that Tan does not explicitly teach -
to quantize the first layer input data and the first layer weight group data to obtain a first layer quantized input data and a first layer quantized weight group data . . . , and to quantize the nth layer output data gradients to obtain nth layer quantized output data gradients . . . .
But Han teaches quantize . . . the first layer weight group to obtain . . . first layer quantized weight group data (Han at page 2, third full paragraph, teaches the weights are quantized so that multiple connections share the same weight, thus only the codebook (effective weights) and the indices need to be stored; see also Han Fig. 3, which discloses [w]eight sharing by scalar quantization (top) and centroids fine-tuning (bottom)), . . . and quantize the nth layer output data gradients to obtain nth layer quantized output data gradients (Han at page 2, third full paragraph, teaches the weights are quantized so that multiple connections share the same weight, thus only the codebook (effective weights) and the indices need to be stored; see also Han Fig. 3, which discloses [w]eight sharing by scalar quantization (top) and centroids fine-tuning (bottom)).
Tan and Han are analogous art because both disclose improving training efficiency for neural networks. Tan teaches embodiments may relate to dictionary-based, self-adaptive network architectures, training algorithms, and/or training schemes. Han teaches deep compression to reduce the storage required by neural networks. Thus, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention to implement the teachings of Han pertaining to deep compression of neural networks with the dictionary-based, self-adaptive network architectures of Tan.
The motivation for doing so is that neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources (Han, Abstract). Further, the claimed limitations, subsequent to the quantization, would be performed on any resulting quantized data by the combination of Tan and Han.
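The storage reduction that motivates weight sharing can be illustrated with a minimal sketch. It is not taken from Tan or Han; the function name `shared_weight_storage_bits` and the accounting (one codebook index per connection plus the codebook itself, with 32-bit full-precision weights) are assumptions made for illustration only.

```python
import math

def shared_weight_storage_bits(num_weights, num_shared, bits_per_weight=32):
    """Return (dense_bits, shared_bits) for one layer (hypothetical helper)."""
    # Dense storage: every connection keeps its own full-precision weight.
    dense_bits = num_weights * bits_per_weight
    # With sharing, each connection keeps only an index wide enough
    # to address every entry of the codebook.
    index_bits = max(1, math.ceil(math.log2(num_shared)))
    # Shared storage: one index per connection plus the codebook entries.
    shared_bits = num_weights * index_bits + num_shared * bits_per_weight
    return dense_bits, shared_bits

# Assumed example: 16 connections sharing 4 central weights.
dense, shared = shared_weight_storage_bits(num_weights=16, num_shared=4)
ratio = dense / shared
print(dense, shared, ratio)
```

Under these assumed figures the layer shrinks from 512 bits to 160 bits, which is the kind of saving that makes deployment on resource-limited hardware more practical.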
However, the combination of Tan and Han fails to explicitly teach to quantize the first layer input data . . . to obtain a first layer quantized input data . . . . 
But Mathew teaches to quantize the first layer input data . . . to obtain a first layer quantized input data . . . (Mathew right column at page 12, Section 2.4, first full paragraph, teaches that [s]tudies have shown that quantization of floating point coefficients to dynamic 8-bit fixed point is sufficient to retain the accuracy in Image classification problems (quantize the first layer input data) as shown in Caffe Ristretto [13] and in Tensorflow [14]; Examiner points out quantization reduces the complexities of neural networks such as convolutional neural networks).
Mathew, Tan, and Han are analogous art because each discloses improving training efficiency for neural networks. Tan teaches embodiments may relate to dictionary-based, self-adaptive network architectures, training algorithms, and/or training schemes. Han teaches deep compression to reduce the storage required by neural networks. Mathew teaches reducing the complexity of convolutional neural networks. Thus, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention to implement the teachings of Mathew pertaining to sparse quantized training of a neural network with the improved training efficiency of Tan and Han.
The motivation for doing so is to reduce the complexity of convolutional networks, which includes quantizing the network to use 8-bit fixed point multiplications efficiently (Mathew, Abstract).
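The phrase “dynamic 8-bit fixed point” can be made concrete with a minimal sketch. It shows one common scheme, in which the fractional length is chosen per tensor from the largest magnitude present; it is not asserted to be the implementation used in Mathew, Caffe Ristretto, or Tensorflow. The names `quantize_dynamic_fixed_point` and `dequantize` are hypothetical.

```python
import math

def quantize_dynamic_fixed_point(values, bits=8):
    """Quantize floats to signed fixed point; returns (codes, frac_bits)."""
    max_abs = max(abs(v) for v in values)
    # Integer bits needed to cover the largest magnitude
    # (the sign bit is accounted for separately below).
    if max_abs == 0:
        int_bits = 0
    else:
        int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    # Remaining bits hold the fraction: this is the "dynamic" part.
    frac_bits = bits - 1 - int_bits
    scale = 2.0 ** frac_bits
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Round to the nearest code and saturate to the representable range.
    codes = [min(hi, max(lo, round(v * scale))) for v in values]
    return codes, frac_bits

def dequantize(codes, frac_bits):
    """Map integer codes back to approximate real values."""
    return [c / (2.0 ** frac_bits) for c in codes]

# Assumed example input data for a first layer.
codes, frac_bits = quantize_dynamic_fixed_point([0.5, -1.25, 3.0, 0.1])
approx = dequantize(codes, frac_bits)
print(codes, frac_bits, approx)
```

In this assumed example the largest magnitude is 3.0, so two integer bits are reserved and five bits remain for the fraction; the value 0.1 is the only input that is not represented exactly.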
Regarding claim 2, the combination of Tan, Han, and Mathew teaches all of the limitations of claim 1, as described above.
	Tan further teaches wherein for quantizing the first layer weight group data, the processing circuit comprises:
a control unit configured to obtain quantization instructions and decode the quantization instructions to obtain query control information,
wherein the query control information includes address information corresponding to the first layer weight group data in a preset weight dictionary (Tan ¶ 0023 teaches each layer of a DSN architecture (n layers of the neural network) may include at least one indexing matrix, which may be paired with a mapping dictionary (e.g., a multi-dimensional mapping dictionary). An indexing matrix may include an address (query control information includes address information) indexing to particular mapping dictionary (in a preset weight dictionary) that may include a plurality (e.g. 32 or 64) of floating-point numbers (corresponding to the first layer weight group data); querying includes accessing stored information, see, e.g., Tan ¶ 0018), and
wherein the preset weight dictionary comprising encodings corresponding to all the weights in weight group data of n layers of the neural network (as described above, Tan ¶ 0019 teaches that [w]eights dictionaries may be generated by offline K-means (e.g., Lloyd's algorithms) (encodings corresponding to all the weights in a weight group data); also as described above, Tan ¶ 0021 teaches a “clustering method” (e.g. k-means) is a method of receiving input data and outputting representative index and a corresponding collection of centroids);
a dictionary query unit configured to query K encodings corresponding to K weights in the first layer weight group data from the preset weight dictionary of the query control information, wherein K is an integer greater than 1 (as described above, Tan ¶ 0018 teaches that [a] dictionary (the preset output result table, which is the preset weight dictionary) may be queried by one of the entries (which one of the entries, in this instance, is nth layer weight group gradients) . . . and receive as an output the correlated entry; as also described above, Tan ¶ 0019 teaches a type of dictionary for weights . . . [w]eights dictionaries may be generated by offline K-means (e.g., Lloyd’s algorithms)); and
a codebook query unit configured to query K quantized weights in the first layer quantized weight group data from the preset codebook of the K encodings, wherein the preset codebook includes Q encodings and Q central weights corresponding to the Q encodings, and wherein Q is an integer greater than 1 (Han at p. 2, third full paragraph, teaches the weights are quantized (quantized weights) so that multiple connections share the same weight (Q central weights), thus only the codebook (effective weights) (the preset codebook) and the indices (Q encodings) need to be stored; the number of quantized weights is an integer K greater than 1, and the number of central weights is an integer Q greater than 1).
Tan, Han, and Mathew are analogous art because each discloses improving training efficiency for neural networks. Tan teaches embodiments may relate to dictionary-based, self-adaptive network architectures, training algorithms, and/or training schemes. Han teaches deep compression to reduce the storage required by neural networks. Mathew teaches reducing the complexity of convolutional neural networks. Thus, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention to implement the teachings of Han pertaining to deep compression of neural networks including a codebook structure that stores shared weights with the dictionary-based, self-adaptive network architectures of Tan and the quantization of Mathew.
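The motivation for doing so is that neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources (Han, Abstract).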
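The two-stage lookup recited in claim 2 (a dictionary query followed by a codebook query) can be sketched as follows. This is a minimal illustration only; the function `query_quantized_weights`, the table layouts, and the example values are assumptions and are not drawn from the application, Tan, or Han.

```python
def query_quantized_weights(addresses, weight_dictionary, codebook):
    """Resolve weight addresses to quantized weights (hypothetical helper)."""
    # Stage 1 (dictionary query): each address yields an encoding.
    encodings = [weight_dictionary[a] for a in addresses]
    # Stage 2 (codebook query): each encoding yields a central weight.
    return [codebook[e] for e in encodings]

# Assumed tables: four weight addresses (K = 4) and three encodings (Q = 3).
weight_dictionary = {0: 2, 1: 0, 2: 2, 3: 1}   # address -> encoding
codebook = {0: -0.5, 1: 0.0, 2: 0.75}          # encoding -> central weight

quantized = query_quantized_weights([0, 1, 2, 3], weight_dictionary, codebook)
print(quantized)
```

Addresses 0 and 2 carry the same encoding in this assumed example, so both resolve to the same central weight, which is the sharing effect the codebook is relied upon to teach.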
Regarding claim 3, the combination of Tan, Han, and Mathew teaches all of the limitations of claim 2, as described above.
	Han further teaches wherein the integrated circuit chip device further comprises a weight dictionary establishment unit configured to:
determine one or more closest central weights of each weight in the weight group data of n layers of the neural network (Han at page 3, Section 3, second full paragraph, teaches all the weights in the same bin share the same value, thus for each weight, we then need to store only a small index into a table of shared weights (that is, doing so requires determin[ing] one or more closest central weights . . . of the neural network)) to the Q central weights in the preset codebook prior to quantizing the first layer weight group data (Han at page 2, FIG. 1, teaches to generate the code book and then quantize the weights with the code book),
obtain the central weights corresponding to each weight in weight group data of n layers (Han at page 2, FIG. 1, teaches to cluster the weights (where clustering necessarily, without more, includes obtain the central weights corresponding to each weight in weight group data of n layers)),
determine encodings of the central weights corresponding to each weight in the weight group data of n layers of the preset codebook (Han at page 2, FIG. 3, illustrates the centroids fine tuning), and
obtain the encoding corresponding to each weight in the weight group data of n layers of the neural network and generate a weight dictionary (as described above, Han at page 3, Section 3, second full paragraph, teaches all the weights in the same bin share the same value, thus for each weight, we then need to store only a small index (encodings of the central weights) into a table of shared weights (generate a weight dictionary)).
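The steps mapped above for claim 3 can be sketched as a nearest-central-weight assignment. This is a minimal sketch under stated assumptions: the codebook is taken as already generated, distance is absolute difference, and the name `build_weight_dictionary` and the example values are hypothetical.

```python
def build_weight_dictionary(weights, codebook):
    """Map each weight position to the encoding of its nearest central weight."""
    dictionary = []
    for w in weights:
        # min() keeps the first minimum, so ties resolve to the lowest encoding.
        encoding = min(range(len(codebook)), key=lambda q: abs(w - codebook[q]))
        dictionary.append(encoding)
    return dictionary

# Assumed preset codebook: encoding q is the list position of central weight q.
codebook = [-1.0, 0.0, 1.5]
weights = [-0.9, 0.2, 1.1, -0.4, 1.6]

weight_dictionary = build_weight_dictionary(weights, codebook)
# Replacing each weight by its central weight gives the quantized weights.
quantized = [codebook[e] for e in weight_dictionary]
print(weight_dictionary, quantized)
```

Only the small integer encodings and the three central weights need to be retained in this assumed example, rather than the five original values.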
Regarding claim 4, the combination of Tan, Han, and Mathew teaches all of the limitations of claim 3, as described above.
Han further teaches wherein the processing circuit is configured to perform one or more of the steps from a group consisting of:
grouping a plurality of weights to obtain a plurality of groups (Han at page 3, Section 3, second full paragraph, teaches all the gradients are grouped by the color and summed together, multiplied by the learning rate and subtracted from the shared centroids from last iteration);
clustering weights in each group in the plurality of groups of a clustering algorithm to obtain a plurality of clusters (Han at page 4, Section 3.1, first paragraph, teaches us[ing] k-means clustering (clustering algorithm) to identify the shared weights for each layer of a trained network, so that all the weights that fall into the same cluster will share the same weight (clustering weights in each group in the plurality of groups of a clustering algorithm to obtain a plurality of clusters));
computing a central weight of each cluster in the plurality of clusters (Tan ¶ 0030 teaches a gradient of each centroid in a dictionary in a particular layer may be computed); and
encoding the central weight of each cluster in the plurality of clusters and generating the codebook (Han at page 5, Section 5, second full paragraph, teaches a codebook structure that stores the shared weight, and group-by-index (encoding the central weight of each cluster) after calculating the gradient of each layer. Each shared weight is updated with all the gradients that fall into that bucket (cluster)).
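The clustering and grouped-gradient steps cited for claim 4 can be sketched in a few lines. This is an illustrative sketch only: it uses one-dimensional k-means with evenly spaced initial centroids (an assumption), requires k of at least 2, and the helper names `assign`, `kmeans_1d`, and `update_centroids` are hypothetical. The centroid update follows the description quoted above, in which gradients are grouped, summed, multiplied by the learning rate, and subtracted from the shared centroids.

```python
def assign(weights, codebook):
    """Return, for each weight, the index of its nearest centroid."""
    return [min(range(len(codebook)), key=lambda q: abs(w - codebook[q]))
            for w in weights]

def kmeans_1d(weights, k, iterations=10):
    """Cluster scalar weights; returns (codebook, indices). Requires k >= 2."""
    lo, hi = min(weights), max(weights)
    # Assumed initialization: centroids evenly spaced over [min, max].
    codebook = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        indices = assign(weights, codebook)
        for q in range(k):
            members = [w for w, i in zip(weights, indices) if i == q]
            if members:  # an empty cluster keeps its previous centroid
                codebook[q] = sum(members) / len(members)
    # Re-assign once more so the indices match the final centroids.
    return codebook, assign(weights, codebook)

def update_centroids(codebook, indices, gradients, learning_rate):
    """Sum the gradients of each cluster and take one step per centroid."""
    summed = [0.0] * len(codebook)
    for i, g in zip(indices, gradients):
        summed[i] += g
    return [c - learning_rate * s for c, s in zip(codebook, summed)]

# Assumed example: six weights clustered into three central weights.
weights = [-1.0, -0.8, 0.1, 0.3, 1.9, 2.1]
codebook, indices = kmeans_1d(weights, k=3)
gradients = [0.1, 0.3, -0.2, 0.2, 0.5, 0.5]
tuned = update_centroids(codebook, indices, gradients, learning_rate=0.1)
print(codebook, indices, tuned)
```

In this assumed example the gradients of the middle cluster cancel, so only the first and third central weights move during fine-tuning.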
Regarding claim 8, Tan teaches [a] neural network training method for executing neural network training, the neural network comprising n layers with n being an integer greater than 1, wherein the neural network training method comprises:
receiving training instructions (Tan ¶¶ 0022-23 & FIG. 1 teaches neural networks (neural network), which may be referred to herein as a dictionary-based, self-adaptive networks (DSN). Various embodiments may relate to DSN architectures, training algorithms, and/or training schemes (training instructions for a neural network));
determining a first layer input data (Tan ¶ 0060 & FIG. 7 teaches at block 706 [of FIG. 7] an input may be prepared (determining first layer input data)) and a first layer weight group data (Tan ¶ 0019 teaches [m]apping of neural networks may include two different types of dictionaries, one type of dictionary for intermediate activations, and another type of dictionary for weights (a first layer weight group data). . . . Weights dictionaries may be generated by offline K-means (e.g., Lloyd's algorithms); see also Tan ¶ 0021, which teaches a “clustering method” (e.g. k-means) is a method of receiving input data and outputting representative index and a corresponding collection of centroids. In the instant context, such resultant clustering output is a group of weights, or simply, a weight dictionary) . . . ; 
querying a first layer output data corresponding to the first layer . . . input data and the first layer . . . weight group data from the preset output result table (Tan ¶ 0018 teaches [a] dictionary (a preset output result table) may be queried by one of the entries (querying a first layer output data) (e.g., the simple electronic message) and receive as an output the correlated entry. For example, a query by a simple electronic message may result in the output of a more complex data structure and a query by a more complex data structure may result in the output of a simple electronic message;
Examiner notes “first layer output data” as necessarily, without more, “corresponding to the first layer . . . input data and the first layer . . . weight group data” in a forward progression of a neural network), 
determining the first layer output data as the second layer input data and inputting the second layer input data into n-1 layers to execute forward operations to obtain the nth layer output data (Tan ¶ 0029 teaches [w]ith reference to FIG. 1, and equation (1) below, an output of a layer during a feed-forward phase may be computed (execute forward operations to obtain nth layer output data);
Examiner points out that a feed forward operation by definition would entail providing a previous layer output data as a second layer input data, which would be input to a second layer input data into n-1 layers (see, e.g., Tan Fig. 1));
determining nth layer output data gradients of the nth layer output data (Tan ¶ 0021 teaches “stochastic gradient descent” (SGD) is an optimization method for machine learning; Tan ¶ 0033 teaches more specifically, operating matrix 250 may data receive from the current layer and/or other layers (e.g., previous layer, next layer, ect. [sic]), and the gradient (ΔW) may be computed (determining nth layer output data gradients of the nth layer output data)),
obtaining the nth layer back operations among back operations of n layers of the training instructions (Tan ¶ 0063 & FIGs. 1 & 5 teaches gradients in each layer may be backward computed (obtaining the nth layer back operations among the back operations of the n layers of the training instructions). For example, gradients of a dictionary of each layer and a new operating matrix (e.g., new operating matrix 250/350) of each layer may be computed), . . . ;
querying nth layer input data gradients corresponding to the nth layer . . . output data gradients and a nth layer . . . input data from the preset output result table (Tan ¶ 0018 teaches, as described above, that [a] dictionary (the preset output result table) may be queried by one of the entries (query nth layer input data gradients) . . . and receive as an output the correlated entry; as also described above, Tan ¶ 0019 teaches a type of dictionary for weights . . . [w]eights dictionaries may be generated by offline K-means (e.g., Lloyd’s algorithms)), 
querying nth layer weight group gradients corresponding to the nth layer . . . output data gradients and a nth layer . . . weight group data from the preset output result table (Tan ¶ 0018 teaches, as described above, that [a] dictionary (the preset output result table) may be queried by one of the entries (which one of the entries, in this instance, is nth layer weight group gradients) . . . and receive as an output the correlated entry; as also described above, Tan ¶ 0019 teaches a type of dictionary for weights . . . [w]eights dictionaries may be generated by offline K-means (e.g., Lloyd’s algorithms)), and 
updating the weight group data of n layers according to the nth layer weight group gradients (Tan ¶ 0064 teaches a mapping dictionary and an index matrix for each layer may be updated (updating a weight group data of n layers according to the nth layer weight group gradients); as noted above, Tan ¶ 0021, teaches a “clustering method” (e.g. k-means) is a method of receiving input data and outputting representative index and a corresponding collection of centroids. In the instant context, such resultant clustering output is a group of weights, or simply, a weight dictionary); 
determining the nth input data gradients as the (n-1)th output data gradients, inputting the (n-1)th output data gradients into n-1 layers to execute back operations to obtain the n-1 weight group data gradients (Tan ¶ 0027 & FIG. 1 teaches a portion of a [neural] network 100, which may also be referred to herein as “system 100.” System 100 includes an activation array al 102 in layer l, a 2-D indexing matrix Ll in layer l, and a mapping dictionary (1-D array) Dl 105 in layer l. System 100 further includes activation array al-2 112, and activation array al-1 122, an activation array al+1 132, and indexing matrix Ll-1 114, an indexing matrix Ll+1 124, a mapping dictionary Dl-1 116, and a mapping dictionary Dl+1 126; Tan ¶ 0048 teaches gradients in each layer may be backward computed (determine the nth input data gradients as (n-1)th output data gradients); Tan ¶ 0032 & FIG. 1 teach that the backward gradients are determined by an input of a layer L1 data gradient (input the nth input data gradients) into layer Ll-1 for backward gradient of dictionary computing (into n-1 layers to execute back operations) . . . wherein Loss denotes the loss function for a neural network, δl denotes a propagating error from the final layer to layer l, and equation (5) may be used for updating dictionary D (to obtain n-1 weight group data gradients)), 
updating the n-1 weight group data corresponding to the n-1 weight group data gradients of the n-1 weight group data gradients, wherein the weight group data of each layer comprises at least two weights (Tan ¶ 0064 teaches a mapping dictionary and an index matrix for each layer may be updated (updating a weight group data of n layers of the nth layer weight group gradients); as noted above, Tan ¶ 0021, teaches a “clustering method” (e.g. k-means) is a method of receiving input data and outputting representative index and a corresponding collection of centroids. In the instant context, such resultant clustering output is a group of weights, or simply, a weight dictionary (for clustering to apply, centroids are based on weight group data of each layer comprises at least two weights)).
Though Tan teaches the feature of gradient generation for neural network layers, Tan does not explicitly teach quantizing such weights, in that Tan does not explicitly teach -
quantizing the first layer input data and the first layer weight group data to obtain a first layer quantized input data and a first layer quantized weight group data . . . , and quantizing the nth layer output data gradients to obtain nth layer quantized output data gradients . . . .
But Han teaches quantizing . . . the first layer weight group to obtain . . . first layer quantized weight group data (Han at page 2, third full paragraph, teaches the weights are quantized so that multiple connections share the same weight, thus only the codebook (effective weights) and the indices need to be stored; see also Han Fig. 3, which discloses [w]eight sharing by scalar quantization (top) and centroids fine-tuning (bottom)), . . . and quantizing the nth layer output data gradients to obtain nth layer quantized output data gradients (Han at page 2, third full paragraph, teaches the weights are quantized so that multiple connections share the same weight, thus only the codebook (effective weights) and the indices need to be stored; see also Han Fig. 3, which discloses [w]eight sharing by scalar quantization (top) and centroids fine-tuning (bottom)),
Tan and Han are analogous art because both disclose improving training efficiency for neural networks. Tan teaches embodiments may relate to dictionary-based, self-adaptive network architectures, training algorithms, and/or training schemes. Han teaches deep compression to reduce the storage required by neural networks. Thus, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention to implement the teachings of Han pertaining to deep compression of neural networks with the dictionary-based, self-adaptive network architectures of Tan.
The motivation for doing so is that neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources (Han, Abstract). Further, the claimed limitations, subsequent to the quantization, would be performed on any resulting quantized data by the combination of Tan and Han.
However, the combination of Tan and Han fails to explicitly teach quantizing the first layer input data . . . to obtain a first layer quantized input data . . . .
But Mathew teaches quantizing the first layer input data . . . to obtain a first layer quantized input data . . . . (Mathew right column at page 12, Section 2.4, first full paragraph, teaches that [s]tudies have shown that quantization of floating point (quantizing the first layer input data) as shown in Caffe Ristretto [13] and in Tensorflow [14]; Examiner points out quantization reduces the complexities of neural networks such as convolutional neural networks).
Mathew, Tan, and Han are analogous art because each discloses improving training efficiency for neural networks. Tan teaches embodiments may relate to dictionary-based, self-adaptive network architectures, training algorithms, and/or training schemes. Han teaches deep compression to reduce the storage required by neural networks. Mathew teaches reducing the complexity of convolutional neural networks. Thus, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention to implement the teachings of Mathew pertaining to sparse quantized training of a neural network with the improved training efficiency of Tan and Han.
The motivation for doing so is to reduce the complexity of convolutional networks, which includes quantizing the network to use 8-bit fixed point multiplications efficiently (Mathew, Abstract).
Regarding claim 9, the combination of Tan, Han, and Mathew teaches all of the limitations of claim 8, as described above.
Tan further teaches wherein the quantizing the first layer weight group data comprises:
obtaining quantization instructions and decoding the quantization instructions to obtain query control information, the query control information comprising address information corresponding to the first layer weight group data in a preset weight dictionary (Tan ¶ 0023 teaches each layer of a DSN architecture (n layers of the neural network) may include at least one indexing matrix, which may be paired with a mapping dictionary (e.g., a multi-dimensional mapping dictionary). An indexing matrix may include an address (obtain query control information includes address information) indexing to particular mapping dictionary (in a preset weight dictionary) that may include a plurality (e.g. 32 or 64) of floating-point numbers (corresponding to the first layer weight group data)) and the preset weight dictionary including encodings corresponding to all the weights in the weight group data of n layers of the neural network (as described above, Tan ¶ 0019 teaches that [w]eights dictionaries may be generated by offline K-means (e.g., Lloyd's algorithms) (encodings corresponding to all the weights in a weight group data); also as described above, Tan ¶ 0021 teaches a “clustering method” (e.g. k-means) is a method of receiving input data and outputting representative index and a corresponding collection of centroids);
querying K encodings corresponding to K weights in the first layer weight group data from the preset weight dictionary of the query control information, K being an integer greater than 1 (as described above, Tan ¶ 0018 teaches that [a] dictionary (the preset output result table, which is the preset weight dictionary) may be queried by one of the entries (which one of the entries, in this instance, is nth layer weight group gradients) . . . and receive as an output the correlated entry; as also described above, Tan ¶ 0019 teaches a type of dictionary for weights . . . [w]eights dictionaries may be generated by offline K-means (e.g., Lloyd’s algorithms)); and
querying K quantized weights in the first layer quantized weight group data from the preset codebook of the K encodings, the preset codebook including Q encodings and Q central weights corresponding to the Q encodings, and Q is an integer greater than 1 (Han at p. 2, third full paragraph, teaches the weights are quantized (quantized weights) so that multiple connections share the same weight (Q central weights), thus only the codebook (effective weights) (the preset codebook) and the indices (Q encodings) need to be stored; the number of quantized weights is an integer K greater than 1, and the number of central weights is an integer Q greater than 1).
Tan, Han, and Mathew are analogous art because each discloses improving training efficiency for neural networks. Tan teaches embodiments may relate to dictionary-based, self-adaptive network architectures, training algorithms, and/or training schemes. Han teaches deep compression to reduce the storage required by neural networks. Mathew teaches reducing the complexity of convolutional neural networks. Thus, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention to implement the teachings of Han pertaining to deep compression of neural networks including a codebook structure that stores shared weights with the dictionary-based, self-adaptive network architectures of Tan and the quantization of Mathew.
The motivation for doing so is that neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources (Han, Abstract).
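For illustration only, the two-stage lookup recited in claim 9 (weight dictionary to encodings, then codebook to central weights) can be sketched as below. This is a minimal sketch under stated assumptions, not the Applicant's implementation and not code from Tan or Han: the function name, the list-based weight dictionary, the dict-based codebook, and all numeric values are hypothetical.

```python
# Hypothetical illustration; names and values are not taken from the claims,
# Tan, or Han.

def quantize_weight_group(addresses, weight_dictionary, codebook):
    """Two-stage lookup: address -> encoding -> central (quantized) weight."""
    # Stage 1: query the K encodings stored in the weight dictionary.
    encodings = [weight_dictionary[addr] for addr in addresses]
    # Stage 2: query the K quantized weights from the codebook.
    return [codebook[enc] for enc in encodings]


# Assumed example data: Q = 4 central weights, K = 6 weights in the layer.
codebook = {0: -1.0, 1: 0.0, 2: 1.5, 3: 2.0}   # encoding -> central weight
weight_dictionary = [3, 0, 1, 1, 2, 0]         # encoding for each weight position
addresses = range(len(weight_dictionary))      # address information from decoding

quantized = quantize_weight_group(addresses, weight_dictionary, codebook)
print(quantized)
```

Under this reading, only the Q central weights and the per-weight encodings need to be stored, which is the storage saving Han attributes to weight sharing.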
Regarding claim 10, the combination of Tan, Han, and Mathew teaches all of the limitations of claim 9, as described above.
Han further teaches wherein the preset weight dictionary is obtained according to the following steps:
determining one or more closest central weights of each weight in the weight group data of n layers of the neural network (Han at page 3, Section 3, second full paragraph, teaches all the weights in the same bin share the same value, thus for each weight, we then need to store only a small index into a table of shared weights (that is, to do so requires determining one or more closest central weights . . . of the neural network)) to the Q central weights in the preset codebook, prior to quantizing the first layer weight group data (Han at page 2, FIG. 1, teaches to generate the code book and then quantize the weights with the code book),
obtaining the central weights corresponding to each weight in the weight group data of n layers (Han at page 2, FIG. 1, teaches to cluster the weights (where clustering necessarily, without more, includes obtaining the central weights corresponding to each weight in the weight group data of n layers)); and
determining encodings of the central weights corresponding to each weight in the weight group data of n layers of the preset codebook (Han at page 2, FIG. 3, illustrates the centroids fine tuning),
obtaining the encoding corresponding to each weight in the weight group data of n layers of the neural network and generating a weight dictionary (as described above, Han at page 3, Section 3, second full paragraph, teaches all the weights in the same bin share the same value, thus for each weight, we then need to store only a small index (encodings of the central weights) into a table of shared weights (generate a weight dictionary)).
Regarding claim 11, the combination of Tan, Han, and Mathew teaches all of the limitations of claim 10, as described above.
Han further teaches wherein the preset codebook is obtained according to the following steps:
grouping a plurality of weights to obtain a plurality of groups (Han at page 3, Section 3, second full paragraph, teaches all the gradients are grouped by the color and summed together, multiplied by the learning rate and subtracted from the shared centroids from last iteration);
clustering weights in each group in the plurality of groups of the clustering algorithm to obtain a plurality of clusters (Han at page 4, Section 3.1, first paragraph, teaches us[ing] k-means clustering (clustering algorithm) to identify the shared weights for each layer of a trained network, so that all the weights that fall into the same cluster will share the same weight (clustering weights in each group in the plurality of groups of a clustering algorithm to obtain a plurality of clusters));
computing the central weight of each cluster in the plurality of clusters (Tan ¶ 0030 teaches a gradient of each centroid in a dictionary in a particular layer may be computed (computing the central weight . . .));
encoding the central weight of each cluster in the plurality of clusters and generating the codebook (Han at page 5, Section 5, second full paragraph, teaches a codebook structure that stores the shared weight, and group-by-index (encoding the cluster)).
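As a hedged illustration of the steps recited in claims 10 and 11 (clustering a group of weights, computing each cluster's central weight, encoding the central weights into a codebook, and recording each weight's encoding), the sketch below applies a one-dimensional k-means to a single group. Everything here is an assumption made for illustration: the function name, the evenly spaced initial centroids (chosen only for determinism), the iteration count, and the sample weights are hypothetical and are not drawn from the claims or from Han.

```python
# Hypothetical illustration of codebook and weight-dictionary generation.

def build_codebook(weights, q, iterations=10):
    """Cluster one group of weights into q clusters (q > 1) with 1-D k-means."""
    lo, hi = min(weights), max(weights)
    # Evenly spaced initial centroids across [lo, hi] (assumption).
    centroids = [lo + (hi - lo) * i / (q - 1) for i in range(q)]
    for _ in range(iterations):
        clusters = [[] for _ in range(q)]
        for w in weights:
            # Assign each weight to its closest central weight.
            nearest = min(range(q), key=lambda i: abs(w - centroids[i]))
            clusters[nearest].append(w)
        # Central weight = mean of the cluster; keep the old value if empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    # Codebook: encoding -> central weight.
    codebook = dict(enumerate(centroids))
    # Weight dictionary: encoding of the closest central weight, per weight.
    dictionary = [min(codebook, key=lambda e: abs(w - codebook[e]))
                  for w in weights]
    return codebook, dictionary


# Assumed example: one group of 8 weights clustered into Q = 3 clusters.
group = [0.1, 0.2, 0.15, 2.0, 2.1, 1.9, -1.0, -1.1]
codebook, dictionary = build_codebook(group, q=3)
print(codebook)
print(dictionary)
```

Where weights are first grouped (for example, one group per layer), the same function would be called once per group, yielding one codebook per group.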
Regarding claim 13, the combination of Tan, Han, and Mathew teaches all of the limitations of claim 11, as described above.
Han further teaches wherein the neural network comprises a convolution layers, b full connection layers (Han at page 6, Section 5.1, first partial paragraph, teaches a convolutional network that has two convolutional layers and two fully connected layers) and
Tan further teaches c long short-term memory network layers (Tan ¶ 0016 teaches [t]he most popular architectures, such as convolutional neural networks (CNN) and long short term memory (LSTM), are end-to-end systems, which minimize human interference; Examiner submits it is fair to say an LSTM architecture includes c long short-term memory layers),
wherein the grouping a plurality of weights to obtain a plurality of groups comprises grouping weights in each convolution layer of the plurality of weights into a group, weights in each full connection layer of the plurality of weights into a group and weights in each long short-term memory network layer of the plurality of weights into a group to obtain (a+b+c) groups (Tan ¶ 0048 teaches gradients in each layer may be backward computed. For example, gradients of a dictionary of each layer (obtain (a+b+c) groups) and a new operating matrix (e.g., new operating matrix 250/350) of each layer may be computed; Examiner points out that Tan Fig. 1 teaches backward propagation to generate weights at each layer. See also Tan ¶ 0021, which teaches a corresponding collection of centroids (that is, group weights . . . into a group to obtain (a+b+c) groups)); and
wherein the clustering weights in each group in the plurality of groups of a clustering algorithm comprises clustering weights in each of the (a+b+c) groups of a K- medoids algorithm (Tan ¶ 0054 teaches an indexing matrix and a mapping dictionary may be generated via a clustering method (e.g. K-means clustering method); Examiner notes a k-medoids algorithm is a clustering algorithm related to the k-means algorithm, which is a matter of design choice in the selection of a desired clustering algorithm (see, e.g., Hu ¶¶ 0130-31, which teaches one or more clustering algorithms, including k-means, k-medoids, CLARA . . . [etc.])).
Regarding claim 14, the combination of Tan, Han, and Mathew teaches all of the limitations of claim 13, as described above.
Mathew further teaches wherein the quantizing the first layer input data comprises: preprocessing any element value in the first layer input data using a clip (-zone, zone) operation (Mathew left column at page 14, Section 3.3, fifth full paragraph, teaches clip() functions restricts the values to the given range (clip (-zone, zone) operation)) to obtain the first layer preprocessing data in the preset section [-zone, zone], zone being greater than 0 (Mathew right column at page 13, Section 3.3, first full paragraph, teaches the quantization approach used in this work uses a middle ground between complexity and accuracy. Similar to Ristretto’s approach, the minimum and maximum ranges of various tensors (weights, inputs and outputs) are computed (obtain the first layer preprocessing data in the preset section [-zone, zone]) for all the layers during training time (preprocess element values in the first layer input data), from a subset of the training data); and
determining M values in the preset section [-zone, zone], M being a positive integer, computing absolute values of differences between the first layer preprocessing data and the M values respectively to obtain M absolute values (Mathew right column at page 13, Section 3.3, first full paragraph, teaches [s]igned tensors are to be quantized between -128 and +127; while unsigned tensors are to be quantized between 0 and 255 (computing absolute values of differences between the first layer preprocessing data and the M values respectively to obtain M absolute values)), and
determining the minimum absolute value of the M absolute values as the quantized element value corresponding to the element value (Mathew right column at page 12, Section 2.3, first partial paragraph, teaches [a] network is sparsified by training in iterations with high weight decay, and whenever the absolute value of a weight falls below predefined threshold (determine a minimum absolute value of the M absolute values), it is thresholded to zero (as the quantized element value)).
Tan, Han, and Mathew are analogous art because each discloses improving training efficiency for neural networks. Tan teaches embodiments may relate to dictionary-based, self-adaptive network architectures, training algorithms, and/or training schemes. Han teaches deep compression to reduce the storage required by neural networks. Mathew teaches reducing the complexity of convolutional neural networks. Thus, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention to implement the teachings of Mathew pertaining to sparse quantized training of a neural network with the improved training efficiency of Tan and Han.
The motivation for doing so is to reduce the complexity of convolutional networks, which includes quantizing the network to use 8-bit fixed point multiplications efficiently (Mathew, Abstract).
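For illustration, the input quantization recited in claim 14 (clip to [-zone, zone], compare against M values, select by minimum absolute difference) can be sketched as below. This is one possible reading only, stated as an assumption: the sketch takes the quantized element value to be the one of the M values whose absolute difference from the clipped element is smallest. The evenly spaced choice of the M values, the requirement that M be at least 2, and all names are hypothetical and are not taken from the claims or from Mathew.

```python
# Hypothetical illustration of clip-then-nearest-level quantization.

def clip(x, zone):
    """Restrict x to the preset section [-zone, zone], with zone > 0."""
    return max(-zone, min(zone, x))


def quantize_element(x, zone, m):
    """Return the one of m candidate values closest to the clipped element."""
    if zone <= 0 or m < 2:
        raise ValueError("this sketch assumes zone > 0 and m >= 2")
    clipped = clip(x, zone)
    # M evenly spaced candidate values in [-zone, zone] (assumption).
    levels = [-zone + 2 * zone * i / (m - 1) for i in range(m)]
    # M absolute differences between the preprocessed element and the M values.
    diffs = [abs(clipped - v) for v in levels]
    # Select the candidate with the minimum absolute difference.
    return levels[diffs.index(min(diffs))]


# Assumed example: zone = 1.0 and M = 5 give levels -1.0, -0.5, 0.0, 0.5, 1.0.
inputs = [0.3, 3.7, -0.8]
quantized_inputs = [quantize_element(x, zone=1.0, m=5) for x in inputs]
print(quantized_inputs)
```

The out-of-range input 3.7 is first clipped to 1.0, so it maps to the top level, while in-range inputs map to their nearest level.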
Regarding claim 15, the combination of Tan, Han, and Mathew teaches all of the limitations of claim 8, as described above.
Han teaches wherein the at least two weights are grouped based on a layer type of a corresponding layer (Han at p. 3, Section 3, second paragraph, & FIG. 3, teaches a layer that has 4 input neurons and 4 output neurons, the weight is a 4 x 4 matrix. . . . The weights are quantized to 4 bins (denoted with 4 colors), all the weights in the same bin share the same value . . . During update, all the gradients are grouped by the color (that is, are grouped based on a layer type of a corresponding layer)).
Mathew, Tan, and Han are analogous art because each discloses improving training efficiency for neural networks. Tan teaches embodiments may relate to dictionary-based, self-adaptive network architectures, training algorithms, and/or training schemes. Han teaches deep compression to reduce the storage required by neural networks. Mathew teaches reducing the complexity of convolutional neural networks. Thus, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention to implement the teachings of Mathew pertaining to sparse quantized training of a neural network with the improved training efficiency of Tan and Han.
The motivation for doing so is to reduce the complexity of neural networks through deep compression that operates to reduce the size of training data and the IC real estate required by the neural network (Han, Abstract).
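The update that the cited passage of Han describes (gradients grouped by bin, summed, multiplied by the learning rate, and subtracted from the shared centroids of the last iteration) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the flat-list layout, and the numeric values are hypothetical and are not Han's code or data.

```python
# Hypothetical illustration of shared-centroid fine-tuning.

def fine_tune_centroids(centroids, indices, gradients, learning_rate):
    """Update shared centroids from per-weight gradients grouped by bin index."""
    summed = [0.0] * len(centroids)
    for idx, grad in zip(indices, gradients):
        # Group gradients by the bin (index) of the weight they belong to.
        summed[idx] += grad
    # Subtract the scaled, summed gradient from each centroid.
    return [c - learning_rate * s for c, s in zip(centroids, summed)]


# Assumed example: 4 bins shared by 6 weights.
centroids = [2.0, 1.5, 0.0, -1.0]
indices = [3, 0, 1, 1, 2, 0]                  # bin index of each weight
gradients = [0.1, -0.2, 0.3, 0.1, -0.1, 0.4]  # gradient of each weight
updated = fine_tune_centroids(centroids, indices, gradients, learning_rate=0.5)
print(updated)
```

Because every weight in a bin shares one value, the update touches only as many parameters as there are bins, not one per weight.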
Regarding claim 16, the combination of Tan, Han, and Mathew teaches all of the limitations of claim 8, as described above.
Han teaches wherein the at least two weights are grouped based on an inter-layer structure of a corresponding layer (Han, at p. 3, Section 3, second paragraph, teaches we are able to quantize to 8-bits (256 shared weights) for each CONV layers, and 5-bits (32 shared weights) for each FC layer (CONV layer and FC layer being inter-layer) without any loss of accuracy (are grouped based on an inter-layer structure of a corresponding layer)).
Mathew, Tan, and Han are analogous art because each discloses improving training efficiency for neural networks. Tan teaches embodiments may relate to dictionary-based, self-adaptive network architectures, training algorithms, and/or training schemes. Han teaches deep compression to reduce the storage required by neural networks. Mathew teaches reducing the complexity of convolutional neural networks. Thus, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention to implement the teachings of Mathew pertaining to sparse quantized training of a neural network with the improved training efficiency of Tan and Han.
The motivation for doing so is to reduce the complexity of neural networks through deep compression that operates to reduce the size of training data and the IC real estate required by the neural network (Han, Abstract).
Regarding claim 17, the combination of Tan, Han, and Mathew teaches all of the limitations of claim 8, as described above.
Han teaches wherein the at least two weights are grouped based on an intra-layer structure of a corresponding layer (Han at p. 3, Section 3, second paragraph, teaches a layer that has 4 input neurons and 4 output neurons (that is, an intra-layer structure of a corresponding layer), the weight is a 4 x 4 matrix. . . . The weights are quantized to 4 bins (denoted with 4 colors), all the weights in the same bin share the same value . . . During update, all the gradients are grouped by the color (that is, are grouped based on an intra-layer structure of a corresponding layer)).
Mathew, Tan, and Han are analogous art because each discloses improving training efficiency for neural networks. Tan teaches embodiments may relate to dictionary-based, self-adaptive network architectures, training algorithms, and/or training schemes. Han teaches deep compression to reduce the storage required by neural networks. Mathew teaches reducing the complexity of convolutional neural networks. Thus, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention to implement the teachings of Mathew pertaining to sparse quantized training of a neural network with the improved training efficiency of Tan and Han.
The motivation for doing so is to reduce the complexity of neural networks through deep compression that operates to reduce the size of training data and the IC real estate required by the neural network (Han, Abstract).
8.	Claims 5-7 and 12 are rejected under 35 U.S.C. § 103 as being unpatentable over US Published Application 20180032860 to Tan et al. [hereinafter Tan], in view of Han et al., “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,” pp. 1-14 (Conference Paper ICLR 2016) [hereinafter Han], and further in view of Mathew et al., “Sparse, Quantized, Full Frame CNN for Low Power Embedded Devices,” pp. 11-19 (CVPR 2017) [hereinafter Mathew], and US Published Application 20170228683 to Hu et al. [hereinafter Hu].
Regarding claim 5, the combination of Tan, Han, and Mathew teaches all of the limitations of claim 4, as described above.
However, the combination of Tan, Han, and Mathew fails to explicitly teach wherein the clustering algorithm comprises one or more of a group consisting of K-means algorithm, K-medoids algorithm, Clara algorithm and Clarans algorithm.
But Hu teaches wherein the clustering algorithm comprises one or more of a group consisting of K-means algorithm, K-medoids algorithm, Clara algorithm and Clarans algorithm (Hu ¶ 0131 teaches algorithms may also include: a partitioning method such as K-means, K-medoids, CLARA (Clustering LARge Application), CLARANS (Clustering Large Application based upon RANdomized Search) . . . .).
Hu is analogous art to Tan, Han, and Mathew because each discloses compression techniques for system efficiency. Tan teaches embodiments may relate to dictionary-based, self-adaptive network architectures, training algorithms, and/or training schemes. Han teaches deep compression to reduce the storage required by neural networks. Mathew teaches reducing the complexity of convolutional neural networks. Hu teaches the use of clustering algorithms to customize a particular distance and a particular time period. Thus, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention to implement the teachings of Hu pertaining to clustering algorithms with the improved training efficiencies of Tan, Han, and Mathew.
The motivation for doing so is for versatility of using other partitioning methods for clustering algorithms. (Hu ¶ 0131).
Regarding claim 6, the combination of Tan, Han, Mathew, and Hu teaches all of the limitations of claim 5, as described above.
Han further teaches wherein the neural network comprises a convolution layers, b full connection layers (Han at page 6, Section 5.1, first partial paragraph, teaches a convolutional network that has two convolutional layers and two fully connected layers) and 
Tan further teaches c long short-term memory network layers (Tan ¶ 0016 teaches [t]he most popular architectures, such as convolutional neural networks (CNN) and long short term memory (LSTM), are end-to-end systems, which minimize human interference; Examiner submits it is fair to say an LSTM architecture includes long short-term memory layers), and
wherein the processing circuit is further configured to
group weights in each convolution layer of the plurality of weights into a group, weights in each full connection layer of the plurality of weights into a group and weights in each long short-term memory network layer of the plurality of weights into a group to obtain (a+b+c) groups (Tan ¶ 0048 teaches gradients in each layer may be backward computed. For example, gradients of a dictionary of each layer (obtain (a+b+c) groups) and a new operating matrix (e.g., new operating matrix 250/350) of each layer may be computed; Examiner points out that Tan Fig. 1 teaches backward propagation to generate weights at each layer. See also Tan ¶ 0021, which teaches a corresponding collection of centroids (that is, group weights . . . into a group to obtain (a+b+c) groups)), and
cluster weights in each of the (a+b+c) groups of the K-medoids algorithm (Tan ¶ 0054 teaches an indexing matrix and a mapping dictionary may be generated via a clustering method (e.g. K-means clustering method); Examiner notes a k-medoids algorithm is a clustering algorithm related to the k-means algorithm, which is a matter of design choice in the selection of a desired clustering algorithm (see, e.g., Hu ¶¶ 0130-31, which teaches one or more clustering algorithms, including k-means, k-medoids, CLARA . . . [etc.])).
Regarding claim 7, the combination of Tan, Han, Mathew, and Hu teaches all of the limitations of claim 6, as described above.
Mathew further teaches wherein the processing circuit further comprises:
a preprocessing unit configured to preprocess element values in the first layer input data using a clip (-zone, zone) operation (Mathew left column at page 14, Section 3.3, fifth full paragraph, teaches clip() functions restricts the values to the given range (clip (-zone, zone) operation)) to obtain the first layer preprocessing data in the preset section [-zone, zone], zone being greater than 0 (Mathew right column at page 13, Section 3.3, first full paragraph, teaches the quantization approach used in this work uses a middle ground between complexity and accuracy. Similar to Ristretto’s approach, the minimum and maximum ranges of various tensors (weights, inputs and outputs) are computed (obtain the first layer preprocessing data in the preset section [-zone, zone]) for all the layers during training time (preprocess element values in the first layer input data), from a subset of the training data); and
a determination unit configured to
determine M values in the preset section [-zone, zone], M being a positive integer, compute absolute values of differences between the first layer preprocessing data and the M values respectively to obtain M absolute values (Mathew right column at page 13, Section 3.3, first full paragraph, teaches [s]igned tensors are to be quantized between -128 and +127; while unsigned tensors are to be quantized between 0 and 255 (compute absolute values of differences between the first layer preprocessing data and the M values respectively to obtain M absolute values)), and
determine a minimum absolute value of the M absolute values as the quantized element value corresponding to the element value (Mathew right column at page 12, Section 2.3, first partial paragraph, teaches [a] network is sparsified by training in iterations with high weight decay, and whenever the absolute value of a weight falls below predefined threshold (determine a minimum absolute value of the M absolute values), it is thresholded to zero (as the quantized element value)).
Mathew, Tan, and Han are analogous art because each discloses improving training efficiency for neural networks. Tan teaches embodiments may relate to dictionary-based, self-adaptive network architectures, training algorithms, and/or training schemes. Han teaches deep compression to reduce the storage required by neural networks. Mathew teaches reducing the complexity of convolutional neural networks. Thus, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention to implement the teachings of Mathew pertaining to sparse quantized training of a neural network with the improved training efficiency of Tan and Han.
The motivation for doing so is to reduce the complexity of convolutional networks, which includes quantizing the network to use 8-bit fixed point multiplications efficiently (Mathew, Abstract).
Regarding claim 12, the combination of Tan, Han, and Mathew teaches all of the limitations of claim 11, as described above.
However, the combination of Tan, Han, and Mathew fails to explicitly teach wherein the clustering algorithm comprises one or more of a group consisting of K-means algorithm, K-medoids algorithm, Clara algorithm and Clarans algorithm.
	But Hu teaches wherein the clustering algorithm comprises one or more of a group consisting of K-means algorithm, K-medoids algorithm, Clara algorithm and Clarans algorithm (Hu ¶ 0131 teaches algorithms may also include: a partitioning method such as K-means, K-medoids, CLARA (Clustering LARge Application), CLARANS (Clustering Large Application based upon RANdomized Search) . . . .).
Hu is analogous art to Tan, Han, and Mathew because each discloses compression techniques for system efficiency. Tan teaches embodiments may relate to dictionary-based, self-adaptive network architectures, training algorithms, and/or training schemes. Han teaches deep compression to reduce the storage required by neural networks. Mathew teaches reducing the complexity of convolutional neural networks. Hu teaches the use of clustering algorithms to customize a particular distance and a particular time period. Thus, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date of the invention to implement the teachings of Hu pertaining to clustering algorithms with the improved training efficiencies of Tan, Han, and Mathew.
The motivation for doing so is for versatility of using other partitioning methods for clustering algorithms. (Hu ¶ 0131).
Response to Arguments
9.	Applicant’s arguments have been fully considered but they are not persuasive. Examiner responds below.
10.	Applicant argues that the combination of Tan, Han, and Mathew, either alone or in combination fails to teach or suggest a processing circuit configured to "query nth layer weight group gradients corresponding to the nth layer quantized output data gradients and a nth layer quantized weight group data from the preset output result table," as recited (emphasis in original). Applicant submits that [c]ontrary to the allegation, paragraphs [0018]-[0019] of Tan at most show two types of dictionaries for intermediate activations and weights respectively and are completely silent regarding any operations performed on weight group gradients. As such, it is respectfully submitted that the cited portion of Tan does not disclose the cited feature. (Applicant’s Response at p. 11 (emphasis in original)).
	Examiner points to Applicant’s feature of a “preset output result table,” which from the Applicant’s claims is a table for queries. Tan ¶ 0001 teaches dictionary-based, self-adaptive neural networks. Tan ¶ 0018 teaches a “dictionary” may include any computer-readable data that includes one or more “entries” to associate one item with another. Fig. 1 of Tan teaches a neural network system:

[Image: Tan Fig. 1, neural network system]

Tan ¶ 0027 teaches that [s]ystem 100 includes an activation array a_l 102 in layer l, a 2-D indexing matrix L_l in layer l, and a mapping dictionary (1-D array) D_l 106 in layer l. System 100 further includes activation array a_{l-2} 112, an activation array a_{l-1} 122, an activation array a_{l+1} 132, an indexing matrix L_{l-1} 114, an indexing matrix L_{l+1} 124, a mapping dictionary D_{l-1} 116, and a mapping dictionary D_{l+1} 126. Tan ¶ 0024 teaches a gradient of each centroid in a dictionary in a particular layer may be computed by summarizing the gradients from indexing positions which map to the same specific centroid (a centroid is weight group data).
Expansively, Tan ¶ 0028 teaches that arrays L and dictionaries D can vary dimensionally, and mapping between arrays L and dictionaries D may vary based on applications. Examiner notes that queries may be made of the mappings of Tan based on its dictionary (that is, a preset output result table). Examiner also notes that neither the Applicant’s claims nor the Applicant’s Specification defines a preset output result table, and accordingly, the plain and ordinary meaning of such term applies - that is, the BRI includes the mappings of Tan based on its dictionary.
Also, Applicant appears to argue that the cited prior art references fail to show certain features of Applicant’s invention. The Applicant argues that Tan teaches centroid-based clustering, and that Applicant’s claims relate to weight group data. As such, Applicant appears to argue that the prior art reference of Tan fails to show the feature of Applicant’s claims.
But Applicant’s claims and specification are silent as to whether “group weight data” pertains, for example, to centroid-based clustering or some other basis to “determine . . . a first layer weight group data” (claim 1, line 5). Applicant’s claims simply recite, for example, “determine . . . group weight data.” (see, e.g., claim 1, line 5 (“determine . . . a first layer weight group data”)).
As described in the rejections set out herein, Tan teaches, inter alia, the feature to “determine . . . a first layer weight group data” (Tan ¶¶ 0019, 0021).
Moreover, the rejection above clearly sets forth which claim limitations are taught by each of the references, and the reason why it would be obvious to one of ordinary skill in the art as of the effective filing date of the Applicant’s invention to combine their teachings, and Applicant has not explained why they cannot be combined in the manner set forth in the rejection.
11.	Applicant argues that “amended claim 1 . . . recite[s] ‘update a weight group data of n layers according to the nth layer weight group gradients.’ The Office Action points to paragraph [0064] of Tan to teach "update a weight group data" while paragraph [0064] states ‘a mapping dictionary and an indexing matrix for each layer maybe updated.’ First, as argued above, the mapping dictionary of Tan at most includes centroids, one set of input data as the result of performing clustering method on the received input data, rather than weights or weight data. Thus, updating the mapping dictionary in [Tan] paragraph [0064] does not teach or suggest updating a weight group data. Further, the Office Action points to [Tan] paragraph [0021] and alleges that "resultant clustering output is a group of weights, or simply, a weight dictionary." However, the allegation is not supported by the cited portion of Tan. It is respectfully submitted that the Office Action fails to show how a clustering method converts received input data into weight data (i.e., centroids) as outputs. . . . [E]ven assuming, arguendo, that the centroid in Tan is weight group data, it is not shown that the dictionary [of Tan] is updated based on the gradient of the centroid. As such, it is respectfully submitted that the cited portion of Tan does not disclose the cited features.” (Response at p. 12 (emphasis in original)).
Examiner respectfully disagrees. As above, Applicant appears to argue that the cited prior art references fail to show certain features of Applicant’s invention. For example, Applicant argues that Tan teaches centroid-based clustering, and that Applicant’s claims instead relate to weight group data. As such, Applicant appears to argue that the prior art reference of Tan fails to show the feature of Applicant’s claims. 
But Applicant’s claims and specification are silent as to the manner in which “weight group data” is arrived at. For example, the claims do not specify whether such “weight group data” is realized by centroid-based clustering, or on some other basis, to “update a weight group data of n layers.” Applicant’s claims simply and broadly recite, for example, to “update a weight group data of n layers.” (See claim 1, line 26).
As described in the rejections set out in detail herein, Tan teaches, inter alia, the feature to “update a first layer weight group data” (Tan ¶¶ 0019, 0021). Moreover, the rejections set out in the Final Office Action clearly set forth which claim limitations are taught by each of the references, and the reason why it would be obvious to a person 
Conclusion
12.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
(US Published Application 20180046916 to Dally et al.) teaches that the number of weights that can be eliminated in a neural network varies widely across the layers of the neural network, but that eliminating weights results in a neural network with a substantial number of zero values, which can potentially reduce the computational requirements of inference.
13.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to KEVIN L. SMITH whose telephone number is (571) 272-5964. Normally, the examiner is available on Monday-Thursday 0730-1730. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KAKALI CHAKI, can be reached on 571-272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.


/K.L.S./
Examiner, Art Unit 2122

/BABOUCARR FAAL/Primary Examiner, Art Unit 2184