Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This Office Action is responsive to Applicants' Amendment filed on June 27, 2022, in which claims 1, 4, and 19 are currently amended. Claims 1-20 are pending.

Specification
Applicant's amendments to the specification are acknowledged. Examiner’s objections to the specification are hereby withdrawn in view of those amendments.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on November 8, 2019 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Response to Arguments
With respect to Applicant’s remarks regarding foreign priority, Examiner acknowledges that the conditions for foreign priority have been met and a new bibliographic data sheet has been included.

The rejections of claims 1 and 4 under 35 U.S.C. § 112(b) are hereby withdrawn in view of Applicant's amendments and remarks.
Applicant’s arguments, based on the amendment, with respect to the rejection of claims 1 and 13 under 35 U.S.C. 103(a) have been considered but are not deemed persuasive.
With respect to Applicant's arguments that Courbariaux does not disclose "forming a plurality of disjoint subsets from the plurality of layers", Examiner respectfully disagrees.  Courbariaux explicitly teaches that the subsets are derived from the layers of the neural network.  Examiner asserts that it would be improper to treat this claim language as equivalent in meaning to "dividing the layers of a DNN into subsets", which is taught by the secondary reference as outlined in the previous office action.
With respect to Applicant's argument that a right bit shift is not synonymous with dividing by two, Examiner respectfully disagrees.  It is well known in the art that right shifting has the same effect as dividing by two.  Furthermore, it would be obvious to one of ordinary skill in the art that right bit shifting would leave an unused element in the mantissa which could readily be quantized.  Examiner therefore maintains that it would be obvious to one of ordinary skill in the art that if the floating point mantissa were shifted right and the radix point were necessarily moved (mantissa bit length changed), this would be exactly synonymous with changing the scaling factor by a factor of 2.  This is fully supported by the disclosure of Courbariaux ([p. 3 §4] "The scaling factor can be seen as the position of the radix point. It is usually fixed, hence the name 'fixed point'. Reducing the scaling factor reduces the range and augments the precision of the format. The scaling factor is typically a power of two for computational efficiency (the scaling multiplications are replaced with shifts).").
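For illustration only, the arithmetic discussed above can be sketched as follows. The sketch assumes a fixed point value stored as an integer mantissa multiplied by a power-of-two scaling factor; the function names `fixed_to_real` and `shift_right_rescale` and the example values are hypothetical and are not taken from Courbariaux or from the instant application.

```python
from typing import Tuple


def fixed_to_real(mantissa: int, exponent: int) -> float:
    """Value represented by an integer mantissa and a scaling factor of 2**exponent."""
    return mantissa * 2.0 ** exponent


def shift_right_rescale(mantissa: int, exponent: int) -> Tuple[int, int]:
    """Drop the least significant mantissa bit and double the scaling factor."""
    return mantissa >> 1, exponent + 1


# An arithmetic right shift by one equals floor division by two for Python integers.
for m in range(-64, 65):
    assert m >> 1 == m // 2

# Hypothetical example: mantissa 52 with scaling factor 2**-4 represents 3.25.
m, e = 52, -4
m2, e2 = shift_right_rescale(m, e)
print(fixed_to_real(m, e), fixed_to_real(m2, e2))  # prints "3.25 3.25"
```

In this sketch an even mantissa keeps its represented value exactly after the shift, while an odd mantissa loses the value of its dropped least significant bit.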
With respect to Applicant's arguments that the dynamic-precision number format of Courbariaux is not directed towards changing the precision of a range, Examiner respectfully disagrees.  See Courbariaux ([Abstract] "For each of those datasets and for each of those formats, we assess the impact of the precision of the multiplications on the final error after training. We find that very low precision is sufficient not just for running trained networks but also for training them").
With respect to Applicant's arguments that Alstrom does not teach the use of fixed point number formats, Examiner respectfully asserts that as outlined in the office action, the combination of Alstrom and Courbariaux is used to teach the claims and not Alstrom alone.  
In light of the amended claims, Alstrom is used to reinforce the obviousness of a threshold for determining subsets of a neural network.  
New grounds of rejection have been deemed appropriate in light of the amended claim limitations.
 
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are:
“hardware logic configured to” in claims 19 and 20.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.  Structure for “hardware logic” is seen as being provided in at least [¶0041] “a hardware implementation of a DNN comprises hardware logic configured to process the input data to each layer in accordance with that layer and generate output data for that layer which either becomes the input data to another layer or becomes the output of the DNN.”
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: 
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


	Claims 1-2, 5, 7, 8, 11, and 13-20 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Courbariaux (“TRAINING DEEP NEURAL NETWORKS WITH LOW PRECISION MULTIPLICATIONS”, 2015) and Gysel (“Ristretto: Hardware-oriented approximation of convolutional neural networks”, 2016), and further in view of Alstrom (US5857177A).

	Regarding claim 1, Courbariaux teaches A computer-implemented method of selecting a fixed point number format for representing values input to, and/or output from, ([p. 3 Sec. 4] "Fixed point formats consist in a signed mantissa and a global scaling factor shared between all fixed point variables. The scaling factor can be seen as the position of the radix point" [p. 4 Sec. 5] "The dynamic fixed point format (Williamson, 1991) is a variant of the fixed point format in which there are several scaling factors instead of a single global one")
	a plurality of layers of a Deep Neural Network (DNN) for use in configuring a hardware implementation of the DNN, the method comprising: ([p. 1 Sec. 1] "The training of deep neural networks is very often limited by hardware." [p. 4 Sec. 7] "A Maxout network is a multi-layer neural network" [p. 5 Sec. 8] "We train Maxout networks" [p. 8 Sec. 11] "We have shown that: Very low precision multipliers are sufficient for training deep neural networks.")
	receiving an instantiation of the DNN ([p. 5 Sec. 8] "we use the same hyperparameters as in this section to train Maxout networks with low precision multiplications").
	configured to represent the values of each of the plurality of layers using one or more initial fixed point number formats for that layer, ([p. 4 Sec. 5] "In practice, we associate each layer’s weights, bias, weighted sum, outputs (post-nonlinearity) and the respective gradients vectors and matrices with a different scaling factor. Those scaling factors are initialized with a global value")
	each initial fixed point number format comprising an exponent ([p. 3 Sec. 4] "fixed point format can also be seen as a floating point format with a unique shared fixed exponent")
	and a mantissa bit length; ([p. 3 Sec. 4] "Fixed point formats consist in a signed mantissa and a global scaling factor shared between all fixed point variables").
	forming a plurality of disjoint subsets from the plurality of layers; ([p. 4 Sec. 5] "With dynamic fixed point, a few grouped variables share a scaling factor which is updated from time to time to reflect the statistics of values in the group." Grouped variables sharing a scaling factor are interpreted as synonymous with a disjoint subset.  Sec. 5 explicitly teaches that the grouped variables are representative of the layers' parameters.)
	for each subset of the plurality of subsets, (See Algorithm 2 in Sec. 5. Scaling factors are handled individually.)
	iteratively adjusting the fixed point number formats for the layers ([p. 8 Sec. 9.3] "We update the scaling factors once every 10000 examples")
	in the subset to fixed point number formats with a next lowest mantissa bit length ([p. 3 Sec. 4] "The scaling factor is typically a power of two for computational efficiency (the scaling multiplications are replaced with shifts)." See also Algorithm 2 where the scaling factor is reduced by half in the case of the overflow condition being satisfied.  Dividing by 2 is interpreted as synonymous with multiplying by 0.5 such that a shift to the next lowest bit length is expected.)
	until the output error of the instantiation of the DNN exceeds an error threshold; (See Algorithm 2 in Sec. 5 determining whether overflow rate of M>rmax.   Overflow rate interpreted as synonymous with error, rmax interpreted as synonymous with error threshold.)
	outputting the fixed point number formats for the plurality of layers. (See Algorithm 2: "ensure: an updated scaling factor").
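For illustration only, a scaling factor update policy of the general kind attributed to Algorithm 2 above (compare an overflow rate against rmax, then double or halve the scaling factor) can be sketched as follows. The exact conditions of Algorithm 2 are not reproduced in this action, so the overflow definition, the halving condition, and the names `overflow_rate` and `update_scale` are assumptions made for this sketch.

```python
from typing import Sequence


def overflow_rate(values: Sequence[float], scale: float, bits: int) -> float:
    """Fraction of values whose magnitude exceeds the largest representable magnitude.

    Assumption: a signed mantissa of `bits` bits, so the largest magnitude is
    scale * (2**(bits - 1) - 1).
    """
    limit = scale * (2 ** (bits - 1) - 1)
    return sum(1 for v in values if abs(v) > limit) / len(values)


def update_scale(values: Sequence[float], scale: float, bits: int, r_max: float) -> float:
    """Hypothetical policy: widen the range on overflow, otherwise try to narrow it."""
    if overflow_rate(values, scale, bits) > r_max:
        return scale * 2.0  # too many overflows: double the scaling factor
    if overflow_rate([2.0 * v for v in values], scale, bits) <= r_max:
        return scale / 2.0  # headroom to spare: halve the scaling factor
    return scale  # otherwise leave the scaling factor unchanged
```

Under these assumptions, the overflow rate plays the role of the measured quantity and `r_max` plays the role of the threshold against which it is compared.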
	However, Courbariaux does not explicitly teach in response to determining that the subsets comprise greater than a lower threshold number of layers, forming a higher number of disjoint subsets than the plurality of disjoint subsets from the plurality of layers and repeating the iterative adjusting 
	in response to determining that the subsets comprise less than or equal to the lower threshold number of layers.  

Gysel, in the same field of endeavor, teaches in response to determining that the subsets comprise greater than a lower threshold number of layers, forming a higher number of disjoint subsets than the plurality of disjoint subsets from the plurality of layers and repeating the iterative adjusting ([p. 33 §5.2] "Since the intermediate values in a network have different ranges, it is desirable to group fixed point numbers into groups with constant FL. So the number of bits allocated to the fractional part is constant within that group, but different compared to other groups. Each network layer is split into two groups: one for the layer outputs, one for the layer weights. This allows to better cover the dynamic range of both layer outputs and weights, as weights are normally significantly smaller" Gysel explicitly teaches that there are two disjoint subsets for each layer of the neural network, such that the lower threshold is interpreted as 1 with respect to the instant specification ([¶0077] "In some cases, the lower threshold (LTh) is set to 1 so that eventually an attempt is made to reduce the mantissa bit length of the fixed point number formats of each layer individually").  For this reason, in view of the disclosure of Gysel who teaches that subsets are formed with respect to the number of layers, using a threshold number of layers per subset would lead to obvious and expected outcomes.)
	in response to determining that the subsets comprise less than or equal to the lower threshold number of layers, ([p. 33 §5.2] "Since the intermediate values in a network have different ranges, it is desirable to group fixed point numbers into groups with constant FL. So the number of bits allocated to the fractional part is constant within that group, but different compared to other groups. Each network layer is split into two groups: one for the layer outputs, one for the layer weights. This allows to better cover the dynamic range of both layer outputs and weights, as weights are normally significantly smaller" Gysel explicitly teaches that there are two disjoint subsets for each layer of the neural network, such that the lower threshold is interpreted as 1 with respect to the instant specification ([¶0077] "In some cases, the lower threshold (LTh) is set to 1 so that eventually an attempt is made to reduce the mantissa bit length of the fixed point number formats of each layer individually").  For this reason, in view of the disclosure of Gysel who teaches that subsets are formed with respect to the number of layers, using a threshold number of layers per subset would lead to obvious and expected outcomes.). 

	Courbariaux and Gysel are both directed towards using a different scaling factor for dynamic fixed point representations of subsets of a neural network.  Therefore, Courbariaux and Gysel are analogous art in the same field of endeavor. It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Courbariaux with the teachings of Gysel by explicitly splitting the network into subsets based on the layers and/or parameters. Gysel teaches that the subsets may include all of the output activations for the entire network such that a single subset represents all of the layers of the neural network, which would be the same as using a single floating point representation for the entire network, which would be obvious to one of ordinary skill in the art.  Gysel also teaches splitting each layer up into multiple subsets with dynamic fixed point representations for each subset.  Gysel explicitly teaches as a motivation for combination with Courbariaux ([p. 31 §5.1] "The different parts of a CNN have a significant dynamic range. In large layers, the outputs are the result of thousands of accumulations, thus the network parameters are much smaller than the layer outputs. Fixed point has only limited capability to cover a wide dynamic range. Dynamic fixed point can be a good solution to overcome this problem, as shown by Courbariaux et al. (2014)").  While Gysel does not explicitly teach splitting neural network layers based on a threshold, it would be obvious to one of ordinary skill in the art that the layers of the network could be split in any number of ways, and using a threshold number of layers per subset would lead to obvious and expected outcomes.  This obviousness is further reinforced by Alstrom, who, in the same field of endeavor, teaches splitting neural networks into subsets based on a threshold ([Col. 3 l. 11-25] “The threshold value can hereby be controlled so that the number of firing neurons approaches a number of the same order as the number of layers. This means that the number of firing neurons will be small with respect to the total number of neurons of the network, but will be slightly larger than the number of neuron layers in the network.”).  This motivation for combination also applies to the remaining claims depending on this combination.
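For illustration only, the claimed step of forming a higher number of disjoint subsets while the subsets comprise more than a lower threshold number of layers can be sketched as follows. The contiguous, near-equal split and the names `form_subsets` and `refine_subsets` are assumptions made for this sketch; doubling the subset count follows the "twice as many" alternative recited in claim 14, and none of this is taken from Courbariaux, Gysel, or Alstrom.

```python
from typing import Iterator, List


def form_subsets(layers: List[str], num_subsets: int) -> List[List[str]]:
    """Split an ordered list of layers into disjoint, contiguous subsets."""
    num_subsets = max(1, min(num_subsets, len(layers)))
    base, extra = divmod(len(layers), num_subsets)
    subsets, start = [], 0
    for i in range(num_subsets):
        size = base + (1 if i < extra else 0)
        subsets.append(layers[start:start + size])
        start += size
    return subsets


def refine_subsets(
    layers: List[str], initial_subsets: int, lower_threshold: int
) -> Iterator[List[List[str]]]:
    """Yield successive partitions, doubling the subset count until no subset
    holds more than lower_threshold layers."""
    if initial_subsets < 1 or lower_threshold < 1:
        raise ValueError("initial_subsets and lower_threshold must be at least 1")
    count = initial_subsets
    while True:
        subsets = form_subsets(layers, count)
        yield subsets  # the iterative format adjustment would run on these subsets
        if max(len(s) for s in subsets) <= lower_threshold:
            break
        count *= 2
```

With eight layers, two initial subsets, and a lower threshold of 1, this sketch yields partitions of two, four, and then eight subsets, the last containing one layer each.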

	Regarding claim 2, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 1, wherein iteratively adjusting the fixed point number formats for the layers in the subset to fixed point number formats with the next lowest mantissa bit length comprises: determining a fixed point number format with the next lowest mantissa bit length for the fixed point number formats for each layer of the subset; (Courbariaux [p. 3 Sec. 4] "The scaling factor is typically a power of two for computational efficiency (the scaling multiplications are replaced with shifts)." See also Algorithm 2 where the scaling factor is reduced by half in the case of the overflow condition being satisfied.  Dividing by 2 is interpreted as synonymous with multiplying by 0.5 such that a shift to the next lowest bit length is expected.)
	adjusting the fixed point number formats used by the instantiation of the DNN for each layer in the subset to the determined fixed point number formats with the next lowest mantissa bit length; (Courbariaux See Algorithm 1 and 2.  Algorithm 1 shows quantizing all parameters of each layer, and algorithm 2 shows that the quantization involves bit shifting.)
	determining an output of the adjusted instantiation of the DNN in response to test input data; (Courbariaux [p. 5 Sec. 8 Table 4] "Test set error rates of single and half floating point formats, fixed and dynamic fixed point formats on the permutation invariant (PI) MNIST, MNIST (with convolutions, no distortions), CIFAR-10 and SVHN datasets.")
	determining an output error of the adjusted instantiation of the DNN; (Courbariaux [p. 5 Sec. 8 Table 4] "Test set error rates of single and half floating point formats, fixed and dynamic fixed point formats on the permutation invariant (PI) MNIST, MNIST (with convolutions, no distortions), CIFAR-10 and SVHN datasets.")
	in response to determining that the output error exceeds the error threshold, reversing the adjustment of the instantiation of the DNN; and (Courbariaux See Algorithm 2.  Checking if the overflow rate is greater than rmax and doubling the scaling factor is interpreted as synonymous with reversing the adjustment (halving the scaling factor).)
	in response to determining that the output error does not exceed the error threshold, repeating the determining the fixed point number formats, adjusting the fixed point number formats, (Courbariaux [p. 4 Sec. 5] "During the training, we update those scaling factors at a given frequency, following the policy described in Algorithm 2." Algorithm 2 shows that the determination and adjustment of the fixed point number format occurs after comparing error to threshold. Determination / adjustment during update interpreted as synonymous with repeating the determination.)
	determining the output, and determining the output error. (Courbariaux [p. 5 Sec. 8 Table 4] "Test set error rates of single and half floating point formats, fixed and dynamic fixed point formats on the permutation invariant (PI) MNIST, MNIST (with convolutions, no distortions), CIFAR-10 and SVHN datasets." Courbariaux explicitly teaches determining final test error which is temporally subsequent to updating and therefore interpreted as synonymous with in response to.  Algorithm 2 also teaches using update error to determine bit width.). 
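For illustration only, the loop recited in claim 2 (step to the next lowest mantissa bit length, determine the output error, and reverse the step once the error exceeds the threshold) can be sketched as follows. The name `reduce_mantissa_bits`, the one-bit step size, the `min_bits` floor, and the caller-supplied `output_error` function are assumptions made for this sketch and are not drawn from the cited references or the instant specification.

```python
from typing import Callable, Dict, Iterable


def reduce_mantissa_bits(
    bit_lengths: Dict[str, int],
    subset: Iterable[str],
    output_error: Callable[[Dict[str, int]], float],
    error_threshold: float,
    min_bits: int = 1,
) -> Dict[str, int]:
    """Lower the mantissa bit length of every layer in subset one step at a time,
    discarding the last step once the output error exceeds error_threshold."""
    formats = dict(bit_lengths)
    subset = list(subset)
    while subset and all(formats[name] > min_bits for name in subset):
        trial = dict(formats)
        for name in subset:
            trial[name] -= 1  # assumed step to the next lowest mantissa bit length
        if output_error(trial) > error_threshold:
            break  # reverse the adjustment by keeping the previous formats
        formats = trial
    return formats
```

In this sketch, `output_error` stands in for running test input data through the adjusted instantiation and comparing its output against a baseline.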

	Regarding claim 5, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 1, wherein a first adjustment of the fixed point number formats is made for all of the subsets before a second adjustment of the fixed point number formats is made for any of the subsets. (Courbariaux [p. 4 Sec. 5] "With dynamic fixed point, a few grouped variables share a scaling factor which is updated from time to time to reflect the statistics of values in the group...During the training, we update those scaling factors at a given frequency, following the policy described in Algorithm 2." Courbariaux explicitly teaches updating all of the scaling factors at a given frequency.). 

	Regarding claim 7, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 1, wherein there is an initial fixed point number format for input data values of at least one layer of the plurality of layers and there is an initial fixed point number format for weights of at least one layer of the plurality of layers, and iteratively adjusting the fixed point number formats for the layers in the subset to fixed point number formats with the next lowest mantissa bit length until the output error of the instantiation of the DNN exceeds the error threshold comprises: (Courbariaux [p. 4 Sec. 5] "In practice, we associate each layer’s weights, bias, weighted sum, outputs (post-nonlinearity) and the respective gradients vectors and matrices with a different scaling factor. Those scaling factors are initialized with a global value" See also Algorithm 1 where the inputs and parameters are reduced from an initial fixed point number format. See Algorithm 2 for the bit length being reduced in response to the error threshold.)
	iteratively adjusting the fixed point number formats for the input data values for the layers in the subset to fixed point number formats with the next lowest mantissa bit length until the output error of the instantiation of the DNN exceeds the error threshold; and (Courbariaux See Algorithm 1 where the inputs and parameters are reduced from an initial fixed point number format. See Algorithm 2 for the bit length being reduced in response to the error threshold.)
	subsequent to iteratively adjusting the fixed point number formats for the input data values, iteratively adjusting the fixed point number formats for the weights for the layers in the subset to fixed point number formats with the next lowest mantissa bit length until the output error of the instantiation of the DNN exceeds the error threshold. (Courbariaux See Algorithm 1 where the precision of the weighted sums is reduced subsequent to the inputs. See Algorithm 2 for the bit length being reduced in response to the error threshold.). 

	Regarding claim 8, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 7, wherein there is an initial fixed point number format for output data values of at least one layer of the plurality of layers, and iteratively adjusting the fixed point number formats for the layers in the subset to a fixed point number format with the next lowest mantissa bit length until the output error of the instantiation of the DNN exceeds the error threshold further comprises: (Courbariaux [p. 4 Sec. 5] "In practice, we associate each layer’s weights, bias, weighted sum, outputs (post-nonlinearity) and the respective gradients vectors and matrices with a different scaling factor. Those scaling factors are initialized with a global value" See Algorithm 2 for the bit length being reduced in response to the error threshold.)
	subsequent to iteratively adjusting the fixed point number formats for the input data values, iteratively adjusting the fixed point number formats for the output data values for the layers in the subset to fixed point number formats with the next lowest mantissa bit length until the output error of the instantiation of the DNN exceeds the error threshold. (Courbariaux See Algorithm 1, precision of outputs is reduced at the end of the algorithm subsequent to inputs, parameters, and weights. See Algorithm 2 for the bit length being reduced in response to the error threshold.). 

	Regarding claim 11, the combination of Courbariaux, Gysel, Alstrom, and He teaches The method of claim 10, further comprising generating the baseline output by applying the test input data to an instantiation of the DNN configured to represent values input to and output from each layer of the DNN using a floating point number format. (Courbariaux [p. 5 Sec. 8 Table 4] "Test set error rates of single and half floating point formats, fixed and dynamic fixed point formats on the permutation invariant (PI) MNIST, MNIST (with convolutions, no distortions), CIFAR-10 and SVHN datasets...It serves as a baseline to evaluate the degradation brought by lower precision"). 

	Regarding claim 13, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 1, wherein the lower threshold number of layers is greater than one. (Gysel [pp. 32-33 §5.1] "The layer activations are multiplied with the network weights, and these multiplication results are accumulated to form the output. As shown by Lin et al. (2015); Qiu et al. (2016), it is a good approach to use mixed precision, i.e., different parts of a CNN use different bit-widths. In Figure 5.1, m and n refer to the number of bits used to represent layer outputs and layer weights, respectively." Gysel teaches that, according to Lin and Qiu, all layer outputs may use the same dynamic fixed point precision, which is interpreted as the threshold number of layers being equal to the number of layers in the neural network, which Gysel explicitly teaches as being greater than 1.). 

	Regarding claim 14, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 1, wherein forming a higher number of disjoint subsets from the plurality of layers comprises: dividing the layers in each subset into a plurality of disjoint subsets and/or forming twice as many disjoint subsets from the plurality of layers. (Gysel [p. 33 §5.2] "Since the intermediate values in a network have different ranges, it is desirable to group fixed point numbers into groups with constant FL. So the number of bits allocated to the fractional part is constant within that group, but different compared to other groups. Each network layer is split into two groups: one for the layer outputs, one for the layer weights. This allows to better cover the dynamic range of both layer outputs and weights, as weights are normally significantly smaller"). 

	Regarding claim 15, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 1, wherein the values input to and/or output from the plurality of layers comprise one or more of input data values, output data values, weights and biases. (Courbariaux [p. 4 Sec. 5] "In practice, we associate each layer’s weights, bias, weighted sum, outputs (post-nonlinearity) and the respective gradients vectors and matrices with a different scaling factor"). 

	Regarding claim 16, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 1, wherein the DNN is a convolutional neural network. (Courbariaux [p. 4 Sec. 7] "which corresponds to a maxout unit when k = 2 and one of the filters is forced at 0 (Goodfellow et al., 2013a). Combined with dropout, a very effective regularization method (Hinton et al., 2012), maxout networks achieved state-of-the-art results on a number of benchmarks (Goodfellow et al., 2013a), both as part of fully connected feedforward deep nets and as part of deep convolutional nets" [p. 5 Sec. 8.1] "The second model consists in three convolutional maxout hidden layers...This is the same procedure as in Goodfellow et al. (2013a), except that we do not train our model on the validation examples"). 

	Regarding claim 17, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 1, further comprising configuring a hardware implementation of the DNN to represent values of at least one of the plurality of layers using a fixed point number format output for the at least one layer. (Courbariaux [p. 5 Sec. 8] "Table 4: Test set error rates of single and half floating point formats, fixed and dynamic fixed point formats on the permutation invariant (PI) MNIST, MNIST (with convolutions, no distortions), CIFAR-10 and SVHN datasets" Table 4 shows that at least one of the DNN implementations used full fixed point number format for the model.). 

	Regarding claim 18, the combination of Courbariaux, Gysel, and Alstrom teaches A non-transitory computer readable storage medium having stored thereon computer readable instructions that, when executed at a computer system, cause the computer system to perform the method as set forth in claim 1. (Courbariaux [p. 9 Sec. 12] "We thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012), a Python library which allowed us to easily develop a fast and optimized code for GPU.  We also thank the developers of Pylearn2 (Goodfellow et al., 2013b), a Python library built on the top of Theano which allowed us to easily interface the datasets with our Theano code" Courbariaux explicitly teaches that the instructions are run on a GPU.). 

	Regarding claim 19, the combination of Courbariaux, Gysel, and Alstrom teaches A hardware implementation of a Deep Neural Network (DNN) comprising: hardware logic configured to: (Courbariaux [p. 1 Sec. 1] "The training of deep neural networks is very often limited by hardware." [p. 4 Sec. 7] "A Maxout network is a multi-layer neural network" [p. 5 Sec. 8] "We train Maxout networks" [p. 8 Sec. 11] "We have shown that: Very low precision multipliers are sufficient for training deep neural networks.")
	receive input data values, a set of weights or a set of biases for a layer of the DNN; (Courbariaux [p. 4 Sec. 7] "A Maxout network is a multi-layer neural network that uses maxout units in its hidden layers. A maxout unit outputs the maximum of a set of k dot products between k weight vectors and the input vector of the unit")
	receive information indicating a fixed point number format for the input data values, the set of weights, or the set of biases of the layer, the fixed point number format for the input data values, the set of weights, or the set of biases of the layer having been selected in accordance with the method as set forth in claim 1; (Courbariaux [p. 4 Sec. 5] "With dynamic fixed point, a few grouped variables share a scaling factor which is updated from time to time to reflect the statistics of values in the group. In practice, we associate each layer’s weights, bias, weighted sum, outputs (post-nonlinearity) and the respective gradients vectors and matrices with a different scaling factor. Those scaling factors are initialized with a global value" [p. 4 Sec. 7] "where h^l is the vector of activations at layer l and weight vectors w^l_{i,j} and biases b^l_{i,j} are the parameters of the j-th filter of unit i on layer l")
	interpret the input data values, the set of weights or the set of biases based on the fixed point number format for the input data values, the set of weights or the set of biases of the layer; and (Courbariaux [p. 4 Sec. 5] "With dynamic fixed point, a few grouped variables share a scaling factor which is updated from time to time to reflect the statistics of values in the group. In practice, we associate each layer’s weights, bias, weighted sum, outputs (post-nonlinearity) and the respective gradients vectors and matrices with a different scaling factor. Those scaling factors are initialized with a global value")
	process the interpreted input data values, the set of weights or the set of biases in accordance with the layer to generate output data values for the layer. (Courbariaux [p. 4 Sec. 7] "A Maxout network is a multi-layer neural network that uses maxout units in its hidden layers. A maxout unit outputs the maximum of a set of k dot products between k weight vectors and the input vector of the unit (e.g., the output of the previous layer):"). 
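
For illustration only, the maxout unit described in the passages of Courbariaux quoted above (the maximum of k dot products between k weight vectors and the input vector, with a bias per filter) can be sketched as follows. This is a minimal sketch and is not code from Courbariaux; the function name `maxout_unit` and the example values are assumptions made for this illustration.

```python
def maxout_unit(inputs, weight_vectors, biases):
    """Return the maximum over k filters of (dot(w_j, inputs) + b_j).

    Hypothetical helper: one weight vector and one bias per filter.
    """
    activations = []
    for weights, bias in zip(weight_vectors, biases):
        dot = sum(w * x for w, x in zip(weights, inputs))
        activations.append(dot + bias)
    return max(activations)


# Example with k = 3 filters on a two-element input vector.
example_output = maxout_unit(
    inputs=[1.0, 2.0],
    weight_vectors=[[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],
    biases=[0.0, 0.5, 0.0],
)
```

In this example the three filters produce 1.0, 2.5 and -3.0, so the unit outputs 2.5.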

	Regarding claim 20, the combination of Courbariaux, Gysel, and Alstrom teaches The hardware implementation of a DNN of claim 19, wherein the hardware logic is further configured to: receive information indicating a fixed point number format for the output data values of the layer, the fixed point number format for the output data values of the layer having been selected in accordance with the method as set forth in claim 1;  (Courbariaux [p. 4 Sec. 5] "With dynamic fixed point, a few grouped variables share a scaling factor which is updated from time to time to reflect the statistics of values in the group. In practice, we associate each layer’s weights, bias, weighted sum, outputs (post-nonlinearity) and the respective gradients vectors and matrices with a different scaling factor. Those scaling factors are initialized with a global value").
	convert the output data values for the layer into the fixed point number format for the output data values of the layer. (See Courbariaux, Algorithm 2, reducing the precision of the output data.). 
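
For illustration only, converting a layer's output data values into a fixed point number format with a power-of-two scaling factor can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Courbariaux or from the claims: the helper names `to_fixed_point` and `from_fixed_point`, the 8-bit signed width, round-to-nearest, and saturation on overflow are all assumptions.

```python
def to_fixed_point(value, frac_bits, total_bits=8):
    """Quantise a real value to a signed integer code.

    Assumed format: a two's-complement integer of `total_bits` bits that
    represents code * 2**(-frac_bits). Out-of-range values saturate.
    """
    scale = 2 ** frac_bits
    lowest = -(2 ** (total_bits - 1))
    highest = 2 ** (total_bits - 1) - 1
    code = round(value * scale)
    return max(lowest, min(highest, code))


def from_fixed_point(code, frac_bits):
    """Recover the real value represented by an integer code."""
    return code / (2 ** frac_bits)


# Example: four output values stored with 4 fractional bits.
outputs = [0.7, -1.3, 3.99, 100.0]
codes = [to_fixed_point(v, frac_bits=4) for v in outputs]
restored = [from_fixed_point(c, frac_bits=4) for c in codes]
```

With 4 fractional bits the representable range is -8.0 to 7.9375 in steps of 0.0625, so the value 100.0 saturates to 7.9375. Using one fewer fractional bit would double the range and halve the precision.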

	Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Courbariaux, Gysel, and Alstrom and in further view of Young (US 2018/0165574 A1).

	Regarding claim 3, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 1, further comprising identifying a sequence of the plurality of layers wherein each layer is preceded in the sequence by any layer of the plurality of layers on which it depends, (Courbariaux [p. 4 Sec. 7] "A Maxout network is a multi-layer neural network that uses maxout units in its hidden layers.  A maxout unit outputs the maximum of a set of k dot products between k weight vectors and the input vector of the unit (e.g., the output of the previous layer):").
	However, the combination of Courbariaux, Gysel, and Alstrom does not explicitly teach and wherein each of the subsets comprises a contiguous set of layers in the sequence.  

Young, in the same field of endeavor, teaches and wherein each of the subsets comprises a contiguous set of layers in the sequence. ([¶0024] "Some neural networks pool outputs from one or more neural network layers to generate pooled values that are used as inputs to subsequent neural network layers" A subset is interpreted as synonymous with pooled layers.). 

Courbariaux, Gysel, Alstrom, and Young are all directed towards hardware neural networks.  Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the hardware neural network implementations of the combination of Courbariaux, Gysel, and Alstrom with that of Young by pooling the outputs of multiple layers. The combination would have been obvious because a person of ordinary skill in the art would have been able to determine that grouping layers in sequential order is obvious.  Young explains the intrinsic value of the pooling layers disclosed ([¶0009] “Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. An output tensor corresponding to an average pooling neural network layer can be generated in hardware by a special-purpose hardware circuit, even where the hardware circuit cannot directly process an input tensor to perform average pooling”).
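
For illustration only, a sequence in which each layer is preceded by every layer on which it depends, divided into contiguous subsets, can be sketched as follows. This is a minimal sketch and is not drawn from any cited reference: the layer names, the dependency graph, and the subset size of two are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each layer maps to the layers it depends on.
dependencies = {
    "conv1": [],
    "conv2": ["conv1"],
    "add": ["conv1", "conv2"],  # e.g. a skip connection
    "fc": ["add"],
}

# A topological order places every layer after all layers it depends on.
sequence = list(TopologicalSorter(dependencies).static_order())


def contiguous_subsets(ordered_layers, size):
    """Split an ordered list of layers into disjoint contiguous runs."""
    return [ordered_layers[i:i + size]
            for i in range(0, len(ordered_layers), size)]


subsets = contiguous_subsets(sequence, size=2)
```

For this graph the only valid order is conv1, conv2, add, fc, and the two subsets are the first two and the last two layers of that order.
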
	Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Courbariaux, Gysel, and Alstrom and in further view of Botros (“Hardware Implementation of an Artificial Neural Network”, 1993).

	Regarding claim 4, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 1.
	However, the combination of Courbariaux, Gysel, and Alstrom does not explicitly teach the plurality of layers from which the disjoint subsets are formed do not include a first layer of the DNN and/or a last layer of the DNN.  

Botros, in the same field of endeavor, teaches The method of claim 1, wherein the plurality of layers from which the disjoint subsets are formed do not include a first layer of the DNN and/or a last layer of the DNN. ([p. 1253] "The input layer does no processing but simply buffers the data" Botros teaches a hardware implementation of an artificial neural network and explains that the first layer is not necessary for processing.  Recreating a DNN without explicitly copying a first layer would therefore lead to an expected and obvious outcome.). 

	Courbariaux, Gysel, Alstrom, and Botros are all directed to a hardware implementation of a neural network.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to not use the first layer in the neural network disclosed in Courbariaux and Alstrom for processing. Botros teaches that the first layer is not necessary for processing.  One of ordinary skill in the art would recognize from FIG. 3 of Botros that the input layer can be streamlined in a hardware implementation to simply an input buffer and would not necessarily need to be included in the processing aspect of a hardware neural network implementation.

	Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Courbariaux, Gysel, and Alstrom and in further view of Chung (US10167800B1).

	Regarding claim 6, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 1, wherein all iterative adjustments of the fixed point number formats for the layers in a first subset (See Algorithm 2.  Group scaling factor update is interpreted as synonymous with iterative adjustment of the fixed point number format for the layers in a first subset.).
	However, the combination of Courbariaux, Gysel, and Alstrom does not explicitly teach are completed before a first adjustment of the fixed point number formats for the layers in a second subset.  

Chung, in the same field of endeavor, teaches all iterative adjustments of the fixed point number formats for the layers in a first subset are completed before a first adjustment of the fixed point number formats for the layers in a second subset. ("The method may further include a step (e.g., step 920) including first processing a first subset of the training vector data to determine a first shared exponent for representing values in the first subset of the training vector data in a block-floating point format and second processing a second subset of the training vector data to determine a second shared exponent for representing values in the second subset" While Courbariaux implicitly teaches that the groups are updated in a sequence, Chung explicitly teaches updating the second subset subsequent to the first.). 

	Courbariaux, Gysel, Alstrom, and Chung are all directed to hardware neural networks.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the exponent in the combination of Courbariaux, Gysel, and Alstrom, with that of Chung. The combination would have been obvious because a person of ordinary skill in the art would be able to determine the advantages of doing so from Chung ([Col. 5 l. 39] “The matrix-vector multiplier may use integer arithmetic, however, in the form of block floating point techniques for expanded dynamic range. This may advantageously result in a processor that communicates with the outside world in floating point and transparently implements internal integer arithmetic when necessary.” [Col. 5 l. 63] “Advantageously, individual members of the block can be operated on with integer arithmetic. Moreover, the shared exponent for each block is determined independently, which may advantageously allow for a higher dynamic range.”).
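
For illustration only, determining a shared exponent independently for each of two subsets of values, as in the block floating point passages of Chung quoted above, can be sketched as follows. This is a minimal sketch under stated assumptions and is not Chung's implementation: the helper names, the 8-bit signed mantissa, and the policy of choosing the exponent from the largest magnitude in the block are assumptions.

```python
import math


def shared_exponent(block):
    """Smallest exponent e such that every |v| in the block is below 2**e."""
    largest = max(abs(v) for v in block)
    if largest == 0:
        return 0
    # frexp returns (m, e) with largest == m * 2**e and 0.5 <= m < 1.
    return math.frexp(largest)[1]


def to_block_floating_point(block, mantissa_bits=8):
    """Return (shared exponent, integer mantissas) for a block of values."""
    exponent = shared_exponent(block)
    step = 2.0 ** (exponent - (mantissa_bits - 1))  # value of one mantissa LSB
    highest = 2 ** (mantissa_bits - 1) - 1
    lowest = -(2 ** (mantissa_bits - 1))
    mantissas = [max(lowest, min(highest, round(v / step))) for v in block]
    return exponent, mantissas


# The first subset is processed completely before the second is started.
first = to_block_floating_point([0.5, -0.25, 0.75])
second = to_block_floating_point([12.0, 3.0, -6.0])
```

Each subset receives its own exponent (0 and 4 in this example), so small-valued and large-valued blocks both use the full mantissa range while their members remain plain integers.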

	Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Courbariaux, Gysel, and Alstrom and in further view of El-Yaniv 

	Regarding claim 9, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 1.
	However, the combination of Courbariaux, Gysel, and Alstrom does not explicitly teach the DNN is a classification network and the output error is a top-1 classification accuracy or a top-5 classification accuracy of an output of the instantiation of the DNN in response to test input data.  

El-Yaniv, in the same field of endeavor, teaches The method of claim 1, wherein the DNN is a classification network and the output error is a top-1 classification accuracy or a top-5 classification accuracy of an output of the instantiation of the DNN in response to test input data. ([¶0105] "The above described training method was applied to tackle the task of binarizing both weights and activations by employing the AlexNet and GoogleNet architectures. This implementation achieved 36.1% top-1 and 60.1% top-5 accuracies using AlexNet and 47.1% top-1 and 69.1% top-5 accuracies using GoogleNet."). 

The combination of Courbariaux, Gysel, and Alstrom, as well as El-Yaniv are both directed towards quantizing a neural network.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to determine a top-1 and/or top-5 accuracy in the quantized neural network suggested by the combination of Courbariaux, Gysel, and Alstrom. El-Yaniv teaches that top-1 and top-5 accuracies are well-known metrics in the art for determining neural network prediction performance.  Furthermore, El-Yaniv shows that said accuracies can be used in a quantized neural network for determining prediction performance and that using top-1 and top-5 accuracies leads to an obvious and expected outcome.
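
For illustration only, top-1 and top-5 classification accuracy can be computed as sketched below. This is a minimal sketch using the conventional definition of top-k accuracy and is not code from El-Yaniv; the function name `top_k_accuracy` and the example scores and labels are hypothetical.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest score to lowest.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical outputs of a six-class network for four test inputs.
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],
    [0.05, 0.10, 0.10, 0.15, 0.20, 0.40],
    [0.60, 0.10, 0.10, 0.10, 0.05, 0.05],
]
labels = [1, 2, 0, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top5 = top_k_accuracy(scores, labels, k=5)
```

Here two of the four samples have the true label ranked first and three have it ranked within the first five, giving a top-1 accuracy of 0.5 and a top-5 accuracy of 0.75.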

	Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Courbariaux, Gysel, and Alstrom and in further view of He (US20180101766A1).

	Regarding claim 10, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 1.
	However, the combination of Courbariaux, Gysel, and Alstrom does not explicitly teach wherein the output error is a sum of L1 differences between SoftMax normalised logits of an output of the instantiation of the DNN in response to test input data and SoftMax normalised logits of a baseline output.  

He, in the same field of endeavor, teaches The method of claim 1, wherein the output error is a sum of L1 differences between SoftMax normalised logits of an output of the instantiation of the DNN in response to test input data and SoftMax normalised logits of a baseline output. ([¶0038] "a Softmax classifier may be used. For predicting real-valued quantities, the loss function may use regression-based methods. For example, in one embodiment, the loss function measures the loss between the predicted quantity and the ground truth before measuring the L2 squared norm, or L1 norm of the difference"). 

The combination of Courbariaux, Gysel, and Alstrom, as well as He are both directed towards training a deep neural network.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the softmax layer taught in Courbariaux with the detailed method taught in He. The combination would have been obvious because a person of ordinary skill in the art would be able to determine that a softmax layer is typically used to predict a probability distribution, differences of which are typically calculated using an L1 norm.  He shows that the usage of L1 norm differences from the softmax layer would lead to obvious and expected results in a deep neural network (He [¶0038]).  This motivation for combination also applies to the remaining claims depending on this combination. 
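
For illustration only, a sum of L1 differences between SoftMax normalised logits of a test output and SoftMax normalised logits of a baseline output can be sketched as follows. This is a minimal sketch of the quantity as worded in claim 10 and is not code from He or from the application; the function names and example logits are hypothetical.

```python
import math


def softmax(logits):
    """Normalise a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_l1_error(test_logits, baseline_logits):
    """Sum, over all samples, of the L1 distance between softmax outputs."""
    error = 0.0
    for test_row, base_row in zip(test_logits, baseline_logits):
        test_probs = softmax(test_row)
        base_probs = softmax(base_row)
        error += sum(abs(p - q) for p, q in zip(test_probs, base_probs))
    return error


# One sample: softmax gives [0.5, 0.5] for the test output and
# [0.75, 0.25] for the baseline, so the L1 difference is 0.5.
example_error = softmax_l1_error([[0.0, 0.0]], [[math.log(3.0), 0.0]])
```

An error of zero means the test output reproduces the baseline distribution exactly; larger values indicate greater divergence from the baseline.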

	Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Courbariaux, Gysel, and Alstrom and in further view of Shynk ("Performance Surfaces of a Single-Layer Perceptron", 1990).

	Regarding claim 12, the combination of Courbariaux, Gysel, and Alstrom teaches The method of claim 1.
	However, the combination of Courbariaux, Gysel, and Alstrom does not explicitly teach wherein the lower threshold number of layers is one.  

Shynk, in the same field of endeavor, teaches The method of claim 1, wherein the lower threshold number of layers is one. ([p. 1 Sec. 1] "The perceptron is a linear combiner that quantizes its output to one of two discrete values…For a single layer perceptron..." Shynk explicitly teaches a single layer perceptron.  Alstrom teaches the threshold reaching the number of layers; therefore, upon substitution of the multilayer perceptron in Alstrom with the single-layer perceptron in Shynk, it would be obvious that the threshold number of layers would be one.). 

The combination of Courbariaux, Gysel, and Alstrom, as well as Shynk are both directed towards quantizing neural networks.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to substitute the multi-layer perceptron in Courbariaux with a single-layer perceptron described in Shynk. Shynk teaches, as motivation, that a single-layer perceptron is well-known in the art and would lead to obvious and expected outcomes.  

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SB/Examiner, Art Unit 2124
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126