DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 06/24/2021 and 09/09/2021 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 2-5, 12-15 and 19 are non-provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-4, 9-12 and 14 of US Patent No. 11062206 B2 (“reference application”), respectively, in view of Ioffe et al. (“Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”).
A claim chart comparing the two claim sets is given below.
Instant Application
Reference Application 
(US Pat No. 10997509 B2)
2. A method for training a neural network having main parameters on training data to generate normalized outputs that are mappable to un-normalized outputs in accordance with a set of normalization parameters, wherein the training data comprises a sequence of training items and, for each training item in the sequence, a respective target output associated with the training item, wherein the target output is an output expected to be generated by the neural network for the training item, and wherein the method comprises, for each training item in the sequence: obtaining a target output for the training item from the training data; updating current values of the normalization parameters to account for magnitudes of target outputs for training inputs changing during training by updating the current values so that the normalized target outputs for the training items up to and including the training item in the sequence of training data have a specified distribution, wherein the normalization parameters are different from the main parameters of the neural network; determining a normalized target output for the training item by normalizing the target output for the training item in accordance with the updated normalization parameter values; processing the training item using the neural network to generate a normalized output of the neural network for the training item in accordance with current values of the main parameters of the neural network; determining an error for the training item using an objective function that takes as input (i) the normalized output generated for the training item by the neural network and (ii) the normalized target output generated by normalizing the target output associated with the training item in the training data; and using the error to adjust the current values of the main parameters of the neural network.  


1. A method for training a neural network on training data to generate normalized outputs that are mappable to un-normalized outputs in accordance with a set of normalization parameters, wherein the training data comprises a sequence of training items and, for each training item in the sequence, a respective target output associated with the training item, wherein the target output is an output expected to be generated by the neural network for the training item, and wherein the method comprises, for each training item in the sequence: obtaining a target output for the training item from the training data; updating current values of the normalization parameters to account for magnitudes of target outputs for training inputs changing during training by updating the current values so that the normalized target outputs for the training items up to and including the training item in the sequence of training data have a specified distribution; updating current values of auxiliary parameters to preserve a mapping between the un- normalized outputs and the normalized outputs despite the updating of the current values of the normalization parameters; determining a normalized target output by normalizing the target output obtained from the training data in accordance with the updated normalization parameter values; processing the training item using the neural network to generate a normalized output of the neural network for the training item in accordance with current values of main parameters of the neural network, comprising: processing the training item using the neural network in accordance with the current values of the main parameters of the neural network to generate an initial output of the neural network for the training item, and normalizing the initial output in accordance with the updated values of the auxiliary parameters to generate the normalized output; determining an error for the training item using an objective function that takes as input (i) the 
normalized output generated for the training item by the neural network and (ii) the normalized target output generated by normalizing the target output associated with the training item in the training data; and using the error to adjust the current values of the main parameters of the neural network.
3. The method of claim 2, wherein the normalization parameters comprise a shift parameter and a scale parameter of the normalization.  

2. The method of claim 1, wherein the normalization parameters comprise a shift parameter and a scale parameter of the normalization.
4. The method of claim 3, wherein determining the normalized target output comprises applying the updated values of the scale parameter and the shift parameter to the target output.  

3. The method of claim 2, wherein determining the normalized target output comprises applying the updated values of the scale parameter and the shift parameter to the target output.
5. The method of claim 2, wherein the normalized outputs generated by the neural network are mappable to un-normalized outputs in accordance with the normalization parameters and a set of auxiliary parameters.  

4. The method of claim 1, wherein the normalized outputs generated by the neural network are mappable to un-normalized outputs in accordance with the normalization parameters and the auxiliary parameters.
12. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for training a neural network on training data to generate normalized outputs that are mappable to un-normalized outputs in accordance with a set of normalization parameters, wherein the training data comprises a sequence of training items and, for each training item in the sequence, a respective target output associated with the training item, wherein the target output is an output expected to be generated by the neural network for the training item, and wherein the operations comprise, for each training item in the sequence: obtaining a target output for the training item from the training data; updating current values of the normalization parameters to account for magnitudes of target outputs for training inputs changing during training by updating the current values so that the normalized target outputs for the training items up to and including the training item in the sequence of training data have a specified distribution, wherein the normalization parameters are different from the main parameters of the neural network; determining a normalized target output for the training item by normalizing the target output for the training item in accordance with the updated normalization parameter values; processing the training item using the neural network to generate a normalized output of the neural network for the training item in accordance with current values of the main parameters of the neural network; determining an error for the training item using an objective function that takes as input (i) the normalized output generated for the training item by the neural network and (ii) the normalized target output generated by normalizing the target output associated with the training item in the training data; and using the error to adjust the current values of the 
main parameters of the neural network.
9. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for training a neural network on training data to generate normalized outputs that are mappable to un-normalized outputs in accordance with a set of normalization parameters, wherein the training data comprises a sequence of training items and, for each training item in the sequence, a respective target output associated with the training item, wherein the target output is an output expected to be generated by the neural network for the training item, and wherein the operations comprise, for each training item in the sequence: obtaining a target output for the training item from the training data; updating current values of the normalization parameters to account for magnitudes of target outputs for training inputs changing during training by updating the current values so that the normalized target outputs for the training items up to and including the training item in the sequence of training data have a specified distribution; updating current values of auxiliary parameters to preserve a mapping between the un- normalized outputs and the normalized outputs despite the updating of the current values of the normalization parameters; determining a normalized target output by normalizing the target output obtained from the training data in accordance with the updated normalization parameter values; processing the training item using the neural network to generate a normalized output for the training item in accordance with current values of main parameters of the neural network, comprising: processing the training item using the neural network in accordance with the current values of the main parameters of the neural network to generate an initial output for the training item, and normalizing the initial output in accordance with the updated values of the auxiliary 
parameters to generate the normalized output; determining an error for the training item using an objective function that takes as input (i) the normalized output generated for the training item by the neural network and (ii) the normalized target output generated by normalizing the target output associated with the training item in the training data; and using the error to adjust the current values of the main parameters of the neural network.

13. The system of claim 12, wherein the normalization parameters comprise a shift parameter and a scale parameter of the normalization.
10. The system of claim 9, wherein the normalization parameters comprise a shift parameter and a scale parameter of the normalization.
14. The system of claim 13, wherein determining the normalized target output comprises applying the updated values of the scale parameter and the shift parameter to the target output.
11. The system of claim 10, wherein determining the normalized target output comprises applying the updated values of the scale parameter and the shift parameter to the target output.
15. The system of claim 12, wherein the normalized outputs generated by the neural network are mappable to un-normalized outputs in accordance with the normalization parameters and a set of auxiliary parameters.
12. The system of claim 9, wherein the normalized outputs generated by the neural network are mappable to un-normalized outputs in accordance with the normalization parameters and the auxiliary parameters.
19. One or more non-transitory computer-readable media storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations for training a neural network on training data to generate normalized outputs that are mappable to un-normalized outputs in accordance with a set of normalization parameters, wherein the training data comprises a sequence of training items and, for each training item in the sequence, a respective target output, the operations comprising, for each training item in the sequence: updating current values of the normalization parameters to account for the target output for the training item; determining a normalized target output for the training item by normalizing the target output for the training item in accordance with the updated normalization parameter values; processing the training item using the neural network to generate a normalized output for the training item in accordance with current values of main parameters of the neural network; determining an error for the training item using the normalized target output and the normalized output; and using the error to adjust the current values of the main parameters of the neural network.
14. One or more non-transitory computer-readable media storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations for training a neural network on training data to generate normalized outputs that are mappable to un-normalized outputs in accordance with a set of normalization parameters, wherein the training data comprises a sequence of training items and, for each training item in the sequence, a respective target output associated with the training item, wherein the target output is an output expected to be generated by the neural network for the training item, and wherein the operations comprise, for each training item in the sequence: obtaining a target output for the training item from the training data; updating current values of the normalization parameters to account for magnitudes of target outputs for training inputs changing during training by updating the current values so that the normalized target outputs for the training items up to and including the training item in the sequence of training data have a specified distribution; updating current values of auxiliary parameters to preserve a mapping between the un-normalized outputs and the normalized outputs despite the updating of the current values of the normalization parameters, determining a normalized target output by normalizing the target output obtained from the training data in accordance with the updated normalization parameter values; processing the training item using the neural network to generate a normalized output of the neural network for the training item in accordance with current values of main parameters of the neural network, comprising: processing the training item using the neural network in accordance with the current values of the main parameters of the neural network to generate an initial output of the neural network for the training item, and normalizing the initial output in accordance with the updated values of 
the auxiliary parameters to generate the normalized output; determining an error for the training item using an objective function that takes as input (i) the normalized output generated for the training item by the neural network and (ii) the normalized target output generated by normalizing the target output associated with the training item in the training data; and using the error to adjust the current values of the main parameters of the neural network.


Regarding claims 2, 12 and 19
	The instant application is substantially identical to the reference application except that it recites “wherein the normalization parameters are different from the main parameters of the neural network”.
	Ioffe teaches “wherein the normalization parameters are different from the main parameters of the neural network” (Examiner notes that the normalization parameters are different from the main parameters because the normalization parameters scale and shift the normalized value; see pg. 3 right col “we introduce, for each activation x(k), a pair of parameters γ(k), β(k), which scale and shift the normalized value… These parameters are learned along with the original model parameters [corresponds to main parameters], and restore the representation power of the network.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the reference application with the teaching of Ioffe to include batch normalization that allows higher learning rates.
One of ordinary skill in the art would have been motivated to make this modification in order to eliminate the need for dropout to achieve the same accuracy and to exceed the accuracy of human raters, as disclosed by Ioffe (Abstract).

The additional limitations of claims 2-4, 10-12 and 14 of the reference application are nearly identical in language to the additional limitations of claims 3-5 and 13-15 of the instant application, respectively; claims 3-5 and 13-15 are rejected for that reason.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 2-7 and 12-20 are rejected under 35 U.S.C. 103 as being unpatentable over Ioffe et al. (“Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, hereinafter: Ioffe) in view of Graves et al. (“Towards End-to-End Speech Recognition with Recurrent Neural Networks”, hereinafter: Graves) and further in view of Wiesler et al. (“Mean-Normalized Stochastic Gradient for Large-Scale Deep Learning”, hereinafter: Wiesler).
Regarding claim 2
Ioffe teaches a method for training a neural network having main parameters on training data to generate normalized outputs (pg. 4 right col “The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference; we want the output to depend only on the input, deterministically.”) that are mappable to un-normalized outputs in accordance with a set of normalization parameters (pg. 3 right col “we introduce, for each activation x(k), a pair of parameters γ(k), β(k), which scale and shift the normalized value: y(k) = γ(k)x̂(k) + β(k). These parameters are learned along with the original model parameters, and restore the representation power of the network.”), 
wherein the training data comprises a sequence of training items and, for each training item in the sequence (abstract “Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch[corresponds to sequence of training items].”), 
a respective target output associated with the training item (pg. 3 left col “The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have the mean of zero and the variance of 1” also see algorithm 1 where the output is based on the input training data mini-batch which corresponds to training item), 
wherein the target output is an output expected to be generated by the neural network for the training item (Examiner notes that the output is the batch-normalized network for inference, N_BN^inf, which is generated from the input network N with trainable parameters [corresponds to training item]; see Algorithm 2 on pg. 4 “Algorithm 2 summarizes the procedure for training batch-normalized networks.”), 
and wherein the method comprises, for each training item in the sequence: (pg. 3 right col “In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations”)
obtaining a target output for the training item from the training data; (Examiner interprets the target output as the output of the machine learning model. Ioffe teaches obtaining the target output by training the machine learning model using training data [mini-batch]; see Algorithm 1)
…
wherein the normalization parameters are different from the main parameters of the neural network (Examiner notes that the normalization parameters are different from the main parameters because the normalization parameters scale and shift the normalized value; see pg. 3 right col “we introduce, for each activation x(k), a pair of parameters γ(k), β(k), which scale and shift the normalized value… These parameters are learned along with the original model parameters [corresponds to main parameters], and restore the representation power of the network.”);
…
determining a normalized target output for the training item by normalizing the target output for the training item in accordance with the updated normalization parameter values (pg. 8 left col “The goal of Batch Normalization is to achieve a stable distribution of activation values throughout training, and in our experiments we apply it before the nonlinearity since that is where matching the first and second moments is more likely to result in a stable distribution. On the contrary,… apply the standardization layer to the output of the nonlinearity, which results in sparser activations.” And also see pg. 2 right col “if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step.”); 
processing the training item using the neural network to generate a normalized output of the neural network for the training item in accordance with current values of the main parameters of the neural network; (pg. 2 right col “For example, consider a layer with the input u that adds the learned bias b, and normalizes the result by subtracting the mean of the activation computed over the training data: x̂ = x − E[x] where x = u + b, X = {x1…N} is the set of values of x over the training set”)
…
and using the error to adjust the current values of the main parameters of the neural network (Pg. 4 left col “During training we need to backpropagate the gradient of loss ℓ through this transformation, as well as compute the gradients with respect to the parameters of the BN transform”);
Ioffe does not teach updating current values of the normalization parameters to account for magnitudes of target outputs for training inputs changing during training by updating the current values so that the normalized target outputs for the training items up to and including the training item in the sequence of training data have a specified distribution, 
wherein the normalization parameters are different from the main parameters of the neural network; 
…
determining an error for the training item using an objective function that takes as input (i) the normalized output generated for the training item by the neural network 
and (ii) the normalized target output generated by normalizing the target output associated with the training item in the training data. 
Graves teaches determining an error for the training item using an objective function that takes as input (i) the normalized output generated for the training item by the neural network (Graves teaches an objective function, as evidenced by pg. 2 left col fifth paragraph “The basic system is enhanced by a new objective function that trains the network to directly optimise the word error rate.” The normalized outputs are generated from the training data; see pg. 2 right col section 2 “Given an input sequence x = (x1, . . . , xT ), a standard recurrent neural network (RNN) computes the hidden vector sequence h = (h1, . . . , hT ) and output vector sequence y = (y1, . . . , yT ) by iterating the following equations from t = 1 to T”)
and (ii) the normalized target output generated by normalizing the target output associated with the training item in the training data (pg. 4 left col “Given a length T input sequence x, the output vectors yt are normalised with the softmax function, then interpreted as the probability of emitting the label (or blank) with index k at time t” see pg. 2 right col section 2 “Given an input sequence x = (x1,..., xT ), a standard recurrent neural network (RNN) computes the hidden vector sequence h = (h1,..., hT ) and output vector sequence y = (y1,..., yT ) by iterating the following equations from t = 1 to T” also see pg. 6 section 6 “For both training sets, the RNN was trained with CTC, as described in Section 3, using the characters in the transcripts as the target sequences. The RNN was then retrained to minimize the expected word error rate using the method from Section 4, with five alignment samples per sequence.”). 
Ioffe and Graves are analogous art because they are both directed to data normalization.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the batch normalization of deep neural networks of Ioffe with the determining of an error for the training item using an objective function of Graves in order to train the neural network to “minimise the expectation of an arbitrary transcription loss function” as disclosed by Graves (Abstract “The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model.”).
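The softmax normalisation quoted from Graves (pg. 4 left col) can be sketched as follows. This is a minimal illustration only: the function name softmax and the subtraction of the maximum for numerical stability are assumptions added here and are not taken from Graves.

```python
import math

def softmax(outputs):
    """Normalise a vector of raw network outputs into probabilities."""
    # Subtracting the maximum is a common stability step (an assumption here);
    # it does not change the resulting probabilities.
    m = max(outputs)
    exps = [math.exp(v - m) for v in outputs]
    total = sum(exps)
    return [e / total for e in exps]

# Each entry can be read as the probability of emitting label k at one time step.
probs = softmax([1.0, 2.0, 3.0])
```

The normalised vector sums to one, which is what allows it to be interpreted as a distribution over labels.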
Ioffe in view of Graves does not teach updating current values of the normalization parameters to account for magnitudes of target outputs for training inputs changing during training by updating the current values so that the normalized target outputs for the training items up to and including the training item in the sequence of training data have a specified distribution.
Wiesler teaches updating current values of the normalization parameters to account for magnitudes of target outputs for training inputs changing during training by updating the current values so that the normalized target outputs for the training items up to (pg. 181 left col section 2.2 “Instead of explicitly normalizing the features and optimizing F, the parameters can be mapped to the normalized parameter space with φ[corresponds to magnitude], updated with the SGD rule (3), and mapped back to the original parameter space” see equation 10)
and including the training item in the sequence of training data have a specified distribution, (pg. 181 left col section 2.2 first paragraph “It can be concluded, that convergence speed is improved when mean and variance of the input features are normalized. For neural networks, only the input to the lowest layer can be normalized directly, because the input to the other layers changes dynamically during training. The idea of our proposed algorithm is to perform a mean normalization step on model side instead of explicitly normalizing the features. Running averages of the activations are used for the required mean statistics”)
Ioffe, Graves and Wiesler are analogous art because they are all directed to neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ioffe in view of Graves with the updating of values of the normalization parameters to account for magnitudes of target outputs using the mean-normalized stochastic gradient descent of Wiesler in order to optimize training using a second order stochastic optimization algorithm that improves generalization performance, as disclosed by Wiesler (pg. 1 right col section 2 “we describe our proposed optimization algorithm, prove its convergence, and describe methods to improve generalization performance.”).
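For illustration, the per-item steps recited in claim 2 can be sketched as below. This is a minimal sketch under stated assumptions, not the applicant's implementation and not a method disclosed by Ioffe, Graves, or Wiesler: the specified distribution is assumed to be zero mean and unit variance, the normalization parameters are assumed to be a running mean (shift) and standard deviation (scale) maintained with Welford's online update, the neural network is replaced by a one-weight linear model, and the objective function is assumed to be squared error. The names RunningNormalizer and train are hypothetical.

```python
import math

class RunningNormalizer:
    """Hypothetical holder of the normalization parameters (shift, scale)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, target):
        # Welford's online update over all targets seen so far.
        self.count += 1
        delta = target - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (target - self.mean)

    @property
    def shift(self):
        return self.mean

    @property
    def scale(self):
        if self.count < 2:
            return 1.0
        std = math.sqrt(self.m2 / self.count)
        return std if std > 1e-8 else 1.0

    def normalize(self, target):
        return (target - self.shift) / self.scale


def train(items, targets, lr=0.05, epochs=200):
    norm = RunningNormalizer()
    w, b = 0.0, 0.0  # "main parameters" of the stand-in model
    for _ in range(epochs):
        for x, t in zip(items, targets):
            norm.update(t)              # update normalization parameters
            t_norm = norm.normalize(t)  # normalized target output
            y_norm = w * x + b          # normalized output of the model
            err = y_norm - t_norm       # gradient of 0.5 * err**2 w.r.t. y_norm
            w -= lr * err * x           # adjust main parameters using the error
            b -= lr * err
    return w, b, norm
```

In this sketch only w and b are adjusted by the error, while the shift and scale are updated from the targets alone, which is one way to read the recitation that the normalization parameters are different from the main parameters. An un-normalized prediction would be recovered as norm.scale * (w * x + b) + norm.shift.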

Regarding claim 3 (New) 
Ioffe in view of Graves with Wiesler teaches claim 2. 
Ioffe further teaches wherein the normalization parameters comprise a shift parameter and a scale parameter of the normalization (Pg. 3 “normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. we introduce, for each activation x(k), a pair of parameters γ(k), β(k), which scale and shift the normalized value: y(k) = γ(k)x(k) + β(k).”).  
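For illustration only, the scale-and-shift transform quoted from Ioffe can be sketched as follows. This is a minimal sketch: the function name `scale_and_shift`, the `eps` constant, and the sample values are assumptions made for the example and do not appear in Ioffe or in the claims.

```python
import math

def scale_and_shift(xs, gamma, beta, eps=1e-5):
    """Normalize xs to zero mean and unit variance, then apply y = gamma * x_hat + beta."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)  # population variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in xs]
    return [gamma * xh + beta for xh in x_hat]

# Assumed sample activations and parameter values.
ys = scale_and_shift([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)

# Choosing gamma = sqrt(var + eps) and beta = mean recovers the original inputs,
# which is the identity transform referred to in the quoted passage.
identity = scale_and_shift([1.0, 2.0, 3.0, 4.0], gamma=math.sqrt(1.25 + 1e-5), beta=2.5)
```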
Regarding claim 13
Claim 13 recites analogous limitations to dependent claim 2 and therefore is rejected on the same ground as dependent claim 2. 

Regarding claim 4 (New) 
Ioffe in view of Graves with Wiesler teaches claim 3. 
Ioffe further teaches wherein determining the normalized target output comprises applying the updated values of the scale parameter and the shift parameter to the target output (Pg. 3 “normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. we introduce, for each activation x(k), a pair of parameters γ(k), β(k), which scale and shift the normalized value: y(k) = γ(k)x(k) + β(k).”). 
Regarding claim 14
Claim 14 recites analogous limitations to dependent claim 3 and therefore is rejected on the same ground as dependent claim 3. 

Regarding claim 5 (New) 
Ioffe in view of Graves with Wiesler teaches claim 2. 
Ioffe further teaches wherein the normalized outputs generated by the neural network are mappable to un-normalized outputs in accordance with the normalization parameters and a set of auxiliary parameters (pg. 3 right col “we introduce, for each activation x(k), a pair of parameters γ(k), β(k), which scale and shift the normalized value: y(k) = γ(k)x(k) + β(k). These parameters are learned along with the original model parameters, and restore the representation power of the network.”).  
Regarding claim 15
Claim 15 recites analogous limitations to dependent claim 5 and therefore is rejected on the same ground as dependent claim 5. 

Regarding claim 6 (New) 
Ioffe in view of Graves with Wiesler teaches claim 5. 
Ioffe further teaches the method further comprising: updating current values of the auxiliary parameters to preserve the mapping between the un-normalized outputs (Pg. 3 right col “we introduce, for each activation x(k), a pair of parameters γ(k), β(k), which scale and shift the normalized value: y(k) = γ(k)x(k) + β(k). These parameters are learned along with the original model parameters, and restore the representation power of the network.”)
and normalized outputs despite the updating of the current values of the normalization parameters (pg. 5 left col “We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift.”).  


Regarding claim 16
Claim 16 recites analogous limitations to dependent claim 6 and therefore is rejected on the same ground as dependent claim 6. 

Regarding claim 7 (New) 
Ioffe in view of Graves with Wiesler teaches claim 6. 
Ioffe further teaches wherein updating the current values of the auxiliary parameters to preserve the mapping between the un-normalized outputs (pg. 3 right col “we introduce, for each activation x(k), a pair of parameters γ(k), β(k), which scale and shift the normalized value: y(k) = γ(k)x(k) + β(k). These parameters are learned along with the original model parameters, and restore the representation power of the network.”)
and normalized outputs despite the updating of the current values of the normalization parameters comprises updating the current values of the auxiliary parameters to cancel out the effect of updating the current values of the normalization parameters on the mapping (pg. 5 left col “normalizing it is likely to produce activations with a stable distribution. Note that, since we normalize Wu+b, the bias b can be ignored since its effect will be canceled by the subsequent mean subtraction (the role of the bias is subsumed by β in Alg. 1).”)
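For illustration only, one possible reading of updating auxiliary parameters to cancel out the effect of a normalization-parameter update is sketched below. The sketch assumes an affine mapping in which the un-normalized output is sigma * (w * h + b) + mu, where (sigma, mu) stand in for the normalization parameters and (w, b) for the auxiliary parameters. This form, the function names, and the sample values are all assumptions made for the example. They are not drawn from Ioffe or from the application.

```python
def unnormalize(h, w, b, sigma, mu):
    """Assumed mapping: normalized output (w * h + b) mapped to an un-normalized output."""
    return sigma * (w * h + b) + mu

def preserve_mapping(w, b, sigma, mu, new_sigma, new_mu):
    """Return (new_w, new_b) such that the un-normalized output is unchanged
    after (sigma, mu) are replaced by (new_sigma, new_mu)."""
    new_w = w * sigma / new_sigma
    new_b = (sigma * b + mu - new_mu) / new_sigma
    return new_w, new_b

# Assumed values before and after a normalization-parameter update.
w, b, sigma, mu = 0.5, -1.0, 2.0, 3.0
new_sigma, new_mu = 10.0, -4.0
new_w, new_b = preserve_mapping(w, b, sigma, mu, new_sigma, new_mu)
```

Substituting the new values gives new_sigma * (new_w * h + new_b) + new_mu = sigma * (w * h + b) + mu for every h, so under these assumptions the change in (sigma, mu) is exactly cancelled.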
Regarding claim 17
Claim 17 recites analogous limitations to dependent claim 7 and therefore is rejected on the same ground as dependent claim 7. 

Regarding claim 8 (New) 
Ioffe in view of Graves with Wiesler teaches claim 6.  
Ioffe further teaches wherein determining the error comprises: processing the training input in accordance with the current values of the main parameters to generate an initial output; (pg. 2 left col second paragraph “Batch Normalization also has a beneficial effect on the gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values.”)
and normalizing the initial output in accordance with the updated values of the auxiliary parameters (pg. 2 right col “the combination of the update to b and subsequent change in normalization led to no change in the output of the layer nor, consequently, the loss.”).  

Regarding claim 18
Claim 18 recites analogous limitations to dependent claim 8 and therefore is rejected on the same ground as dependent claim 8. 

Regarding claim 12 (New) 
Ioffe teaches …cause the one or more …to perform operations for training a neural network on training data to generate normalized outputs that are mappable to un-normalized outputs in accordance with a set of normalization parameters, (pg. 3 right col “we introduce, for each activation x(k), a pair of parameters γ(k), β(k), which scale and shift the normalized value: y(k) = γ(k)x(k) + β(k). These parameters are learned along with the original model parameters, and restore the representation power of the network.”)
wherein the training data comprises a sequence of training items and, for each training item in the sequence, (abstract “Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch[corresponds to sequence of training items].”)
a respective target output associated with the training item, wherein the target output is an output expected to be generated by the neural network for the training item, (pg. 3 left col “The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have the mean of zero and the variance of 1” also see algorithm 1 where the output is based on the input training data mini-batch which corresponds to training item)
and wherein the operations comprise, for each training item in the sequence: (pg. 3 right col “In the batch setting where each training step is based on the entire training set, we would use the whole set to normalize activations”)
obtaining a target output for the training item from the training data; (Examiner interprets the target output as the output of the machine learning model. Ioffe teaches obtaining the target output by training a machine learning model using training data [mini-batch]; see Algorithm 1)
…
wherein the normalization parameters are different from the main parameters of the neural network; (Examiner notes that the normalization parameters are different from the main parameters because the normalization parameters are shifted; see pg. 3 right col “we introduce, for each activation x (k), a pair of parameters γ (k) , β(k), which scale and shift the normalized value… These parameters are learned along with the original model parameters[corresponds to main parameters], and restore the representation power of the network.”);
determining a normalized target output for the training item by normalizing the target output for the training item in accordance with the updated normalization parameter values (pg. 8 left col “The goal of Batch Normalization is to achieve a stable distribution of activation values throughout training, and in our experiments we apply it before the nonlinearity since that is where matching the first and second moments is more likely to result in a stable distribution. On the contrary,… apply the standardization layer to the output of the nonlinearity, which results in sparser activations.” And also see pg. 2 right col “if these modifications are interspersed with the optimization steps, then the gradient descent step may attempt to update the parameters in a way that requires the normalization to be updated, which reduces the effect of the gradient step.”); 
processing the training item using the neural network to generate a normalized output of the neural network for the training item in accordance with current values of the main parameters of the neural network (pg. 2 right col “For example, consider a layer with the input u that adds the learned bias b, and normalizes the result by subtracting the mean of the activation computed over the training data: x̂ = x − E[x] where x = u + b, X = {x1…N} is the set of values of x over the training set”);
…
and using the error to adjust the current values of the main parameters of the neural network (Pg. 4 left col “During training we need to backpropagate the gradient of loss ℓ through this transformation, as well as compute the gradients with respect to the parameters of the BN transform”);  
Ioffe does not teach a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers,
…
updating current values of the normalization parameters to account for magnitudes of target outputs for training inputs changing during training by updating the current values so that the normalized target outputs for the training items up to and including the training item in the sequence of training data have a specified distribution, 
…
determining an error for the training item using an objective function that takes as input (i) the normalized output generated for the training item by the neural network and (ii) the normalized target output generated by normalizing the target output associated with the training item in the training data.
Graves teaches determining an error for the training item using an objective function that takes as input (i) the normalized output generated for the training item by the neural network (Graves teaches an objective function, as evidenced by pg. 2 left col fifth paragraph “The basic system is enhanced by a new objective function that trains the network to directly optimise the word error rate.” The normalized outputs are generated from the training data; see pg. 2 right col section 2 “Given an input sequence x = (x1, . . . , xT ), a standard recurrent neural network (RNN) computes the hidden vector sequence h = (h1, . . . , hT ) and output vector sequence y = (y1, . . . , yT ) by iterating the following equations from t = 1 to T”)
and (ii) the normalized target output generated by normalizing the target output associated with the training item in the training data (pg. 4 left col “Given a length T input sequence x, the output vectors yt are normalised with the softmax function, then interpreted as the probability of emitting the label (or blank) with index k at time t” see pg. 2 right col section 2 “Given an input sequence x = (x1,..., xT ), a standard recurrent neural network (RNN) computes the hidden vector sequence h = (h1,..., hT ) and output vector sequence y = (y1,..., yT ) by iterating the following equations from t = 1 to T” also see pg. 6 section 6 “For both training sets, the RNN was trained with CTC, as described in Section 3, using the characters in the transcripts as the target sequences. The RNN was then retrained to minimize the expected word error rate using the method from Section 4, with five alignment samples per sequence.”). 
Ioffe and Graves are analogous art because they are both directed to data normalization.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the batch normalization using a deep neural network of Ioffe with determining an error for the training item using an objective function of Graves in order to train a neural network to “minimize the expectation of an arbitrary transcription loss function” as disclosed by Graves (Abstract “The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model.”).
Ioffe in view of Graves does not teach a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers,
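For illustration only, the softmax normalisation of output vectors referred to in the Graves citation above can be sketched as follows. This is a minimal sketch: the function name and the sample logits are assumptions made for the example, and subtracting the maximum is a common numerical-stability step that is not stated in the quoted passage.

```python
import math

def softmax(logits):
    """Map a vector of real-valued outputs to probabilities that sum to one."""
    m = max(logits)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed sample output vector for a single time step.
probs = softmax([2.0, 1.0, 0.1])
```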
…
updating current values of the normalization parameters to account for magnitudes of target outputs for training inputs changing during training by updating the current values so that the normalized target outputs for the training items up to and including the training item in the sequence of training data have a specified distribution.
Wiesler teaches a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, (pg. 182 right col section 3 “All experiments are performed with our open source DNN tool which is part of RASR [23]. The complete training is performed on a GPU[corresponds to computer with processor].”)
updating current values of the normalization parameters to account for magnitudes of target outputs for training inputs changing during training by updating the current values so that the normalized target outputs for the training items up to (pg. 181 left col section 2.2 “Instead of explicitly normalizing the features and optimizing F, the parameters can be mapped to the normalized parameter space with φ[corresponds to magnitude], updated with the SGD rule (3), and mapped back to the original parameter space” see equation 10) and including the training item in the sequence of training data have a specified distribution (pg. 181 left col section 2.2 first paragraph “It can be concluded, that convergence speed is improved when mean and variance of the input features are normalized. For neural networks, only the input to the lowest layer can be normalized directly, because the input to the other layers changes dynamically during training. The idea of our proposed algorithm is to perform a mean normalization step on model side instead of explicitly normalizing the features. Running averages of the activations are used for the required mean statistics”).
Ioffe, Graves, and Wiesler are analogous art because they are all directed to neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ioffe in view of Graves with updating values of the normalization parameters to account for magnitudes of target outputs using the mean-normalized Stochastic Gradient Descent of Wiesler in order to optimize training using a second-order stochastic optimization algorithm that improves generalization performance, as disclosed by Wiesler (pg. 1 right col section 2 “we describe our proposed optimization algorithm, prove its convergence, and describe methods to improve generalization performance.”).




Regarding claim 19 (New) 
Claim 19 recites analogous limitations to independent claim 12 and therefore is rejected on the same ground as independent claim 12. 
In addition, Wiesler teaches one or more non-transitory computer-readable media storing instructions that, when executed by one or more computers, cause the one or more computers to perform operations… (pg. 182 right col section 3 “All experiments are performed with our open source DNN tool which is part of RASR [23]. The complete training is performed on a GPU[corresponds to computer with processor].”)


Claim(s) 9-11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ioffe et al. (“Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, hereinafter: Ioffe) in view of Graves et al. (“Towards End-to-End Speech Recognition with Recurrent Neural Networks”, hereinafter: Graves) in view of Wiesler et al. (“Mean-Normalized Stochastic Gradient for Large-Scale Deep Learning”, hereinafter: Wiesler) and further in view of Sola et al. (“Importance of Input Data Normalization for the Application of Neural Networks to Complex Industrial Problems”, hereinafter: Sola).
Regarding claim 9 (New) 
Ioffe in view of Graves with Wiesler teaches claim 8.  
Ioffe further teaches wherein using the error to adjust the current values of the parameters of the neural network comprises: performing an iteration of a neural network training technique to adjust the current values of the main parameters of the neural network.  
Sola further teaches wherein using the error to adjust the current values of the parameters of the neural network comprises: performing an iteration of a neural network training technique to adjust the current values of the main parameters of the neural network (pg. 1467 left col “Therefore, the initial state for the backpropagation algorithm to begin is always a point in the vicinity of coordinate space origin, while distance to the desired minimum is drastically changed by the scales considered in each case. So, scales that compress all the searching space to a unitary hypercube reduce the distance to be covered, iteration by iteration, by the backpropagation algorithm.”).
Ioffe, Graves, Wiesler, and Sola are analogous art because they are all directed to neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ioffe in view of Graves with Wiesler to incorporate the teaching of Sola to include an iteration of a neural network training technique to adjust the current values, which gradually decreases the error as normalization reduces the differences between the variation ranges of the different variables, as disclosed by Sola (pg. 1466 “In both cases it can be seen that the error gradually decreases as normalization reduces the differences between the variation range of the different variables. It is interesting to note that when the variables span in the same orders of magnitude (as happens with normalizations 3 and 4), the results are better for the case with less variables out of the most common order of magnitude.”).
  
Regarding claim 10 (New) 
Ioffe in view of Graves with Wiesler with Sola teaches claim 9.  
Sola further teaches the method further comprising: adjusting the updated values of the auxiliary parameters as part of performing the iteration of the neural network training technique (Pg. 1466 left col “The final average mean errors obtained are very similar in all cases (differences below 0.05%), while the one chosen shows slightly better performance in the initial stages of the training phase (up to 5 10 iterations).”).  
Ioffe, Graves, Wiesler, and Sola are analogous art because they are all directed to neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ioffe in view of Graves with Wiesler to incorporate the teaching of Sola to include an iteration of a neural network training technique to adjust the current values, which gradually decreases the error as normalization reduces the differences between the variation ranges of the different variables, as disclosed by Sola (pg. 1466 “In both cases it can be seen that the error gradually decreases as normalization reduces the differences between the variation range of the different variables. It is interesting to note that when the variables span in the same orders of magnitude (as happens with normalizations 3 and 4), the results are better for the case with less variables out of the most common order of magnitude.”).

Regarding claim 11 (New) 
Ioffe in view of Graves with Wiesler and Sola teaches claim 10. 
Ioffe further teaches wherein the neural network training technique is stochastic gradient descent (SGD) (pg. 4 section 3.1 “A model employing Batch Normalization can be trained using batch gradient descent, or Stochastic Gradient Descent”).
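For illustration only, a single iteration of stochastic gradient descent that uses an error to adjust parameter values can be sketched as follows. This is a minimal sketch on a one-input linear model with a squared-error objective. The model, the learning rate, the data, and the function name `sgd_step` are assumptions made for the example and are not taken from Ioffe, Sola, or the claims.

```python
def sgd_step(w, b, x, target, lr=0.1):
    """One SGD iteration for the model w * x + b under the loss 0.5 * error ** 2."""
    error = (w * x + b) - target
    # Gradients of the loss with respect to w and b are error * x and error.
    return w - lr * error * x, b - lr * error

# Assumed noise-free training items generated by target = 2 * x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = 0.0, 0.0
for _ in range(500):  # repeated passes over the sequence of training items
    for x, target in data:
        w, b = sgd_step(w, b, x, target)
```

In this sketch each training item triggers its own parameter update, which is what distinguishes stochastic gradient descent from batch gradient descent over the entire training set.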


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VAN C MANG whose telephone number is (571)270-7598. The examiner can normally be reached Mon - Fri 8:00-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571)272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/VAN C MANG/Examiner, Art Unit 2126