Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The filing date of the present application is 03/20/2017.
This action is in response to amendments and/or remarks filed on 04/15/2021. In the current amendments, claims 1, 9 and 21 have been amended, and claims 15-20 and 22 have been cancelled. Claims 1-14 and 21 are pending and have been examined. 
In view of Applicant’s amendments and/or remarks, the objections to claim 22 made in the previous Office Action have been withdrawn.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-14 and 21 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claims 1 and 9 recite “a task completion”, but it is unclear whether the “task” refers to the previously recited “a given task” or to a different task. If the tasks are different, “another task”, “second task” or similar language may be used. If they are the same, more specific language may be necessary (e.g., “until a task completion” may be changed to “until the given task completion” or something similar).
Claim 21 recites the limitation "the past time series input" in line 3. There is insufficient antecedent basis for this limitation in the claim.
The amended claim 21 recites “an RNN layer that computes a nonlinear feature map of the past time series input to the first neural network” while the original claim 11 recites “updating a state of the RNN based on a feature mapping of a history of the RNN and a current time frame of the input” (with emphasis underlined). The two recitations appear to contradict each other, since claim 21 refers to the past time series input while claim 11 refers to a current time frame of the input. It appears that claim 21 is based on par 23 “A Gaussian DyBM may be connected with an M-dimensional RNN, whose state vector [symbol image] is a nonlinear feature mapping [equation image] of its own history and the N-dimensional time-series input data vector at time t-1” and par 40 “RNN 220 may be an M-dimensional RNN, whose state vector changes dependent on a nonlinear feature mapping of its own history and the N-dimensional time-series input data vector at time t-1”, while claim 11 is based on par 68 “The updating section may be further configured to update a state of RNN 520 based on a feature mapping of a history of RNN 520 and a current time frame of the input” (with emphasis underlined). Appropriate correction and/or explanation may be required.
Claims 1, 9 and 21 each recite limitations that raise issues of indefiniteness as set forth above, and dependent claims 2-8 and 10-14 are rejected at least based on their direct and/or indirect dependency from independent claims 1 and 9. Appropriate explanation and/or amendment is required.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:


(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 3 and 21 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Boulanger-Lewandowski et al. (“Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription”).

Regarding claim 1, 
Boulanger-Lewandowski teaches
A computer program product including one or more computer readable storage mediums collectively storing program instructions that are executable by a computer to cause the computer to perform operations comprising ([sec 1] “internal memory”): 

providing an input to a first neural network, the first neural network including a plurality of first parameters ([fig 2]; [secs 2-4] “The joint probability distribution of the RNN-RBM is also given by equation (7), but with hˆ(t) defined arbitrarily, here as per equation (11). … For simplicity, we consider the RBM parameters to be W, bv(t), bh(t) (i.e. only the biases are variable) and a single-layer RNN (bottom portion of Fig. 2(b)) whose hidden units hˆ(t) are only connected to their direct predecessor hˆ(t−1) and to v(t) by the relation: [equation image] (11)”; “Restricted Boltzmann machine” reads on “first neural network”, and “W, bv(t), bh(t)” read on “a plurality of first parameters”. In addition, v(t) of eq (7) reads on “input”.); 

providing the input to a recurrent neural network configured as a non-linear extension of the first neural network to cooperatively process the input for a given task until a task completion providing a final output from the recurrent neural network, the recurrent neural network including a plurality of second parameters ([fig 2] “The RBM biases bh(t), bv(t) are a linear function of hˆ(t−1)”; [secs 2-4] “The joint probability distribution of the RNN-RBM is also given by equation (7), but with hˆ(t) defined arbitrarily, here as per equation (11). … For simplicity, we consider the RBM parameters to be W, bv(t), bh(t) (i.e. only the biases are variable) and a single-layer RNN (bottom portion of Fig. 2(b)) whose hidden units hˆ(t) are only connected to their direct predecessor hˆ(t−1) and to v(t) by the relation: [equation image] (11) The RBM portion of the RNN-RBM (upper portion of Fig. 2(b)) is otherwise exactly the same as its RTRBM counterpart. This gives the single-layer RNN-RBM nine parameters: W, bv, bh, W’, W’’, hˆ(0), W2, W3, bhˆ.”; see also [sec 6] “probabilistic modeling of sequences of polyphonic music”; [equation (11) image] based on the input v(t) reads on “providing the input to a recurrent neural network”. In addition, “hˆ(0), W2, W3, bhˆ” read on “a plurality of second parameters”. Furthermore, “joint probability distribution of the RNN-RBM” reads on “non-linear extension of the first neural network to cooperatively process the input”. Moreover, “[equation (11) image]” reads on “final output from the recurrent neural network” since [symbol image] is provided as a final output at each time step from the RNN for calculating weights for the RBM. Furthermore, training or testing may read on “a given task” and training may read on “task”.); and

updating at least one first parameter of the plurality of first parameters based on the final output from the recurrent neural network ([fig 2] “The RBM biases bh(t), bv(t) are a linear function of hˆ(t−1)”; [secs 2-4] “For simplicity, we consider the RBM parameters to be W, bv(t), bh(t) (i.e. only the biases are variable) and a single-layer RNN (bottom portion of Fig. 2(b)) whose hidden units hˆ(t) are only connected to their direct predecessor hˆ(t−1) and to v(t) by the relation: [equation image] (11) The RBM portion of the RNN-RBM (upper portion of Fig. 2(b)) is otherwise exactly the same as its RTRBM counterpart. This gives the single-layer RNN-RBM nine parameters: W, bv, bh, W’, W’’, hˆ(0), W2, W3, bhˆ.”; “Restricted Boltzmann machine” reads on “first neural network”, and “W, bv(t), bh(t)” read on “a plurality of first parameters”. In addition, “The RBM biases bh(t), bv(t) are a linear function of hˆ(t−1)” of fig 2 and [equation (11) image] based on the input v(t) read on “updating at least one first parameter … based on the final output from the recurrent neural network”.), 

wherein the given task is a post processing task, and wherein the input provided to the first neural network and the recurrent neural network are identical values of multiple time frames ([fig 2] “The RBM biases bh(t), bv(t) are a linear function of hˆ(t−1)”; [secs 2-4] “The joint probability distribution of the RNN-RBM is also given by equation (7), but with hˆ(t) defined arbitrarily, here as per equation (11). … For simplicity, we consider the RBM parameters to be W, bv(t), bh(t) (i.e. only the biases are variable) and a single-layer RNN (bottom portion of Fig. 2(b)) whose hidden units hˆ(t) are only connected to their direct predecessor hˆ(t−1) and to v(t) by the relation: [equation image] (11) The RBM portion of the RNN-RBM (upper portion of Fig. 2(b)) is otherwise exactly the same as its RTRBM counterpart. This gives the single-layer RNN-RBM nine parameters: W, bv, bh, W’, W’’, hˆ(0), W2, W3, bhˆ.”; see also [sec 6] “probabilistic modeling of sequences of polyphonic music … Although it is not strictly necessary, learning is facilitated if the sequences are transposed in a common tonality (e.g. C major/minor) as preprocessing.”; “learning is facilitated if the sequences are transposed in a common tonality (e.g. C major/minor) as preprocessing” reads on “given task is a post processing task” since preprocessing may be carried out before learning or testing. In addition, eq (1), eq (7) and eq (11) read on “identical values of multiple time frames” based on the input, v(t), which is provided to each neural network. See “Response to Arguments” as well.).
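For illustration only, the relationship relied on in the mapping of claim 1 can be sketched in code. The sketch assumes that equations (8), (9) and (11) of the reference have the affine and sigmoid forms suggested by the quoted parameter list (W’, W’’, W2, W3, bhˆ) and by the statement that the RBM biases are a linear function of hˆ(t−1); the equation images are not reproduced in this action, so the exact forms, which of W’ and W’’ feeds which bias, and all function and key names below are assumptions rather than the reference's own code. The RBM weight matrix W is omitted because only the biases vary with time.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_v, n_h, n_r, seed=0):
    """Hypothetical parameter container; shapes chosen so the products below are defined."""
    rng = np.random.default_rng(seed)
    small = lambda *shape: 0.01 * rng.standard_normal(shape)
    return {
        "bv": np.zeros(n_v), "bh": np.zeros(n_h),       # static RBM biases
        "Wp": small(n_h, n_r), "Wpp": small(n_v, n_r),  # stand-ins for W' and W''
        "W2": small(n_r, n_v), "W3": small(n_r, n_r),   # RNN input and recurrent weights
        "b_hhat": np.zeros(n_r), "h_hat0": np.zeros(n_r),
    }

def rnn_rbm_step(v_t, h_hat_prev, p):
    # Assumed form of eqs (8)-(9): time-varying biases are affine in h_hat(t-1).
    bh_t = p["bh"] + p["Wp"] @ h_hat_prev
    bv_t = p["bv"] + p["Wpp"] @ h_hat_prev
    # Assumed form of eq (11): the RNN state depends on v(t) and h_hat(t-1).
    h_hat_t = sigmoid(p["W2"] @ v_t + p["W3"] @ h_hat_prev + p["b_hhat"])
    return bv_t, bh_t, h_hat_t

def run_sequence(frames, p):
    """Feed the same frames v(t) to the RNN and collect the RBM biases per time step."""
    h_hat = p["h_hat0"]
    biases = []
    for v_t in frames:
        bv_t, bh_t, h_hat = rnn_rbm_step(v_t, h_hat, p)
        biases.append((bv_t, bh_t))
    return biases, h_hat
```

Under these assumptions, the biases used at time t are computed from hˆ(t−1), and hˆ(t) is then computed from the same frame v(t) that the RBM models.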

Regarding claim 3, 
Boulanger-Lewandowski further teaches
the at least one first parameter includes a bias parameter ([fig 2]; [secs 2-4] “For simplicity, we consider the RBM parameters to be W, bv(t), bh(t) (i.e. only the biases are variable) and a single-layer RNN (bottom portion of Fig. 2(b)) whose hidden units hˆ(t) are only connected to their direct predecessor hˆ(t−1) and to v(t) by the relation: Equation (11)”).	

Regarding claim 21, 
Boulanger-Lewandowski further teaches 
nonlinear analysis of the input is provided by the RNN using an RNN layer that computes a nonlinear feature map of the past time series input to the first neural network ([fig 2]; [secs 2-4] “[equation image] is the element-wise logistic sigmoid function. … For simplicity, we consider the RBM parameters to be W, bv(t), bh(t) (i.e. only the biases are variable) and a single-layer RNN (bottom portion of Fig. 2(b)) whose hidden units hˆ(t) are only connected to their direct predecessor hˆ(t−1) and to v(t) by the relation: [equation image] (11) The RBM portion of the RNN-RBM (upper portion of Fig. 2(b)) is otherwise exactly the same as its RTRBM counterpart. This gives the single-layer RNN-RBM nine parameters: W, bv, bh, W’, W’’, hˆ(0), W2, W3, bhˆ.”; see also [sec 6] “probabilistic modeling of sequences of polyphonic music”; [equation (11) image] based on the input v(t) reads on “nonlinear analysis of the input is provided by the RNN using an RNN layer that computes a nonlinear feature map of the past time series input to the first neural network” because the input is used and analyzed in RNN using an RNN layer, and because the RNN layer computes a nonlinear feature map of the past time series input to the first neural network since hˆ(t) is calculated based on hˆ(t−1) which has a past time series input, v(t−1).),
linear analysis of the input is provided by the first neural network ([fig 2]; [secs 2-4] “An RBM is an energy-based model where the joint probability of a given configuration of the visible vector v (inputs) and the hidden vector h is: [equation image] (1) where bv, bh and W are the model parameters and Z is the usually intractable partition function.”; “[equation image]” with RBM reads on “linear analysis of the input is provided by the first neural network” since the input is used and analyzed in RBM based on the mathematical expression which has linear operations only.).
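As an aid to reading the mapping of eq (1), the following is a minimal sketch of the joint probability of a small binary RBM, with the partition function Z computed by brute-force enumeration. It assumes the common convention P(v, h) = exp(bv·v + bh·h + hᵀWv)/Z; the sign convention in the reference's eq (1) is not reproduced here and may differ, and the function names are hypothetical. In this form the exponent is linear in v for a fixed h.

```python
import itertools
import numpy as np

def rbm_exponent(v, h, bv, bh, W):
    """Log of the unnormalized joint probability; W has shape (n_h, n_v)."""
    return bv @ v + bh @ h + h @ W @ v

def partition_function(bv, bh, W):
    """Z by enumeration over all binary (v, h); feasible only for tiny models."""
    n_v, n_h = len(bv), len(bh)
    return sum(
        np.exp(rbm_exponent(np.array(v), np.array(h), bv, bh, W))
        for v in itertools.product([0, 1], repeat=n_v)
        for h in itertools.product([0, 1], repeat=n_h)
    )

def rbm_joint(v, h, bv, bh, W):
    """Joint probability P(v, h) under the assumed convention."""
    return np.exp(rbm_exponent(v, h, bv, bh, W)) / partition_function(bv, bh, W)
```

The enumeration grows as 2^(n_v + n_h), which is why Z is described in the quoted passage as usually intractable.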

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Boulanger-Lewandowski et al. (“Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription”) in view of Osogami et al. (“Learning dynamic Boltzmann machines with spike-timing dependent plasticity”).

Regarding claim 2, 
Boulanger-Lewandowski teaches claim 1. 

However, Boulanger-Lewandowski does not teach
the first neural network:
a plurality of layers of nodes among a plurality of nodes, each layer sequentially forwarding values of a time frame of the input, the plurality of layers of nodes: 
a first layer of a plurality of input nodes among the plurality of nodes, the input nodes receiving values of a current time frame of the input;
a plurality of intermediate layers, each node in each intermediate layer forwarding a value to a node in a subsequent or shared layer; and 
a plurality of weight values among the plurality of first parameters, each weight value to be applied to each value in the corresponding node to obtain a value propagating from a pre-synaptic node to a post-synaptic node.

Osogami teaches
the first neural network:
a plurality of layers of nodes among a plurality of nodes, each layer sequentially forwarding values of a time frame of the input ([figs 4-5]; [secs 2-3] “Formally, we define the DyBM-T as the Boltzmann machine having T layers from −T + 1 to 0, where T is a positive integer or infinity. Let x ≡ (x[t])−T<t<=0, where x[t] is the values of the units in the t-th layer, which we consider as the values at time t.”; Fig 4 and fig 5 read on “each layer sequentially forwarding values of a time frame of the input”. Note that Boulanger-Lewandowski teaches “input”. In addition, each circle of fig 4 reads on “nodes”.), the plurality of layers of nodes:

a first layer of a plurality of input nodes among the plurality of nodes, the input nodes receiving values of a current time frame of the input ([figs 4-5]; [secs 2-3] “Formally, we define the DyBM-T as the Boltzmann machine having T layers from −T + 1 to 0, where T is a positive integer or infinity. Let x ≡ (x[t])−T<t<=0, where x[t] is the values of the units in the t-th layer, which we consider as the values at time t.”; The rightmost layer of fig 4 reads on “first layer”, and each node of the rightmost layer of fig 4 reads on “input nodes”. In addition, fig 4 and fig 5 read on “the input nodes receiving values of a current time frame of the input”. Note that Boulanger-Lewandowski teaches “input”.);

a plurality of intermediate layers, each node in each intermediate layer forwarding a value to a node in a subsequent or shared layer ([figs 4-5]; [secs 2-3] “Formally, we define the DyBM-T as the Boltzmann machine having T layers from −T + 1 to 0, where T is a positive integer or infinity. Let x ≡ (x[t])−T<t<=0, where x[t] is the values of the units in the t-th layer, which we consider as the values at time t.”; The layers other than the rightmost layer of fig 4 read on “intermediate layers”. In addition, fig 4 and fig 5 read on “each node in each intermediate layer forwarding a value to a node in a subsequent or shared layer”.);

a plurality of weight values among the plurality of first parameters, each weight value to be applied to each value in the corresponding node to obtain a value propagating from a pre-synaptic node to a post-synaptic node ([figs 4-5] “Spikes traveling from a pre-synaptic neuron (i) to a post-synaptic neuron (j) and eligibility traces.”; [secs 2-3] “Formally, we define the DyBM-T as the Boltzmann machine having T layers from −T + 1 to 0, where T is a positive integer or infinity. Let x ≡ (x[t])−T<t<=0, where x[t] is the values of the units in the t-th layer, which we consider as the values at time t. … For δ ≥ 1, let W[δ] be the matrix whose (i, j) element, Wi,j[δ], denotes the weight between the i-th unit at time −δ and the j-th unit at time 0 for any t.”; “Wij[δ]” of fig 4 reads on “weight values”. Note that Boulanger-Lewandowski teaches “first parameters”. In addition, “Wi,j[δ], denotes the weight between the i-th unit at time −δ and the j-th unit at time 0 for any t” reads on “each weight value to be applied to each value in the corresponding node”.).

Boulanger-Lewandowski and Osogami are both in the same field of endeavor of processing input signals with a Boltzmann machine and are analogous. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the Boltzmann machine system of Boulanger-Lewandowski with the multiple layers of Osogami. Doing so would significantly simplify the learning rule for the DyBM (dynamic Boltzmann machine) and would exhibit various characteristics of STDP that have been observed in biological neural networks when the DyBM has an infinite number of layers and particularly structured parameters (Osogami, sec 1).
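The layered structure quoted from Osogami can be illustrated with a short sketch. It follows only the quoted definitions (layers x[t] for past time steps, and W[δ] whose (i, j) element weights the i-th unit at time −δ against the j-th unit at time 0) and computes the weighted sum of past values reaching the time-0 units. It is not the DyBM energy function or learning rule, and the function names and the use of a deque are assumptions made for illustration.

```python
from collections import deque
import numpy as np

def make_history(num_past_layers):
    """FIFO of past frames; index 0 holds x[-1], index 1 holds x[-2], and so on."""
    return deque(maxlen=num_past_layers)

def push_frame(history, x):
    """Shift every stored frame one layer further into the past and store the newest."""
    history.appendleft(np.asarray(x, dtype=float))

def weighted_past_input(history, weights):
    """Sum over delta of x[-delta] @ W[delta]; weights[d - 1] plays the role of W[d]."""
    total = np.zeros(weights[0].shape[1])
    for x_past, w_delta in zip(history, weights):
        total += x_past @ w_delta  # element j sums x_i * W[i, j] over units i
    return total
```

For example, with two past layers, W[1] equal to the identity and W[2] equal to twice the identity, pushing the frame (1, 0) and then the frame (0, 1) gives a weighted input of (2, 1) at the time-0 units.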

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Boulanger-Lewandowski et al. (“Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription”) in view of Osogami et al. (“Seven neurons memorizing sequences of alphabetical images via spike-timing dependent plasticity”, hereinafter Osogami2015).

Regarding claim 4, 
Boulanger-Lewandowski teaches claim 1.
However, Boulanger-Lewandowski does not teach
initializing the plurality of first parameters to zero.

Osogami2015 teaches
initializing the plurality of first parameters to zero ([sec “Results”] “Here, the values of the eligibility traces and the FIFO queues were reset to zero before a cue was presented.”).

Osogami2015, sec “Results”).

Claims 5-6 are rejected under 35 U.S.C. 103 as being unpatentable over Boulanger-Lewandowski et al. (“Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription”) in view of Ranzato et al. (“Modeling Natural Images Using Gated MRFs”).

Regarding claim 5, 
Boulanger-Lewandowski teaches claim 1.
However, Boulanger-Lewandowski does not teach 
estimating a mean of the current time frame of the input using a conditional probability density of the input, wherein a current time frame of the input is assumed to have a Gaussian distribution.

Ranzato teaches
estimating a mean of the current time frame of the input using a conditional probability density of the input, wherein a current time frame of the input is assumed to have a Gaussian distribution ([sec 2] “they contribute to control the mean of the conditional distribution over the input [equation image] [equation image] (7) where I is the identity matrix, W ∈ R^(D×M) is a matrix of trainable parameters, and bx ∈ R^D is a vector of trainable biases for the input variables.”; “control the mean of the conditional distribution over the input” and eq (7) read on “estimating a mean of the current time frame of the input using a conditional probability density of the input”.).

Boulanger-Lewandowski and Ranzato are both in the same field of endeavor of processing input signals with a Boltzmann machine and are analogous. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the Boltzmann machine system of Boulanger-Lewandowski with the mean estimation of Ranzato. Doing so would enable the conditional distribution over the input pixels to be a Gaussian with not only its covariance but also its mean depending on the states of the latent variables (Ranzato, sec 2).
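For illustration, the notion of a Gaussian conditional distribution whose mean is controlled by latent variables can be sketched as follows. The sketch assumes a unit-covariance energy of the form E(x, h) = ½ xᵀx − xᵀWh − bxᵀx, which is consistent with the quoted description (identity matrix I, W ∈ R^(D×M), bx ∈ R^D) and gives p(x | h) = N(Wh + bx, I). The exact eq (7) of Ranzato is not reproduced in this action, so this form and the function names are assumptions.

```python
import numpy as np

def conditional_mean(h, W, bx):
    """Mean of p(x | h) under the assumed unit-covariance energy: W h + bx."""
    return W @ h + bx

def conditional_log_density(x, h, W, bx):
    """Log of N(x; W h + bx, I) for a D-dimensional input x."""
    mu = conditional_mean(h, W, bx)
    d = x.shape[0]
    return -0.5 * np.sum((x - mu) ** 2) - 0.5 * d * np.log(2.0 * np.pi)
```

Under this assumption, the estimated mean of the current input frame is simply `conditional_mean(h, W, bx)`, and the log density is largest when x equals that mean.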

Regarding claim 6, 
Boulanger-Lewandowski and Ranzato teach claim 5.
Boulanger-Lewandowski further teaches
the updating includes learning the first parameters … and a plurality of output weight values of the output from the recurrent neural network ([fig 2]; [secs 2-4] “While all the parameters of the RBMs can depend on the previous time steps, we will consider the case where only the biases depend on hˆ(t−1): [equation image] (8) [equation image] (9) … The hidden-to-bias weights W’, W’’ can then be initialized to small random values, such that the sequential model will initially behave like independent RBMs, eventually departing from that state. … The gradient then back-propagates through the hidden-to-bias parameters (eq. 8 and 9):”; “biases” read on “first parameters”. In addition, “hidden-to-bias weights W’, W’’” reads on “a plurality of output weight values of the output from the recurrent neural network”.).

Ranzato teaches 
learning … a standard deviation of the current time frame of the input ([secs 2-4] “In this work, we extend these two classes of models with a new model whose conditional distribution over the input has both a mean and a covariance matrix determined by latent variables.”).

Boulanger-Lewandowski and Ranzato are both in the same field of endeavor of processing input signals with a Boltzmann machine and are analogous. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the Boltzmann machine system of Boulanger-Lewandowski and Ranzato with the standard deviation learning of Ranzato. Doing so would enable the conditional distribution over the input pixels to be a Gaussian with not only its covariance but also its mean depending on the states of the latent variables (Ranzato, sec 2).

Claims 7-8 are rejected under 35 U.S.C. 103 as being unpatentable over Boulanger-Lewandowski et al. (“Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription”) in view of Ranzato et al. (“Modeling Natural Images Using Gated MRFs”), and further in view of Osogami et al. (“Seven neurons memorizing sequences of alphabetical images via spike-timing dependent plasticity”, hereinafter Osogami2015).

Regarding claim 7, 
Boulanger-Lewandowski and Ranzato teach claim 6.
However, Boulanger-Lewandowski and Ranzato do not teach

updating a plurality of eligibility traces and a plurality of first-in-first-out (FIFO) queues.

Osogami2015 teaches 
updating a plurality of eligibility traces and a plurality of first-in-first-out (FIFO) queues ([sec “Results”] “We did not train the DyBM but did update its eligibility traces and FIFO queues when it was presented with cues or when it was generating sequential patterns.”).

Boulanger-Lewandowski, Ranzato and Osogami2015 are all in the same field of endeavor of processing input signals with a Boltzmann machine and are analogous. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the Boltzmann machine system of Boulanger-Lewandowski and Ranzato with the updating of the eligibility traces and FIFO queues of Osogami2015. Doing so would enable the DyBM to generate varying patterns during multiple time periods by updating the values of the eligibility traces and the FIFO queues each time a 7-bit pattern was presented (Osogami2015, sec “Results”).
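The following sketch illustrates, in generic terms, what updating and resetting eligibility traces and FIFO queues can look like. It uses a fixed-length delay line feeding an exponentially decaying trace, which is a common construction; it is not asserted to be the exact update rule of Osogami2015, and the class name, the `decay` and `delay` parameters, and the update formula are assumptions.

```python
from collections import deque
import numpy as np

class TraceWithQueue:
    """Hypothetical per-connection state: a FIFO delay line and a decaying trace."""

    def __init__(self, n_units, delay, decay):
        if delay < 1:
            raise ValueError("delay must be at least 1")
        self.n_units, self.delay, self.decay = n_units, delay, decay
        self.reset()

    def reset(self):
        """Reset the queue and the trace to zero, e.g. before a cue is presented."""
        self.queue = deque(
            [np.zeros(self.n_units) for _ in range(self.delay)], maxlen=self.delay
        )
        self.trace = np.zeros(self.n_units)

    def update(self, x):
        """Push x into the queue; the value leaving the queue feeds the trace."""
        oldest = self.queue[0]
        self.queue.append(np.asarray(x, dtype=float))  # maxlen drops the oldest entry
        self.trace = self.decay * (self.trace + oldest)
        return self.trace
```

With `delay=2` and `decay=0.5`, a single spike of 1 followed by zeros leaves the trace at 0 for two updates, then produces 0.5 and 0.25 on the next two updates.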

Regarding claim 8, 
Boulanger-Lewandowski, Ranzato and Osogami2015 teach claim 7. 
Boulanger-Lewandowski further teaches 
evaluating a learning objective of the first neural network ([fig 2]; [secs 2-4] “The joint probability distribution of the RNN-RBM is also given by equation (7), but with hˆ(t) defined arbitrarily, here as per equation (11). … For simplicity, we consider the RBM parameters to be W, bv(t), bh(t) (i.e. only the biases are variable) and a single-layer RNN (bottom portion of Fig. 2(b)) whose hidden units hˆ(t) are only connected to their direct predecessor hˆ(t−1) and to v(t) by the relation: [equation image] (11)”; Eq (7) reads on “learning objective”.).

Claims 9-13 are rejected under 35 U.S.C. 103 as being unpatentable over Tang et al. (Knowledge Transfer Pre-training) in view of Vinyals et al. (Show and Tell: A Neural Image Caption Generator).

Regarding claim 9, 
Tang teaches
providing an input to a first neural network ([sec IV] “A DNN system is then trained with the alignment provided by the GMM system. The feature used for the DNN system is the 40-dimensional Fbanks. A symmetric 11-frame window is applied to concatenate neighboring frames, and an LDA transform is used to reduce the feature dimension to 200, which forms the DNN input. The DNN architecture involves 4 hidden layers and each layer consists of 2048 units. The output layer is composed of 2008 units, equal to the total number of Gaussian mixtures in the GMM system.”; “The feature used for the DNN system is the 40-dimensional Fbanks” reads on “input”.);

providing the input to a recurrent neural network (RNN) configured as a non-linear extension of the first neural network to cooperatively process the input for a given task until a task completion providing a final output from the RNN, the RNN including a plurality of second parameters ([tables I-III]; [sec IV] “A DNN system is then trained with the alignment provided by the GMM system. The feature used for the DNN system is the 40-dimensional Fbanks. A symmetric 11-frame window is applied to concatenate neighboring frames, and an LDA transform is used to reduce the feature dimension to 200, which forms the DNN input. The DNN architecture involves 4 hidden layers and each layer consists of 2048 units. The output layer is composed of 2008 units, equal to the total number of Gaussian mixtures in the GMM system. … To train the RNN acoustic models, the DNN model of the baseline system is used as the teacher model. The RNN is based on the LSTM structure, where the input features are the 40-dimensional Fbanks, and the output units correspond to the Gaussian mixtures as in the DNN model. The momentum is empirically set to 0.9, and the starting learning rate is set to 0.0001 by default.”; [sec III] “Note that learning soft targets is not the ultimate goal of the model training, so a fine-tuning step is required to refine the model with the original hard targets. In this sense, the knowledge transfer learning is a pre-training step, which initializes the model parameters in such a way that the fine-tuning has a good starting point to reach a better local minimum, compared to training with hard targets from the beginning.”; [sec I] “This teacher model might be rather weak, but it is sufficient to direct the child model where to go. Once the teacher model helps the child model reach a reasonable place in the parameter space, the child model can learn by itself and finally finds a good local optimum, delivering a performance even better than the teacher model.”; “The RNN is based on the LSTM structure, where the input features are the 40-dimensional Fbanks” reads on “the input.” In addition, “help training complex models” and “the teacher model helps the child model reach a reasonable place in the parameter space in such a way that the fine-tuning has a good starting point to reach a better local minimum” read on “a non-linear extension of the first neural network to cooperatively process the input for a given task” since the teacher model is extended in a non-linear manner based on the student model. Note that Ba et al. (Do Deep Nets Really Need to be Deep?) teaches “knowledge transfer” and “teacher/student model”, and Hinton et al. (Distilling the Knowledge in a Neural Network) teaches the knowledge distillation algorithm in detail. Furthermore, the child model parameters read on “a plurality of second parameters”. Moreover, “the output units correspond to the Gaussian mixtures as in the DNN model” reads on “final output from the RNN”. Besides, training or testing may read on “a given task” and training may read on “task”.); and

updating the plurality of second parameters based on a learning objective of the first neural network ([sec I] “the teacher model is firstly trained and then is used to generate targets for the training data. These targets are actually posterior probabilities and so are ‘soft’ compared to the original one-hot ‘hard’ targets. The soft targets are used to train the child model. As we will see, using soft targets leads to a smoother objective function, which makes the pre-training a much easier task than training with the original hard targets”; [sec III] “We focus on the dark knowledge distiller model rather than logit matching as it showed better performance in our experiments. This model uses a well-trained DNN as the teacher model to predict the targets of the training samples, and these targets are used to train the child model. … the knowledge transfer learning is a pre-training step, which initializes the model parameters in such a way that the fine-tuning has a good starting point to reach a better local minimum, compared to training with hard targets from the beginning.”; “the teacher model is firstly trained and then is used to generate targets for the training data” reads on “a learning objective of the first neural network” because nonlinear functions of the teacher model are used for training the DNN.),

wherein the given task is a post processing task, and wherein the input provided to the first neural network and the recurrent neural network are identical values of multiple time frames ([tables I-III]; [sec IV] “the training starts from constructing a system based on Gaussian mixture models (GMMs) with the standard 13-dimensional MFCC features plus the first- and second-order derivatives. A DNN system is then trained with the alignment provided by the GMM system. The feature used for the DNN system is the 40-dimensional Fbanks. A symmetric 11-frame window is applied to concatenate neighboring frames, and an LDA transform is used to reduce the feature dimension to 200, which forms the DNN input. … To train the RNN acoustic models, the DNN model of the baseline system is used as the teacher model. The RNN is based on the LSTM structure, where the input features are the 40-dimensional Fbanks, and the output units correspond to the Gaussian mixtures as in the DNN model. The momentum is empirically set to 0.9, and the starting learning rate is set to 0.0001 by default.”; see also [sec III]; [sec I] “This teacher model might be rather weak, but it is sufficient to direct the child model where to go. 
Once the teacher model helps the child model reach a reasonable place in the parameter space, the child model can learn by itself and finally finds a good local optimum, delivering a performance even better than the teacher model.”; “The feature used for the DNN system is the 40-dimensional Fbanks” and “The RNN is based on the LSTM structure, where the input features are the 40-dimensional Fbanks” with “learning rate is set to 0.0001” read on “the input provided to the first neural network and the recurrent neural network are identical values of multiple time frames.” In addition, “constructing a system based on Gaussian mixture models (GMMs)” reads on “the given task is a post processing task” since the GMM system is constructed before training or testing the DNN and the RNN.).
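
For illustration only, the soft-target training quoted above from Tang (sec I and sec III) can be sketched in Python as follows. The function names, the toy logits and the three-class setup are hypothetical and are not taken from Tang; this is a sketch under those assumptions, not the reference's implementation. It shows a teacher's posterior probabilities being used as "soft" targets for a child model, next to a one-hot "hard" target.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0.0)

# Hypothetical toy values: both models score the same input frame
# over the same three output classes.
teacher_logits = [2.0, 1.0, 0.1]   # assumed output of an already trained teacher
child_logits = [0.5, 0.4, 0.3]     # assumed output of an untrained child

soft_target = softmax(teacher_logits)   # teacher posteriors ("soft" targets)
hard_target = [1.0, 0.0, 0.0]           # one-hot "hard" target

child_probs = softmax(child_logits)
soft_loss = cross_entropy(soft_target, child_probs)   # pre-training objective
hard_loss = cross_entropy(hard_target, child_probs)   # fine-tuning objective
```

In this sketch the soft loss would drive the pre-training step and the hard loss the later fine-tuning step, mirroring the two phases described in the quoted passages.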

However, Tang does not teach the following limitation, which Vinyals teaches:
A computer program product including one or more computer readable storage mediums collectively storing program instructions that are executable by a computer to cause the computer to perform operations comprising ([sec 3] “memory”).

Tang and Vinyals are all in the same field of endeavor of processing input signal with the neural networks and are analogous. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network system of Tang with the memory of Vinyals. Doing so would lead to providing state-of-Vinyals, sec 3).

Regarding claim 10, 
Tang and Vinyals teach claim 9.
Vinyals further teaches 
updating a state of the RNN using a nonlinear function ([figs 1-3]; [sec 3] “It is natural to model p(St|I, S0, . . . , St−1) with a Recurrent Neural Network (RNN), where the variable number of words we condition upon up to t − 1 is expressed by a fixed length hidden state or memory ht. This memory is updated after seeing a new input xt by using a non-linear function f: ht+1 = f(ht, xt). (3)”).

Tang and Vinyals are both in the same field of endeavor of processing input signals with neural networks and are analogous. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network system of Tang and Vinyals with the RNN update of Vinyals. Doing so would lead to providing state-of-the-art performance on sequence tasks by using non-linear functions to update memories based on inputs (Vinyals, sec 3).
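
For illustration only, the state update of Vinyals eq (3), ht+1 = f(ht, xt), can be sketched in Python as follows. The choice of tanh for the non-linear function f and the scalar weights `w_h` and `w_x` are assumptions made for this sketch and are not taken from the reference.

```python
import math

def update_state(h, x, w_h=0.5, w_x=1.0):
    """One recurrent step h_next = f(h, x); f is assumed to be tanh of a weighted sum."""
    return math.tanh(w_h * h + w_x * x)

def run_rnn(inputs, h0=0.0):
    """Fold a variable-length input sequence into a fixed-size state."""
    h = h0
    for x in inputs:
        h = update_state(h, x)
    return h

# Hypothetical three-frame input sequence.
final_state = run_rnn([0.1, -0.2, 0.3])
```

However long the input sequence is, the state stays a single fixed-size value, which is the property the quoted passage relies on.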

Regarding claim 11, 
Tang and Vinyals teach claim 9.
Vinyals further teaches
updating a state of the RNN based on a feature mapping of a history of the RNN and a current time frame of the input ([figs 1-3]; [sec 3] “In particular, three gates are being used which control whether to forget the current cell value (forget gate f), if it should read its input (input gate i) and whether to output the new cell value (output gate o). The definition of the gates and cell update and output are as follows: 
[equation image media_image12.png: definitions of the gates, cell update and output]
where ⊙ represents the product with a gate value, and the various W matrices are trained parameters. Such multiplicative gates make it possible to train the LSTM robustly as these gates deal well with exploding and vanishing gradients [10]. The nonlinearities are sigmoid σ(·) and hyperbolic tangent h(·). The last equation mt is what is used to feed to a Softmax, which will produce a probability distribution pt over all words.”; “it” reads on “a current time frame of the input”. In addition, “sigmoid σ(·) and hyperbolic tangent h(·)” read on “a feature mapping of a history of the RNN” since the functions transform the history information of the RNN.).

Tang and Vinyals are both in the same field of endeavor of processing input signals with neural networks and are analogous. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network system of Tang and Vinyals with the RNN update of Vinyals. Doing so would lead to providing state-of-the-art performance on sequence tasks by using non-linear functions to update memories based on inputs (Vinyals, sec 3).
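
For illustration only, a gated update of the kind quoted above from Vinyals sec 3 can be sketched in Python on scalar values. The equations themselves appear in this action only as an image, so the sketch assumes a common LSTM form; the dictionary keys mirror the weight names Wix, Wim, Wfx, Wfm, Wox and Wom used in this action, while `cx` and `cm` are hypothetical names for the cell-candidate weights.

```python
import math

def sigmoid(z):
    """Logistic sigmoid non-linearity."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, m_prev, c_prev, W):
    """One LSTM step on scalars; W maps weight names to floats (assumed form)."""
    i_t = sigmoid(W["ix"] * x_t + W["im"] * m_prev)   # input gate
    f_t = sigmoid(W["fx"] * x_t + W["fm"] * m_prev)   # forget gate
    o_t = sigmoid(W["ox"] * x_t + W["om"] * m_prev)   # output gate
    candidate = math.tanh(W["cx"] * x_t + W["cm"] * m_prev)
    c_t = f_t * c_prev + i_t * candidate              # cell update
    m_t = o_t * c_t                                   # gated output
    return m_t, c_t

# Hypothetical weights: all zeros, so every gate evaluates to 0.5.
zero_weights = {k: 0.0 for k in ("ix", "im", "fx", "fm", "ox", "om", "cx", "cm")}
m_t, c_t = lstm_step(x_t=1.0, m_prev=0.0, c_prev=2.0, W=zero_weights)
```

The new state depends on both the current frame `x_t` and the history carried in `m_prev` and `c_prev`, which is the relationship the mapping above relies on.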

Regarding claim 12, 
Tang and Vinyals teach claim 9.
Vinyals further teaches
updating the plurality of second parameters includes updating a plurality of output weights ([sec 3] “In particular, three gates are being used which control whether to forget the current cell value (forget gate f), if it should read its input (input gate i) and whether to output the new cell value (output gate o). The definition of the gates and cell update and output are as follows: 
[equation image media_image12.png: definitions of the gates, cell update and output]
where ⊙ represents the product with a gate value, and the various W matrices are trained parameters.”; Wox and Wom read on “a plurality of output weights”.).

Tang and Vinyals are both in the same field of endeavor of processing input signals with neural networks and are analogous. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network system of Tang and Vinyals with the parameter update of Vinyals. Doing so would lead to providing state-of-the-art performance on sequence tasks by using non-linear functions to update memories based on inputs (Vinyals, sec 3).

Regarding claim 13, 
Tang and Vinyals teach claim 9.
Vinyals further teaches
initializing a plurality of input weights of the RNN and a plurality of RNN weights of the RNN randomly ([sec 3] “In particular, three gates are being used which control whether to forget the current cell value (forget gate f), if it should read its input (input gate i) and whether to output the new cell value (output gate o). The definition of the gates and cell update and output are as follows: 
[equation image media_image12.png: definitions of the gates, cell update and output]
where ⊙ represents the product with a gate value, and the various W matrices are trained parameters.”; [sec 4] “We trained all sets of weights using stochastic gradient descent with fixed learning rate and no momentum. All weights were randomly initialized.”; Wix and Wim read on “a plurality of input weights”, and Wfx and Wfm read on “a plurality of RNN weights”.).

Tang and Vinyals are both in the same field of endeavor of processing input signals with neural networks and are analogous. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network system of Tang and Vinyals with the parameter initialization of Vinyals. Doing so would lead to providing state-of-the-art performance on sequence tasks by using non-linear functions to update memories based on inputs (Vinyals, sec 3).

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Tang et al. (Knowledge Transfer Pre-training) in view of Vinyals et al. (Show and Tell: A Neural Image Caption Generator), further in view of Evermann et al. (“Predicting Process Behaviour using Deep Learning”).

Regarding claim 14, 
Tang and Vinyals teach claim 13.
However, Tang and Vinyals do not teach 


Evermann teaches
updating the plurality of second parameters includes maintaining the plurality of input weights of the RNN and a plurality of RNN weights of the RNN ([sec 3] “
[image: media_image13.png]
”; [sec 4] “Subsequent epochs maintain the weights W and biases b learned from the previous epoch but reinitialize the states for each layer and then train the net again on the entire event log.”; Note that Vinyals teaches the plurality of input weights of the RNN and a plurality of RNN weights of the RNN.).

Tang, Vinyals and Evermann are all in the same field of endeavor of processing input signals with neural networks and are analogous. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network system of Tang and Vinyals with the maintaining of the input weights and RNN weights of Evermann. Doing so would enable each epoch to train the net on the entire event log based on the backpropagation algorithm by computing the mean gradients for all parameters (Evermann, sec 4).
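
For illustration only, the claim language of claims 12-14 (random initialization of the input weights and RNN weights, maintaining them, and updating output weights) can be sketched in Python as follows. This is a hypothetical scalar example written for this purpose; it is not the method of Tang, Vinyals or Evermann, and the weight names, data values, learning rate and iteration count are all assumptions.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Randomly initialized input and recurrent weights (maintained below).
w_in = random.uniform(-1.0, 1.0)
w_rec = random.uniform(-1.0, 1.0)
w_out = 0.0  # output weight, the only parameter that is updated here

def hidden_states(inputs):
    """Run the recurrent layer and collect its state at every time step."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def mean_squared_error(states, targets, w):
    """Mean squared error of the linear readout w * h against the targets."""
    return sum((w * h - y) ** 2 for h, y in zip(states, targets)) / len(states)

inputs = [0.5, -0.3, 0.8, 0.1]     # hypothetical input frames
targets = [0.2, -0.1, 0.3, 0.0]    # hypothetical regression targets
states = hidden_states(inputs)

frozen = (w_in, w_rec)
loss_before = mean_squared_error(states, targets, w_out)
for _ in range(200):
    # Gradient of the mean squared error with respect to w_out only.
    grad = sum(2.0 * (w_out * h - y) * h
               for h, y in zip(states, targets)) / len(states)
    w_out -= 0.1 * grad
loss_after = mean_squared_error(states, targets, w_out)
```

Because only `w_out` is touched in the loop, `w_in` and `w_rec` keep their initial random values throughout training.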

Response to Arguments
Applicant's arguments filed on 04/15/2021 have been fully considered but they are not persuasive.

Applicant asserts 
“Applicant respectfully submits that independent claim 1 is patentable over Boulanger under 35 U.S.C. §102(a)(1) at least because Boulanger fails to disclose "providing an input to a first neural network, the first neural network including a plurality of first parameters; [and] providing the input to a recurrent neural network, the recurrent neural network including a plurality of second parameters,... , wherein the input provided to the first neural network and the recurrent neural network are identical values of multiple time frames" as currently recited in amended independent claim 1 (emphasis added). 
To that end, it is emphasized that, under straightforward claim interpretation, the SAME INPUT is being applied to both the first neural network and to the recurrent neural network. The same means, to anyone of ordinary skill in the art as well as a layperson, "identical" or "without difference". The claim actually recites "IDENTICAL" now to make that point clear. Modifying what is put into one of these networks as disclosed in Boulanger is not the same or identical or without difference, but instead teaches away from these claim limitations. 
For example, with reference to Figure 2(b) of Boulanger, reproduced below, Boulanger discloses that "[t]he training algorithm is slightly different than for the RTRBM since the mean-field values of the h(t) are now distinct from hˆ(t)." (Boulanger, p. 4, col. 1.) Boulanger explicitly discloses that:

[image: media_image14.png]


		<Figure 2> 
Hence, even if v(t) can be considered the "input" recited in independent claim 1, as alleged in the Office Action and which Applicant does not concede, it is clear that v(t) of the RBM of Boulanger is modified by at least bv(t) and W2 before being provided to hˆ(t) of the RNN. Indeed, Figure 2 of Boulanger clearly indicates that the first instance of v(t) of the RBM appears after the introduction of bv(t), and therefore, the input provided to the RBM and the RNN is different at least because v(t) is modified before being introduced to the RNN. The structure is different as Figure 2 of Boulanger involves a RTRBM, a RNN, and a RBM. Therefore, in no way does Boulanger disclose providing the SAME (IDENTICAL) INPUT to both the RBM and RNN, particularly given the intervening RNN and in no instance discloses "providing an input to a first neural network, the first neural network including a plurality of first parameters; [and] providing the input to a recurrent neural network, the recurrent neural network including a plurality of second parameters, ... , wherein the input provided to the first neural network and the recurrent neural network are identical values of multiple time frames" as currently recited in independent claim 1.” (Remarks, pg 7)

Examiner’s response:
The examiner respectfully disagrees. 
	
	The examiner understands the applicant’s assertion. However, as explained in “Response to Arguments” of the previous office action, the recited claim does not say that the input cannot be modified; it says only that an input is provided to each neural network. Thus, there is nothing that prevents v(t) of eq (1), eq (7) and eq (11) in the Boulanger-Lewandowski reference from reading on the claimed input. v(1) of Fig 2 starts as an input and is provided to the RNN and the RBM as the recited claim says, and eq (1), eq (7) and eq (11) clearly show that the input, v(t), is provided to each neural network.

	In addition, in the case of the RBM, bv(1) is just used as a weight for the input v(t) based on eq (1), eq (7) and eq (8), and in the case of the RNN, W2 is just used as a weight for the identical input v(t) based on eq (11). Furthermore, the last paragraph of sec 3 says “Note that equation (10) is exactly the defining equation of a single-layer RNN with hidden units 
[image: media_image4.png]
.” Just for the sake of comparison, for example, the Elman network on Wikipedia (https://en.wikipedia.org/wiki/Recurrent_neural_network#cite_note-25) calculates the hidden layer vector with the same mathematical expression as eq (11) of Boulanger-Lewandowski, and xt is just an “input vector” for the Elman network. In the same manner, v(t) is just used as an input for the RNN portion of the RNN-RBM in the Boulanger-Lewandowski reference.

For more details, see the rejections. Thus, the examiner’s rejections are reasonable and proper.

Applicant asserts 
“Claim 21 has been amended to now recite, inter alia, "wherein nonlinear analysis of the input is provided by the RNN using an RNN layer that computes a nonlinear feature map of the past time series input to the first neural network, while linear analysis of the input is provided by the first neural network." Support can be found in paragraph [0039] of the instant specification as filed. The cited references are silent regarding the newly added limitations reproduced above from claim 21. 
(Remarks, pg 8)

Examiner’s response:
The examiner respectfully disagrees. 

Boulanger-Lewandowski still teaches the recited limitation below since an RNN layer of the RNN module computes a past time series input based on the hidden unit, as follows:
	
Boulanger-Lewandowski further teaches 
nonlinear analysis of the input is provided by the RNN using an RNN layer that computes a nonlinear feature map of the past time series input to the first neural network ([fig 2]; [secs 2-4] “
[image: media_image5.png]
 is the element-wise logistic sigmoid function. … For simplicity, we consider the RBM parameters to be W, bv(t) , bh(t) (i.e. only the biases are variable) and a single-layer RNN (bottom portion of Fig. 2(b)) whose hidden units hˆ(t) are only connected to their direct predecessor hˆ(t−1) and to v(t) by the relation:  
[equation image: media_image3.png]
 (11) The RBM portion of the RNN-RBM (upper portion of Fig. 2(b)) is otherwise exactly the same as its RTRBM counterpart. This gives the single-layer RNN-RBM nine parameters: W, bv, bh, W’, W’’, hˆ(0), W2, W3, bhˆ.”; see also [sec 6] “probabilistic modeling of sequences of polyphonic music”; 
[equation image media_image3.png: eq (11)] based on the input v(t) reads on “nonlinear analysis of the input is provided by the RNN using an RNN layer that computes a nonlinear feature map of the past time series input to the first neural network” because the input is used and analyzed in RNN using an RNN layer, and because the RNN layer computes a nonlinear feature map of the past time series input to the first neural network since hˆ(t) is calculated based on hˆ(t-1), which has a past time series input, v(t-1).),
while linear analysis of the input is provided by the first neural network ([fig 2]; [secs 2-4] “An RBM is an energy-based model where the joint probability of a given configuration of the visible vector v (inputs) and the hidden vector h is: 
[equation image: media_image6.png]
 (1) where bv, bh and W are the model parameters and Z is the usually intractable partition function.”; “ 
[equation image: media_image15.png]
” with RBM reads on “linear analysis of the input is provided by the first neural network” since the input is used and analyzed in RBM based on the mathematical expression which has linear operations only.).
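
For illustration only, the point that hˆ(t) carries a past input v(t-1) through hˆ(t-1) can be sketched in Python as follows. The scalar parameters `w2`, `w3` and `b` are hypothetical stand-ins for W2, W3 and bhˆ, and the update form (a sigmoid of the weighted current input plus the weighted previous hidden value) is assumed from the description quoted above; the sketch is not taken from the reference.

```python
import math

def sigmoid(z):
    """Element-wise logistic sigmoid, applied here to a scalar."""
    return 1.0 / (1.0 + math.exp(-z))

def rnn_hidden(v_seq, w2=0.7, w3=0.4, b=0.0, h0=0.0):
    """Return the hidden value after the last frame of v_seq.

    Assumed update: h(t) = sigmoid(w2 * v(t) + w3 * h(t-1) + b).
    """
    h = h0
    for v in v_seq:
        h = sigmoid(w2 * v + w3 * h + b)
    return h

# Same current frame v(t) = 1.0, different past frame v(t-1).
h_a = rnn_hidden([0.0, 1.0])
h_b = rnn_hidden([1.0, 1.0])
```

The two results differ even though the current frame is the same, because the past frame reaches the current hidden value through the previous hidden value.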

For more details, see the rejections. Thus, the examiner’s rejections are reasonable and proper.

Applicant asserts 
“Applicant respectfully submits that independent claim 9 is patentable over Tang and Vinyals under 35 U.S.C. §103 at least because Tang and Vinyals fails to disclose "providing the input to a recurrent neural network configured as a non-linear extension of the first neural network to cooperatively process the input for a given task until a task completion providing a final output from the RNN, the recurrent neural network including  is no cooperative processing in Tang of the same (identical) input. 
With reference to Figure 1 of Vinyals, reproduced below, Vinyals discloses that "NIC, our model, is based end-to-end on a neural network consisting of a vision CNN followed by a language generating RNN. It generates complete sentences in natural language from an input image." (Vinyals, caption of Figure 1). Vinyals further discloses that "it is natural to use a CNN as an image 'encoder', [sic] by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences. (Vinyals, p. 2, sec. 1). Thus, in Vinyals, the 2 models are used for completely different tasks (video encoding versus sentence generation) and NOT a given task as recited in claim 9.
<figure>
Applicant respectfully submits that in no instance does Vinyals disclose providing the input to each of the CNN and the RNN FOR COOPERATIVE PROCESSING, LET ALONE UNTIL A TASK COMPLETION PROVIDING A FINAL OUTPUT FROM THE RNN. Rather, an input (e.g., the image) is provided to the CNN, which encodes the image and then outputs the encoded image to the RNN. Therefore, the input provided to the RNN has been manipulated by the CNN and cannot be considered the IDENTICAL input provided to the CNN. Accordingly, Applicant respectfully submits that Vinyals fails to disclose "providing an input to a recurrent neural network (RNN) including a plurality of second parameters; [and] providing the input to a first neural network, ... wherein the input provided to the first neural network and the RNN are identical" as currently recited in independent claim 9. 
Thus, Vinyals essentially teaches away from limitations of claim 9. 
Moreover, Tang does teach or suggest all of the limitations of claim 9, thus forming a combination NOT suggestive of the present features recited in claim 9. For gured as a non-linear extension of the first neural network to COOPERATIVELY PROCESS the input for a given task UNTIL TASK COMPLETION". 
In direct contrast, Tang discloses a temporary relationship of teacher/child between two separate and independent models, namely a trained simple model (teacher model) and a complex untrained model (child model), where the simple model is SHED once the complex model has been sufficiently trained to surpass the simple model (Tang, P. 2, first non-full paragraph) thus preventing the possibility of one being an extension of the other to cooperatively process the input for a given task until a task completion. There is no configuring as an actual extension, let alone, a non-linear extension, of a first network with respect to a second network, but instead a simple use of a first model to train a second model only until the second model is sufficiently trained at which point the first model is discarded. Moreover, there is not cooperatively processing until a task completion, as once the child model surpasses the teacher model, the teacher is discarded and the final output is taken from the child model.” (Remarks, pg 11)

Examiner’s response:
The examiner respectfully disagrees. 

Tang still teaches “cooperatively process the input for a given task until a task completion” because 1) DNN and RNN cooperatively process the input during training, 2) “a given task” is not elaborated with a specific phase/step and 3) the relationship between the “given task” and the “task” of “until a task completion” is not definite as rejected in “Claim Rejections - 35 USC § 112”.
	


providing the input to a recurrent neural network (RNN) configured as a non-linear extension of the first neural network to cooperatively process the input for a given task until a task completion providing a final output from the RNN, the RNN including a plurality of second parameters ([tables I-III]; [sec IV] “A DNN system is then trained with the alignment provided by the GMM system. The feature used for the DNN system is the 40-dimensional Fbanks. A symmetric 11-frame window is applied to concatenate neighboring frames, and an LDA transform is used to reduce the feature dimension to 200, which forms the DNN input. The DNN architecture involves 4 hidden layers and each layer consists of 2048 units. The output layer is composed of 2008 units, equal to the total number of Gaussian mixtures in the GMM system. … To train the RNN acoustic models, the DNN model of the baseline system is used as the teacher model. The RNN is based on the LSTM structure, where the input features are the 40-dimensional Fbanks, and the output units correspond to the Gaussian mixtures as in the DNN model. The momentum is empirically set to 0.9, and the starting learning rate is set to 0.0001 by default.”; [sec III] “Note that learning soft targets is not the ultimate goal of the model training, so a fine-tuning step is required to refine the model with the original hard targets. In this sense, the knowledge transfer learning is a pre-training step, which initializes the model parameters in such a way that the fine-tuning has a good starting point to reach a better local minimum, compared to training with hard targets from the beginning.”; [sec I] “This teacher model might be rather weak, but it is sufficient to direct the child model where to go. 
Once the teacher model helps the child model reach a reasonable place in the parameter space, the child model can learn by itself and finally finds a good local optimum, delivering a performance even better than the teacher model.”; “The RNN is based on the LSTM structure, where the input features are the 40-dimensional Fbanks” reads on “the input.” In addition, “help training complex models” and “the teacher model helps the child model reach a reasonable place in the parameter space in such a way that the fine-tuning has a good starting point to reach a better local minimum” read on “a non-linear extension of the first neural network to cooperatively process the input for a given task” since the teacher model is extended in a non-linear manner based on the student model. Note that Ba et al. (Do Deep Nets Really Need to be Deep?) teaches “knowledge transfer” and “teacher/student model”, and Hinton et al. (Distilling the Knowledge in a Neural Network) teaches the knowledge distillation algorithm in detail. Furthermore, the child model parameters read on “a plurality of second parameters”. Moreover, “the output units correspond to the Gaussian mixtures as in the DNN model” reads on “final output from the RNN”. Besides, training or testing may read on “a given task” and training may read on “task”.);

For more details, see the rejections. Thus, the examiner’s rejections are reasonable and proper.

Applicant asserts 
“The limitations of claim 22 now recited in claims 1 and 9 further strengthens the preceding argument in disclosing that "the given task is a POST-TRAINING task", where the task in Tang is a TRAINING task that trains the child model by the teacher model, in contrast to the explicit limitations recited in claims 1 and 9. That is, the child model, once trained, does not use the teacher anymore, as the teacher (simpler) model is discarded. Thus, there is no cooperative processing of an identical input in a post training task in Tang, in contrast to the limitations recited in claims 1 and 9. Wer is determined solely by the child model once trained in Tang. Even comparing the two models (baselines) separately to determine their respective outputs once the child model is trained prevents any cooperative training using both models during a post-training task. For example, as disclosed in Tang (1st paragraph under TABLE 1), "From the results, it can be observed that the RNN baseline (RNN [raw]) cannot beat the DNN baseline in terms of WER.” (Remarks, pg 13)

Examiner’s response:
The examiner respectfully disagrees. 

The examiner understands the applicant’s assertion “The limitations of claim 22 now recited in claims 1 and 9 further strengthens the preceding argument in disclosing that "the given task is a POST-TRAINING task", where the task in Tang is a TRAINING task that trains the child model by the teacher model, in contrast to the explicit limitations recited in claims 1 and 9”.

However, claims 1 and 9 recite “the given task is a post processing task” instead of “post-training task”, and how “post processing task” is related to “post-training task” is not clear from the recited claim, the remarks or the specification. Thus, for the purpose of examination, the recited limitation “the given task is a post processing task” is used.

For more details, see the rejections. Thus, the examiner’s rejections are reasonable and proper.
	
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEHWAN KIM whose telephone number is (571)270-7409.  The examiner can normally be reached on Mon - Thu 7:00 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ALEXEY SHMATOV can be reached on 571-270-3428.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR 

/S.K./Examiner, Art Unit 2123


/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126