DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the response to this office action, the Examiner respectfully requests that support be shown for language added to any original claims on amendment and any new claims. That is, indicate support for newly added claim language by specifically pointing to page(s) and line numbers in the specification and/or drawing figure(s). This will assist the Examiner in prosecuting this application.

Claim Objections
Claims 1-24 are objected to because of the following informalities: 
Claim 1 recites “a neural sequence-to-sequence seq2seq model” and then “the model” and “the seq2seq model”, wherein “the model” should be -- the neural seq2seq model -- and “the seq2seq model” should be -- the neural seq2seq model --. Claim 1 further recites “to provide the trained seq2seq model”, which should be -- to provide a trained seq2seq model --. Claims 2-20 are objected to due to their dependency on claim 1.
Claim 21 is objected to for at least reasons similar to those described for claim 1 above, since claim 21 recites deficient features similar to those recited in claim 1. For example, claim 21 recites “a pretrained neural sequence-to-sequence seq2seq model” and then recites “the seq2seq model” and “the pretrained model”, which should be -- the pretrained neural seq2seq model -- and -- the pretrained neural seq2seq model --, respectively. Claim 21 further recites “the training data”, which should be -- the stored parallel training data --.
Claim 3 further recites “the global or semantic features …” which should be -- the one or more global or semantic features …--.
Claim 7 is objected to for at least reasons similar to those described for claim 1 above, because claim 7 recites a deficient feature similar to that recited in claim 1. For example, claim 7 further recites “the seq2seq model”.
Claim 8 is objected to for at least reasons similar to those described for claim 1 above, because claim 8 recites a deficient feature similar to that recited in claim 1. For example, claim 8 further recites “the seq2seq model”.
Claim 11 further recites “the feature functions” which should be -- the one or more feature functions --.
Claim 13 further recites “the model parameters” and “the seq2seq model” which should be -- the one or more model parameters -- and -- the neural seq2seq model --, respectively.
Claim 16 is further objected to for at least reasons similar to those described for claim 13 above, since claim 16 recites a deficient feature similar to that recited in claim 13. For example, claim 16 further recites “the model parameters”.
Claim 19 further recites “the model parameters”, which should be -- the one or more model parameters --.
Claim 22 is objected to for reasons similar to those described for claim 1 above, since claim 22 recites deficient features similar to those recited in claim 1. For example, claim 22 recites “the seq2seq model” and “the model parameters”. Claims 23-24 are objected to due to their dependency on claim 22.
Claim 23 is objected to for at least reasons similar to those described for claim 7 above, because claim 23 recites a deficient feature similar to that recited in claim 7. For example, claim 23 further recites “the seq2seq model”. Claim 23 is a duplicate of claim 7, and it is therefore recommended that claim 23 be cancelled.
Claim 24 is objected to for at least reasons similar to those described for claim 13 above, because claim 24 recites deficient features similar to those recited in claim 13. For example, claim 24 further recites “the seq2seq model” and “the model parameters”. Claim 24 is a duplicate of claim 13, and it is therefore recommended that claim 24 be cancelled.
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

Claims 1-24 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which applicant regards as the invention.
Claim 1 recites “an expected loss of one or more global or semantic features or constraints between the predicted target sequences and the training target sequences given the training source sequences”, i.e., a loss of “training target sequences given the training source sequences”, and then recites “wherein the expected loss is based on one or more global or semantic features or constraints of general target sequences given general source sequences”. This is confusing because it is unclear whether “the expected loss” relates to “training target sequences given the training source sequences” or to “general target sequences given general source sequences”, which renders the claim indefinite. Claims 2-20 are rejected due to their dependency on claim 1.
Claim 3 further recites “the global or semantic features or constraints are generated using the general source sequences and the general target sequences”, whereas parent claim 1 recites “an expected loss of one or more global or semantic features or constraints between the predicted target sequences and the training target sequences given the training source sequences”, i.e., using “training target sequences” and “training source sequences”. This is confusing because it is unclear whether “the global or semantic features or constraints” are generated using “the general source sequences and the general target sequences” or using the “training source sequences” and “training target sequences”, which further renders the claim indefinite. Claim 3 further recites “wherein the generated global or semantic features or constraints are stored prior to said receiving the training data”, but parent claim 1 recites “global or semantic features or constraints between the predicted target sequences and the training target sequences given the training source sequences” and “the training data comprising a plurality of training source sequences and a corresponding plurality of training target sequences”. That is, the “global or semantic features or constraints between” these sequences cannot be generated without “receiving” “the training data”, and therefore cannot be stored “prior to said receiving the training data”. This causes confusion because it is unclear how “generated global or semantic features or constraints” that do not yet exist can be stored, which further renders the claim indefinite.
Claim 6 further recites “the tokens in the first sequence, the second sequence, and the third sequence each are of one or more types …”, which is further confusing because the phrase “each are” is grammatically inconsistent: it is unclear whether “each” of the tokens is of “one or more types” or whether the tokens collectively “are” of “one or more types”. This further renders the claim indefinite.
Claim 12 further recites “one or more feature functions comprise a function representing one of: a relative quantity of repeated tokens or repeated sets of tokens in the training source and target sequences, a relative quantity of tokens in the training source and target sequences, a relative …”, wherein “the training source and target sequences” has insufficient antecedent basis in the claim. This causes confusion because it is unclear what “the training source and target sequences” refers to, and it is unclear what the “function” comprised in “the one or more feature functions” represents, which renders the claim indefinite.
Claim 19 recites “for each of the training sequences”, wherein “the training sequences” has insufficient antecedent basis in the claim. This causes confusion because it is unclear what “the training sequences” refers to and for what the “updating” is performed, which renders the claim indefinite. Claim 19 further recites “distribution of the selected plurality of training target sequences that corresponds to the respective training sequence”, wherein “the respective training sequence” has insufficient antecedent basis in the claim. This causes confusion because it is unclear what “the respective training sequence” refers to and to what “a distribution of the selected plurality of training target sequences” corresponds, which further renders the claim indefinite. Claim 20 is rejected due to its dependency on claim 19.
Claim 21 recites “the updated model parameters for the seq2seq model with respect to a best score …”, which is confusing because it is unclear whether “the updated model parameters” refers back to “updating, … the model parameters … based on the computed total moment matching gradients to update the seq2seq model” or to “updating, … the model parameters … based on the computed CE-based gradients to update the seq2seq model”, and it is unclear whether “for the seq2seq model” refers back to “update the seq2seq model” based on the “computed total moment matching gradients” or to “update the seq2seq model” based on the “computed CE-based gradients”. This renders the claim indefinite. Claim 21 further recites “the updated model parameters for the seq2seq model with respect to a best score based on minimizing an approximation of the moment matching loss over at least a portion of the training data …”, wherein “the moment matching loss” has insufficient antecedent basis in claim 21. This causes confusion because it is unclear what “the moment matching loss” refers to and how “saving in a memory the updated model parameters for the seq2seq model with respect to a best score based on minimizing …” is performed, which further renders the claim indefinite.
Claim 22 is rejected for at least reasons similar to those described for claim 1 above, because claim 22 recites deficient features similar to those recited in claim 1. For example, claim 22 recites “…the training target sequences given the training source sequences” and “…general target sequences given general source sequences” with respect to “one or more global or semantic features or constraints”. Claims 23-24 are rejected due to their dependency on claim 22.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 1-14 and 22-24 are rejected under 35 U.S.C. 103 as being unpatentable over Sutskever et al. (“Sequence to Sequence Learning with Neural Networks”, Advances in Neural Information Processing Systems 27, 2014, p.1-9, hereinafter Sutskever, IDS) in view of Ranzato et al. (“Sequence Level Training with Recurrent Neural Networks”, ICLR, 20 November, 2015, p.1-16, hereinafter Ranzato, IDS).
Claim 1: Sutskever teaches a method (title and abstract, ln 1-21) for training a neural sequence-to-sequence seq2seq model (a model for sequence to sequence learning, i.e., seq2seq model, in fig. 1) using a processor (including GPU with C++ programmed code, Section 3.5 Parallelization, p.5), the seq2seq model being configured to receive an input source sequence (including source sentence S having sequenced words as the training set, Section 3.2 Decoding and Rescoring, p.4) and output a predicted target sequence (Ť as the most likely translation according to equation 2, Section 3.2 Decoding and Rescoring, p.4) based on one or more model parameters (model parameters Whx, Wyh, Whh, etc., Section 2 The model, p.3), the method comprising: 
receiving, by the processor, the model and training data (including the source sentence S and correct translation T, Section 3.2 Decoding and Rescoring, p.4), the training data comprising a plurality of training source sequences (the source sentence having sequenced words or characters, e.g., “ABC. WXYZ” in fig. 1) and a corresponding plurality of training target sequences (the correct translation T, Section 3.2 Decoding and Rescoring, p.4); 
generating, by the processor using the seq2seq model, a plurality of predicted target sequences corresponding to the plurality of training source sequences (generating the sequenced words or characters, e.g., WXYZ in fig. 1); 
updating the model parameters (model parameters, from initialized parameters, Section 3.4, Training details; e.g., weight Whx, Whh, Wyh, etc., Section 2 The model, inherently updated from the initialized parameters for accurate translation of a given sentence) based on a comparison of the generated plurality of predicted target sequences to the corresponding plurality of training target sequences (represented by the probability of p(T|S) to find the maximum of the probability via max p(T|S), Section 3.2 Decoding and Rescoring, p.4); and 
saving in a memory (GPU with software programmed in C++, and thus a memory is inherently present for the program, Section 3.5 Parallelization, p.5; memory utilization is achieved, Section 3.3 Reversing the Source Sentences, p.5) the updated model parameters for the seq2seq model to provide the trained seq2seq model (the trained LSTM model via the encoder and decoder, applied to long sentences with the performance shown in fig. 2, Section 3.7 Performance on long sentences and Section 3.8 Model Analysis; thus, saving the updated model or model parameters in the memory for use on long sentences is inherent), and the training target sequences given the training source sequences (the correct translation T corresponding to the given training source sequences S, Section 3.2 Decoding and Rescoring, p.4).
However, Sutskever does not explicitly teach a local loss and an expected loss of one or more global or semantic features or constraints between the predicted target sequences and the training target sequences given the training source sequences, wherein the expected loss is based on one or more global or semantic features or constraints of general target sequences, and wherein the updating of the model parameters is performed to reduce or minimize the local loss in the predicted target sequences and to minimize or reduce the expected loss.
Ranzato teaches an analogous field of endeavor by disclosing a method for training a neural sequence-to-sequence model (title and abstract, ln 1-10), wherein updating the model parameters (finding the agent or model parameters, Section 3.2.1 REINFORCE, combined with training the model parameters by minimizing the cross-entropy loss for NXENT epochs using the ground truth sequences as the claimed predicted target sequences, Section 3.2.2 MIXER) is based on a comparison of the generated plurality of predicted target sequences to the corresponding plurality of training target sequences (minimizing the cross-entropy loss XENT in equation 6 with an RNN learning network, wherein {M0, Mi, Mh, Mc} are model parameters, Section 3 MODELS, p.3, and Section 3.1.1 CROSS ENTROPY TRAINING XENT, p.4; in REINFORCE, by comparing the sequence of actions from the current policy, as the claimed predicted target sequence, against the optimal action sequence, as the claimed training target sequence, with the probability [equation image: media_image1.png], wherein Ө is the model parameters, Section 3.2.1 REINFORCE, p.6-7) to reduce or minimize a local loss in the predicted target sequences (training the RNN with the cross-entropy loss for NXENT epochs using the ground truth sequences as the claimed predicted target sequences, e.g., at the first (T-∆), (T-2∆), etc. steps, Section 3.2.2 MIXED INCREMENTAL CROSS-ENTROPY REINFORCE OR MIXER, p.7) and to minimize or reduce an expected loss of one or more global or semantic features or constraints between the predicted target sequences (represented by actions according to the current policy or model parameters in REINFORCE, Section 3.2.1 REINFORCE) and the training target sequences (represented by the optimal action sequence, Section 3.2.1 REINFORCE; negative expected reward Lθ in equation 9, p.6; minimizing the mean squared loss ||ɍt – r||² in REINFORCE of the MIXER, Section 3.2.1 REINFORCE, p.7) given training source sequences (an input word wt at time t, Section 3 MODELS, p.3), for the benefits of achieving a performance improvement by avoiding exposure bias and maximizing the probability of generating the next correct words with a relatively smaller number of samples (Section 1 INTRODUCTION, p.1-2) and significantly reducing the uncertainty of the predictions in text generation (Section 3.2.2 MIXED INCREMENTAL CROSS-ENTROPY REINFORCE OR MIXER, p.7).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied the updating of model parameters and wherein the updating of the model parameters is based on the comparison of the generated plurality of predicted target sequences to the corresponding plurality of training target sequence to reduce or minimize the local loss in the predicted target sequences and to minimize or reduce the expected loss of one or more global or semantic features or constraints between the predicted target sequences and training target sequence given training source sequences, as taught by Ranzato, to the updating of the model parameters in the method for training the neural seq2seq model using the processor, as taught by Sutskever, for the benefits discussed above.
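For illustration only, the relationship between a local loss and an expected loss discussed for claim 1 can be sketched as follows. The negative expected reward follows the form the Office action attributes to Ranzato's equation 9; the function names, the toy probabilities and rewards, and in particular the weighted sum in `combined_loss` are assumptions introduced here. Neither the claims nor Ranzato's MIXER schedule is asserted to combine the two terms in this manner.

```python
def negative_expected_reward(candidates):
    """Return -sum_i p_i * r_i over (probability, reward) pairs.

    candidates: list of (probability, reward) tuples for candidate
    sequences; the probabilities must sum to 1. All values are toy data.
    """
    total_prob = sum(p for p, _ in candidates)
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * r for p, r in candidates)


def combined_loss(local_loss, expected_loss, weight=1.0):
    """Hypothetical weighted sum of a local loss and an expected loss."""
    return local_loss + weight * expected_loss


# Three hypothetical candidate sequences with their model probabilities
# and sequence-level rewards.
toy_candidates = [(0.5, 1.0), (0.3, 0.5), (0.2, 0.0)]
expected_loss = negative_expected_reward(toy_candidates)
total_loss = combined_loss(0.4, expected_loss, weight=0.5)
```

Under these assumptions, raising the probability of high-reward candidates lowers the expected loss, while the local term is unaffected by sequence-level rewards.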
Claim 22 has been analyzed and rejected according to claim 1 above and the combination of Sutskever and Ranzato further teaches 
receiving the source sequence (Sutskever, long sentences as input sequences applied to the developed LSTM model, Section 3.7 Performance on long sentences, and Ranzato, applying the model to text summarization by taking source text as input, Section 4.1 TEXT SUMMARIZATION, p.8-9; applied to German-English translation with an English sentence 17.5 words long or a German sentence 18.5 words long as an input, Section 4.2 MACHINE TRANSLATION, p.9); 
accessing in a memory a trained neural sequence-to-sequence model configured to receive the received source sequence and output the target sequence based on one or more model parameters (Sutskever, applying the model to the English sentences and generating the sentences in table 3, p.7, and thus, inherently, the model parameters are saved and applied to the given English sentences for French translation in table 3; and Ranzato, generating target text corresponding to the source text, Section 4.1 TEXT SUMMARIZATION, and thus, inherently, the model parameters were saved and then retrieved for the text summarization application); 
generating the target sequence corresponding to the received source sequence using the trained neural sequence-to-sequence model (Sutskever, generating the French sentences corresponding to English sentences in table 3, and Ranzato, target text corresponding to the source text, Section 4.1 TEXT SUMMARIZATION); and 
outputting the generated target sequence (Sutskever, the generated French sentences corresponding to the English sentences in table 3, and Ranzato, the target text corresponding to the source text, Section 4.1 TEXT SUMMARIZATION; the English sentences are 17.5 words long and the corresponding German sentences are 18.5 words long on average, Section 4.2 MACHINE TRANSLATION); wherein the trained neural sequence-to-sequence model is trained using a processor that: receives a neural sequence-to-sequence (seq2seq) model and training data, the training data comprising a plurality of training source sequences and a corresponding plurality of training target sequences; generates, using the seq2seq model, a plurality of predicted target sequences corresponding to the plurality of training source sequences; updates the model parameters based on a comparison of the generated plurality of predicted target sequences to the corresponding plurality of training target sequences to reduce or minimize a local loss in the predicted target sequences and to minimize or reduce an expected loss of one or more global or semantic features or constraints between the predicted target sequences and the training target sequences given the training source sequences, wherein the expected loss is based on one or more global or semantic features or constraints of general target sequences given general source sequences; and saves in the memory the updated model parameters for the seq2seq model to provide the trained neural sequence-to-sequence model (the discussion in claim 1 above), for benefits similar to those discussed in claim 1 above.
Claim 2: the combination of Sutskever and Ranzato further teaches, according to claim 1 above, wherein the local loss comprises a cross-entropy (CE) loss of a predicted next token in the predicted target sequences (Sutskever, tokenized predictions for evaluating the BLEU score, and Ranzato, pre-trained with cross-entropy loss, Section 2 RELATED WORK, p.3; details in fig. 1, XENT to carry out the cross-entropy loss of the predicted p() in fig. 1, Section 3.1.1 CROSS ENTROPY TRAINING OR XENT).
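For illustration only, a cross-entropy loss of a predicted next token, as referred to in claim 2, can be sketched as the average negative log-probability that the model assigns to each reference token. The function name `next_token_cross_entropy` and the toy probabilities are assumptions introduced here and appear in neither the claims nor the cited references.

```python
import math


def next_token_cross_entropy(step_probs, target_ids):
    """Average negative log-likelihood of the reference next tokens.

    step_probs: one probability distribution (list of floats) per time step.
    target_ids: index of the reference token at each time step.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one distribution is needed per target token")
    total = 0.0
    for probs, target in zip(step_probs, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)


# Hypothetical 3-token vocabulary and a 2-step reference sequence.
toy_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
ce_loss = next_token_cross_entropy(toy_probs, [0, 1])
```

The loss is local in the sense that each term depends only on the distribution predicted at a single time step.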
Claim 3: the combination of Sutskever and Ranzato further teaches, according to claim 1 above, wherein the global or semantic features or constraints are generated using the general source sequences and the general target sequences (Sutskever, p(T|S), defined in general using the general source sequences S and the general target sequences T, Section 3.2 Decoding and Rescoring, and Ranzato, p(wt|w1, …, wT) in equation 6, Section 3.1.1 CROSS ENTROPY TRAINING XENT, p.4); and wherein the generated global or semantic features or constraints are stored prior to said receiving the training data (Sutskever, the function p() as the probability function was well defined before T and S, Section 3.2 Decoding and Rescoring, p.4, and Ranzato, p() with wt, ht+1, etc., Section 3 MODELS, p.3).
Claim 4: the combination of Sutskever and Ranzato further teaches, according to claim 1 above, wherein the neural seq2seq model uses a sequential process implemented via a neural mechanism that comprises a recurrent neural network (Sutskever, the LSTM is essentially a recurrent neural network language model, Section 1 Introduction, p.2, with an inherently sequential process from node to node within the RNN, and Ranzato, RNNs, Section 3 MODELS, p.3, and training by sequential processing in fig. 9, p.14).
Claim 5: the combination of Sutskever and Ranzato further teaches, according to claim 1 above, wherein the training source sequence comprises a first sequence of tokens, wherein the training target sequence comprises a second sequence of tokens, and wherein the predicted target sequence comprises a third sequence of tokens (Sutskever, end-of-sentence token in fig. 1, p.2; tokenized predictions and ground truth, Section 3.6 Experimental Results, and tokenized training set, Section 3.1 Dataset details, p.4; including tokenized S and T, Section 3.2 Decoding and Rescoring, p.4, and Ranzato, pre-processing training data using the tokenizer, Section 4.2 MACHINE TRANSLATION, p.9).
Claim 6: the combination of Sutskever and Ranzato further teaches, according to claim 5 above, wherein the tokens in the first sequence, the second sequence, and the third sequence each are of one or more types selected from the group consisting of words, words and characters, characters, images, and sounds (Sutskever, words in the sentence and reversed order of the words in the sentence, abstract, and Ranzato, words for Machine Translation, Section 4.2 MACHINE TRANSLATION; applied in speech recognition, Section 2 RELATED WORK, p.3; Text Summarization, Machine Translation and Image Captioning, Section 1 INTRODUCTION, p.2).
Claim 7: the combination of Sutskever and Ranzato further teaches, according to claim 1 above, wherein the seq2seq model comprises a conditional model that comprises one or more of a neural machine translation model, a captioning model, or a summarization model (Sutskever, SMT system, e.g., English to French translation task, abstract; the translation in table 3, p.7; also applied in speech recognition, Section 1 Introduction, p.1, and Ranzato, Text Summarization, Machine Translation and Image Captioning, Section 1 INTRODUCTION, p.2).
Claim 8: the combination of Sutskever and Ranzato further teaches, according to claim 1 above, wherein the seq2seq model comprises an unconditional model (Sutskever, RNN-Language Model, i.e., an unconditional model, Section 4 Related work, p.7-8; neural language model, Section 3.1 Dataset details, p.4, and Ranzato, a language model that is randomly initialized, Section 3.2.2 MIXER, p.7; a language model is inherently an unconditional model).
Claim 9: the combination of Sutskever and Ranzato further teaches, according to claim 1 above, wherein the global features encode prior knowledge or semantics about the training target sequence (Sutskever, probability distribution of [x1, …, xT] in vector ʋ as training target sequence in the equation 1, p.3 and Ranzato, p(w1, …, wT) in equations 6-7, p.4).
Claim 10: the combination of Sutskever and Ranzato further teaches, according to claim 1 above, wherein the global features are defined using one or more feature functions (Sutskever, probability function p() in equation 1, p.3, and Ranzato, similar function in equation 6-7, p.4).
Claim 11: the combination of Sutskever and Ranzato further teaches, according to claim 10 above, wherein the feature functions comprise conditional feature functions of the training target sequence given the corresponding training source sequence (Sutskever, condition in the equation 1, p.3, or generally, p(T|S) in the equation 2 of p.4, and Ranzato, the equation 7 and derived at the last paragraph of p.4).
Claim 12: the combination of Sutskever and Ranzato further teaches, according to claim 10 above, wherein the one or more feature functions comprise a function representing one of: a relative quantity of repeated tokens or repeated sets of tokens in the training source and target sequences, a relative quantity of tokens in the training source and target sequences, a relative quantity of selected attributes of one or more tokens in the training source and target sequences, a biasedness determined based upon an external evaluation of the training source and target sequences, and a presence or omission of one or more semantic features in the training source and target sequences (Sutskever, the training objective is by the equation 2 and wherein p(T|S) is calculated, wherein T is correct translation as claimed target sequences, and S is the source sentence, Section 3.2, Decoding and Rescoring, p.4, and Ranzato, p() in fig. 4, the probability of the next word is maximized, Section 3.1.1, XENT, p.4).
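For illustration only, feature functions of the kind recited in claim 12 can be sketched as simple functions of a source and target token sequence. The function names, the toy sentences, and the particular reading of “relative quantity” (a ratio for token counts and a difference for repeated tokens) are assumptions introduced here; the claims do not fix these definitions.

```python
from collections import Counter


def repeated_token_count(tokens):
    """Number of token occurrences beyond the first of each distinct token."""
    return sum(count - 1 for count in Counter(tokens).values())


def relative_length(source, target):
    """Ratio of the target token count to the source token count."""
    return len(target) / len(source)


def relative_repetition(source, target):
    """Repeated-token count of the target minus that of the source."""
    return repeated_token_count(target) - repeated_token_count(source)


# Hypothetical source sentence and a prediction that repeats a token.
toy_source = "the cat sat on the mat".split()
toy_target = "le chat chat chat est assis".split()
length_feature = relative_length(toy_source, toy_target)
repetition_feature = relative_repetition(toy_source, toy_target)
```

Each function depends on the whole sequence pair rather than on a single predicted token, which is what distinguishes such features from a per-token probability.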
Claim 13: the combination of Sutskever and Ranzato further teaches, according to claim 1 above, wherein the model parameters comprise one of: weights input to nodes of the seq2seq model, and one or more biases input to nodes of the seq2seq model (Sutskever, including the weights Whx, Whh, Wyh, Section 2 The model, p.3 and Ranzato, the weight as the difference of the mean squared loss ||rt – r||2 in the equation 11, p.7).
Claim 14: the combination of Sutskever and Ranzato further teaches, according to claim 1 above, wherein said updating comprises updating the model parameters to reduce or minimize a difference between a model average estimate based on a distribution of the corresponding predicted target sequences and an empirical average estimate based on a distribution of the corresponding training target sequences, wherein the model average estimate and the empirical average estimate are each based on a mathematical representation of the one or more global features or constraints (Sutskever, maximizing the probability of the next word generation, Section 3.2 Decoding and Rescoring, p.4, and Ranzato, including minimizing the difference between rt, representing the distribution corresponding to the training target sequences by averaging up to t+1, and r, representing the predicted target sequence, Section 3.2.1 REINFORCE, p.7; equation 11).
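For illustration only, the difference between a model average estimate and an empirical average estimate recited in claim 14 can be sketched with a scalar feature function. The function names and toy sequences are assumptions introduced here, and sequence length is used as the feature purely as an example.

```python
def mean_feature(sequences, feature):
    """Average of a scalar feature function over a list of token sequences."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    return sum(feature(seq) for seq in sequences) / len(sequences)


def moment_gap(predicted, references, feature):
    """Model average estimate minus empirical average estimate."""
    return mean_feature(predicted, feature) - mean_feature(references, feature)


# Hypothetical predicted and reference sequences; the feature is length.
toy_predicted = [["a", "b", "c", "d", "e"], ["a", "b", "c"]]
toy_references = [["a", "b", "c"], ["a", "b", "c"]]
gap = moment_gap(toy_predicted, toy_references, len)
```

A gap of zero would indicate that the predictions match the references on this feature on average, even where individual sequences differ.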
Claim 23 has been analyzed and rejected according to claims 1, 7 above.
Claim 24 has been analyzed and rejected according to claims 1, 13 above.

Claims 15-18 are rejected under 35 U.S.C. 103 as being unpatentable over Sutskever (above) in view of Ranzato (above) and Ravuri et al. (“Learning Implicit Generative Models with the Method of Learned Moments”, Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018, p.1-10, hereinafter Ravuri, IDS).
Claim 15: the combination of Sutskever and Ranzato teaches all the elements of claim 15, according to claim 1 above, including a portion of the generated plurality of predicted target sequences (Sutskever, batches of 128 sequences for the gradient, divided by the batch size of 128, Section 3.4 Training details, p.5, and Ranzato, gradient descent GD with mini-batches of size 32, Section 4 EXPERIMENTS, p.8) and a corresponding portion of the plurality of training target sequences (Sutskever, a minibatch of 128 randomly chosen training sentences having many short sentences, Section 3.4 Training details), and the one or more global features or constraints being represented by one or more conditional feature functions (Sutskever, probability function p() in equation 1, p.3, and Ranzato, similar function in equations 6-7, p.4, and the discussion in claim 10 above), except computing, by the processor, total moment matching gradients over the portion and the corresponding portion.
Ravuri teaches an analogous field of endeavor by disclosing a method (title and abstract, ln 1-27 and fig. 1) comprising computing (gradient computation for the squared error, Section 2.4 Computational Considerations, p.4) total moment matching gradients (feature averages of a sample generator model, as a model moment, matched to those of data such as music, image, and speech data, Section 1. Introduction, calculated by m(Ө) related to the feature function Φ(Ө) of a model, Section 2.1. A Review of the Method of Moments, p.2) over data (the number N of data, including data z and initial samples, for less likelihood, i.e., batch N satisfied, Section 1. Introduction), for the benefit of extending model training to applications whose likelihood models are difficult to capture, such as music, speech, and images, with more efficient operation (Section 1. Introduction).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have applied the computing of the total moment matching gradients over the data, as taught by Ravuri, to the processor and portion of the generated plurality of predicted target sequences and the corresponding portion of the plurality of training target sequences in the method, as taught by the combination of Sutskever and Ranzato, for the benefits discussed above.
Claim 16: the combination of Sutskever, Ranzato, and Ravuri further teaches, according to claim 1 above, wherein said updating comprises: computing, by the processor, total moment matching gradients over a first portion of the generated plurality of predicted target sequences (Ravuri, for a data set that generates unnatural likelihood, feature averages of a sample generator model, as a model moment, matched to those of data such as music, image, and speech data, Section 1. Introduction, calculated by m(Ө) related to the feature function Φ(Ө) of a model over the model parameters for difficult-to-capture likelihood models such as music, speech, and image, Section 2.1. A Review of the Method of Moments, p.2) and a corresponding first portion of the plurality of training target sequences (Ravuri, input data set xi by feature function Φ(Ө), Section 2.1. A Review of the Method of Moments, p.2), the one or more global features or constraints being represented by one or more conditional feature functions (Sutskever, probability function p() in equation 1, p.3, and Ranzato, similar function in equations 5-7, p.4, and the discussion in claim 10 above, and Ravuri, e.g., the moment with the feature function Φ(Ө): [equation image: media_image2.png]); 
computing, by the processor, cross-entropy-based (CE-based) gradients for a second portion of the plurality of generated predicted target sequences and a corresponding second portion of the plurality of training target sequences (Ranzato, data set w1, w2, …, wT and generated predicted data set wg1, wg2, …, wgT in equations 10-11, wherein Ө is the model parameters, Section 3.2.1 REINFORCE, p.6-7); and
updating, by the processor, the model parameters based on the computed total moment matching gradients (Ravuri, minimizing the asymptotic covariance, shown as an equation image in the original action, media_image3.png, for the optimal generator parameters, Section 2.1. A Review of the Method of Moments, p.2) and the computed CE-based gradients (Ranzato, in the RNN, {M0, Mi, Mh, Mc} as model parameters are updated by minimizing equation 6, where p() is modeled as a parametric function in equation 5, Section 3.1.1 CROSS ENTROPY TRAINING XENT, p.4, and further, in REINFORCE of the MIXER, the model parameters Ө are updated by maximizing the expected reward or minimizing the loss as the negative expected reward in equation 9, Section 3.2.1 REINFORCE, p.6-7).
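For illustration only, the final updating step recited in claim 16 can be pictured as a single parameter update that uses both gradient terms. The sketch below is a generic gradient-descent step and is not taken from the application, Ravuri, or Ranzato; the function name `sgd_step`, the learning rate `lr`, and the weighting factor `mm_weight` are hypothetical.

```python
from typing import List


def sgd_step(
    params: List[float],
    mm_grads: List[float],
    ce_grads: List[float],
    lr: float = 0.1,
    mm_weight: float = 1.0,
) -> List[float]:
    """Return parameters after one descent step on a weighted sum of gradients.

    mm_grads: moment matching gradients (assumed computed on a first portion).
    ce_grads: cross-entropy gradients (assumed computed on a second portion).
    The additive combination and mm_weight are assumptions for illustration.
    """
    return [
        p - lr * (mm_weight * g_mm + g_ce)
        for p, g_mm, g_ce in zip(params, mm_grads, ce_grads)
    ]


# Toy values chosen only to show the arithmetic.
params = [1.0, -2.0]
mm_grads = [0.5, 0.0]
ce_grads = [0.5, -1.0]
updated = sgd_step(params, mm_grads, ce_grads, lr=0.1)
```

Setting `mm_weight` to zero reduces the step to plain cross-entropy training, which is one way to see that the two gradient sources enter the update independently.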
Claim 17: the combination of Sutskever, Ranzato, and Ravuri further teaches, according to claim 16 above, wherein said computing the total moment matching gradients comprises determining a distance between expectations of the one or more global features or constraints over the first portion of generated predicted target sequences and over the corresponding first portion of training target sequences (Ravuri, by squared error loss between moments of the data and samples, representing the moment-matching objective, in equation 1, wherein the error loss, shown as an equation image in the original action, media_image4.png, serves as the claimed distance in equation 1).
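For illustration only, a distance between feature expectations of the kind discussed for claim 17 can be sketched as the squared Euclidean distance between the average feature vector of the generated sequences and that of the training sequences. This is a minimal sketch under stated assumptions, not the formula of the application or of Ravuri's equation 1; the feature function `length_and_vocab` and the helper names are hypothetical.

```python
from typing import Callable, List, Sequence

FeatureFn = Callable[[Sequence[str]], List[float]]


def mean_features(seqs: List[Sequence[str]], feature_fn: FeatureFn) -> List[float]:
    """Average the feature vectors of a batch of sequences (an empirical moment)."""
    vectors = [feature_fn(s) for s in seqs]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def moment_matching_distance(
    generated: List[Sequence[str]],
    reference: List[Sequence[str]],
    feature_fn: FeatureFn,
) -> float:
    """Squared Euclidean distance between model and data feature averages."""
    model_mean = mean_features(generated, feature_fn)
    data_mean = mean_features(reference, feature_fn)
    return sum((m - d) ** 2 for m, d in zip(model_mean, data_mean))


def length_and_vocab(seq: Sequence[str]) -> List[float]:
    """Hypothetical global features: sequence length and number of distinct tokens."""
    return [float(len(seq)), float(len(set(seq)))]


# Toy batches: generated outputs versus training targets.
generated = [["a", "b"], ["a", "a", "a", "b"]]
reference = [["a", "b", "c"], ["a", "b", "c"]]
distance = moment_matching_distance(generated, reference, length_and_vocab)
```

In this toy case the average lengths agree (3 and 3) while the average distinct-token counts differ (2 versus 3), so the distance is driven entirely by the second feature.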
Claim 18: the combination of Sutskever, Ranzato, and Ravuri further teaches, according to claim 16 above, wherein said computing the total moment matching gradients comprises: determining a score based on a difference between a model average estimate over the first portion of generated predicted target sequences and an empirical average estimate over the corresponding first portion of training target sequences (Ranzato, via BLEU score on the test set [w1, w2, …, wT] and predicted set [wg1, wg2, …, wgT] in fig. 4); and combining a CE-based gradient update over the corresponding first portion of training source sequences with the determined score (Ranzato, the reward is computed via BLEU and combined with the cross-entropy loss in MIXER to indicate the performance improvement in fig. 5, p.10).

Examiner Comments

With respect to claims 19-21, there are 112(b) issues and claim objections in the claims, which cause confusion as to the scope of each limitation. Thus, it is noted that, as best understood in view of the claim rejections under 35 USC 112(b) and the claim objections above, a prior art search has been conducted by the examiner and is recorded in the attached PTO-892 form.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LESHUI ZHANG whose telephone number is (571)270-5589.  The examiner can normally be reached on Monday-Friday 6:30am-4:00pm EST.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vivian Chin can be reached on 571-272-7848.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/LESHUI ZHANG/
Primary Examiner, Art Unit 2654