DETAILED ACTION
 This final rejection is responsive to amendments and remarks filed 15 October 2021.
Claims 1, 3, and 5-20 are amended. No claims have been added, cancelled, or withdrawn. Therefore, claims 1-20 are presently pending.

Response to Arguments
In view of the Applicant’s amendments, a previous ground of rejection of claim 10 under 35 U.S.C. § 112(b) is withdrawn. However, because the amendments did not address all of the issues with claim 10, there is still an outstanding rejection of claim 10 under 35 U.S.C. § 112(b).
Applicant’s arguments with respect to the rejection of claims 1, 11, and 16 under 35 U.S.C. § 103 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claim 10 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
Claim 10 recites the step of “an attention network for generating attention weights based on an output of the biLSTM and the attention weights.” However, it is unclear how one generates attention weights based on the attention weights. The Examiner referred to paragraph [0055] of the Specification for support in interpreting this limitation: “An attention layer 620 then generates a vector of attention weights $\alpha_t$ representing of each encoding time-step to the current decoder state based on the final encoded sequence $h$ and the context-adjusted hidden state $h_t^{dec}$.”

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 3, 8, 11-12, 15-17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yin et al. (“Neural Generative Question Answering,” 22 April 2016, arXiv:1512.01337v4[cs.CL], pp. 1-12) (“Yin”) in view of Pentina et al. (“Curriculum Learning of Multiple Tasks,” 2015, CVPR2015, pp. 5492-5500) (“Pentina”).
Regarding claim 1, Yin teaches a method for training a question answering system, the method comprising:
receiving a plurality of training samples, each of the training samples including a natural language context, a natural language question, and a natural language ground truth answer (Yin, pp. 3-4, Section 3, “Let $Q = (x_1, \ldots, x_{T_Q})$ and $Y = (y_1, \ldots, y_{T_Q})$ denote the natural language question and answer respectively. The knowledge-base is organized as a set of triples (subject, predicate, object), each denoted as $\tau = (\tau_s, \tau_p, \tau_o)$.” Yin, p. 7, Section 3.4, “given the training data $D = \{(Q^{(i)}, Y^{(i)}, T_Q^{(i)})\}$, the optimal parameters are obtained by minimizing the negative log-likelihood with regularization on all the parameters.”),
wherein the plurality of training samples contains different training samples each corresponding to training the question answering system for a different task type from a plurality of task types (Yin, p. 2, Section 1, “The model is trained on a dataset composed of real world question-answer pairs associated with triples in the knowledge-base, in which all components of the model are jointly tuned.” Yin, p. 3, Section 2.2, “To facilitate research on the task of generative QA, we create a new dataset by collecting data from the web. We first build a knowledge-base by mining from three Chinese encyclopedia web sites [example of training samples corresponding to training the question answering system for a task type]. Specifically we extract entities and associated triples (subject, predicate, object) from the structured parts (e.g. HTML tables) of the web pages at the web sites. Then the extracted data is normalized and aggregated to form a knowledge-base. In this paper we sometimes refer to an item of a triple as a constituent of knowledge-base. Second, we collect question-answer pairs by extracting from two Chinese community QA sites [example of training samples corresponding to training the question answering system for a different task type]. We automatically and heuristically construct training and test data for generative QA by ‘grounding’ the QA pairs with the triples in the knowledge-base. Specifically, for each QA pair, a list of candidate triples with the subject fields appearing in the question, is retrieved by using the Aho-Corasick string search algorithm.”);
presenting the plurality of training samples to a neural model to generate an answer (Yin, p. 7, Section 3.5, “The parameters to be learned include the weights in the RNNs for Interpreter and Answerer, parameters in Enquirer (either the matrix M or the weights in the convolution layer and MLP), and the word-embeddings which are shared by the Interpreter RNN and the knowledge-base. GENQA [a neural model], although essentially containing a retrieval operation, can be trained [presented with the training samples] in an end-to-end fashion by maximizing the likelihood of observed data, since the mixture form of probability in Answerer provides a unified way to generate words from the common vocabulary and the KB vocabulary [generate an answer].”); 
determining an error between the generated answer and the natural language ground truth answer for each training sample presented (Yin, p. 7, Section 3.4, “the optimal parameters are obtained by minimizing the negative log-likelihood [an error between the generated answer and the natural language ground truth answer] with regularization on all the parameters.” Yin, p. 7, Section 3.5, “GENQA, although essentially containing a retrieval operation, can be trained in an end-to-end fashion by maximizing the likelihood of observed data, since the mixture form of probability in Answerer provides a unified way to generate words from the common vocabulary and the KB vocabulary.” Yin, p. 8, Section 4.2, “cross-entropy loss [is] used in GENQA.”); and 
adjusting parameters of the neural model based on the error (Yin, p. 7, Section 3.4, “the optimal parameters are obtained [adjusting parameters of the neural model] by minimizing the negative log-likelihood with regularization on all the parameters.”); 
….
Yin does not explicitly disclose the method comprising:
…
wherein the plurality of training samples are presented to the neural model by: 
initially selecting a first set of training samples from the plurality of training samples according to a first training strategy that sequentially selects samples corresponding to different task types resulting in a first ordering of the first set of training samples that covers each of the plurality of task types; and
switching to selecting a second set of training samples from the plurality of training samples according to a second training strategy that mixes samples corresponding to the different task types, resulting in a second ordering of the second set of training samples that covers each of the plurality of task types.
However, Pentina discloses the method comprising:
wherein the plurality of training samples are presented to the neural model by (Pentina, p. 5497, Section 4.1, “All methods in this study use Adaptive SVM as a learning algorithm for solving the next task and differ only by how the order of tasks is defined.”): 
initially selecting a first set of training samples from the plurality of training samples according to a first training strategy that sequentially selects samples corresponding to different task types resulting in a first ordering of the first set of training samples that covers each of the plurality of task types (Pentina, p. 5495, Section 3.3, “The proposed algorithm, SeqMT, relies on the idea that all tasks can be ordered in a sequence, where each task is related to the previous one [initially selecting a first set of training samples … according to a first training strategy that sequentially selects samples]. … we propose an extension of the SeqMT model, that allows tasks to form subsequences, where the information is transferred only between the tasks within the subsequence. Our multiple subsequences version, Multi-SeqMT, also chooses tasks iteratively, but at any stage it allows the learner to choose whether to continue one of the existing subsequences or to start a new one.” Pentina, p. 5495, Algorithm 1 summarizes sequential learning of multiple tasks with SeqMT.); and
switching to selecting a second set of training samples from the plurality of training samples according to a second training strategy that mixes samples corresponding to the different task types, resulting in a second ordering of the second set of training samples that covers each of the plurality of task types (Pentina, p. 5496, Section 4.1, “In order to study how relevant the knowledge transfer actually is, we compare SeqMT with a linear SVM baseline that solves each task independently (IndSVM). As a reference, we also trained a linear SVM on data that is merged from all tasks and outputs one linear predictor for all tasks (MergedSVM) [a second training strategy that mixes samples corresponding to the different task types, resulting in a second ordering].”).
Both Yin and Pentina are directed to supervised learning tasks. While Yin discloses training neural models on a plurality of training samples each corresponding to training a question answering system for a different task type, Yin is inexplicit in disclosing training according to a first sequential and second mixed ordering of training samples. However, Pentina teaches a first sequential and a second mixed training strategy according to respective first and second orderings. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the model training in Yin to include switching the ordering of training samples, as disclosed in Pentina, to yield predictable results of training a neural network model using particular presentation orders of training samples. Further, doing so demonstrates “how relevant the knowledge transfer actually is” between different learning models (Pentina, p. 5496, Section 4.1).
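For illustration only, the two presentation orders relied upon in the combination can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the (task type, sample) tuple representation, the example task names, and the single switch point after one sequential pass are hypothetical and are not taken from Yin, Pentina, or the claims.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical representation: a sample is a (task_type, sample_id) pair.
Sample = Tuple[str, str]


def sequential_order(samples_by_task: Dict[str, List[str]]) -> List[Sample]:
    """First ordering: all samples of one task type precede the next task type."""
    order: List[Sample] = []
    for task, ids in samples_by_task.items():
        order.extend((task, sample_id) for sample_id in ids)
    return order


def mixed_order(samples_by_task: Dict[str, List[str]], seed: int = 0) -> List[Sample]:
    """Second ordering: samples of all task types are pooled and shuffled together."""
    pooled = [(task, sample_id)
              for task, ids in samples_by_task.items()
              for sample_id in ids]
    random.Random(seed).shuffle(pooled)
    return pooled


def two_phase_schedule(samples_by_task: Dict[str, List[str]], seed: int = 0) -> List[Sample]:
    """Assumed switch point: one full sequential pass, then one mixed pass."""
    return sequential_order(samples_by_task) + mixed_order(samples_by_task, seed)


# Hypothetical task types and sample identifiers.
data = {
    "question_answering": ["q1", "q2"],
    "translation": ["t1", "t2"],
    "classification": ["c1", "c2"],
}
schedule = two_phase_schedule(data)
```

In this sketch both halves of `schedule` cover every task type; only the order in which the samples are presented differs between the two phases.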

Regarding claim 3, Yin in view of Pentina teaches the method of claim 1.
Pentina further teaches the method, wherein the first training strategy is a sequential training strategy where each of the first set of training samples for a first task type are selected before selecting training samples of a second task type (Pentina, p. 5495, Section 3.3, “The proposed algorithm, SeqMT, relies on the idea that all tasks can be ordered in a sequence, where each task is related to the previous one. … we propose an extension of the SeqMT model, that allows tasks to form subsequences, where the information is transferred only between the tasks within the subsequence. Our multiple subsequences version, Multi-SeqMT, also chooses tasks iteratively, but at any stage it allows the learner to choose whether to continue one of the existing subsequences or to start a new one.” Pentina, p. 5495, Algorithm 1 summarizes sequential learning of multiple tasks with SeqMT, where each of the first set of training samples for a first task type are selected before selecting training samples of a second task type.).

Regarding claim 8, Yin in view of Pentina teaches the method of claim 1.
Pentina further teaches the method, further comprising switching to selecting the second set of training samples using the second training strategy after each of the training samples for each of the plurality of task types is presented to the neural model a predetermined number of times (Pentina, p. 5495, Algorithm 1 summarizes SeqMT, where each set of training samples is presented once for each task. Pentina, p. 5496, Section 4.1, “In order to study how relevant the knowledge transfer actually is, we compare SeqMT with a linear SVM baseline that solves each task independently (IndSVM). As a reference, we also trained a linear SVM on data that is merged from all tasks and outputs one linear predictor for all tasks (MergedSVM).”).
 
Regarding claims 11, 12, and 15: claims 11, 12, and 15 are directed to a non-transitory machine-readable medium comprising a plurality of machine-readable instructions which when executed by one or more processors associated with a computing device are adapted to cause the one or more processors to perform a method comprising steps similar to those recited in claims 1, 3, and 8, respectively. Therefore, the rejections of claims 1, 3, and 8 are applied to claims 11, 12, and 15.
Yin further teaches a non-transitory machine-readable medium comprising a plurality of machine-readable instructions which, when executed by one or more processors associated with a computing device, are adapted to cause the one or more processors to perform the method: “Our models are trained on an NVIDIA Tesla K40 GPU using Theano, with the mini-batch size of 80. The training of each model takes about two or three days.” (Yin, p. 8, Section 4.1).

Regarding claims 16, 17, and 20: claims 16, 17, and 20 are directed to a system for deep learning, the system comprising a multi-layer neural network, wherein the system is configured to perform the method recited in claims 1, 3, and 8, respectively. Therefore, the rejections of claims 1, 3, and 8 are applied to claims 16, 17, and 20.
In addition, Yin teaches a multi-layer neural network: “we propose an end-to-end neural network model for generative QA, named GENQA, which is illustrated in Figure 1.” (Yin, p. 4, Section 3 and Figure 1).

Claims 2 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Yin in view of Pentina, further in view of Luong et al. (“Multi-task Sequence to Sequence Learning,” 1 March 2016, arXiv:1511.06114v4, pp. 1-10) (“Luong”).
Regarding claim 2, Yin in view of Pentina teaches the method of claim 1.
Yin further teaches the method, wherein each of the plurality of task types is a … question answering task type (Yin, p. 3, Section 2.2, “Second, we collect question-answer pairs by extracting from two Chinese community QA sites. We automatically and heuristically construct training and test data for generative QA by ‘grounding’ the QA pairs with the triples in the knowledge-base.”).
Pentina further discloses the method, wherein each of the plurality of task types is a classification type (Pentina, p. 5496, Section 4.1, “For each class the annotation specifies ranking scores of its images from easiest to hardest. To create easy-hard tasks, we split the data in each class into five equal parts with respect to their easy-hard ranking and use these parts to create five tasks per class. … Each task is a binary classification of one of the parts against the remaining seven classes.”).
Both Yin and Pentina are directed to supervised learning tasks. While Yin discloses a question answering task type, Yin does not disclose classification or language translation task types. However, Pentina discloses a classification task type. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the plurality of task types in the combination to include a classification task type, as disclosed in Pentina, to yield predictable results of applying the disclosed methods to an application of classification. Further, one would be motivated to do so, because “learning multiple tasks sequentially can be more effective than learning them jointly” and “the order in which tasks are solved effects the overall classification performance” (Pentina, p. 5499, Section 5).
Neither Yin nor Pentina discloses the method, wherein each of the plurality of task types is a language translation task type….
However, Luong teaches the method, wherein each of the plurality of task types is a language translation task type [or] a classification task type (Luong, p. 4, Section 4, “We evaluate the multi-task learning setup on a wide variety of sequence-to-sequence tasks: constituency parsing, image caption generation, machine translation [a language translation task type], and a number of unsupervised learning as summarized in Table 1.”).
Both the combination of Yin and Pentina and the disclosure of Luong are directed to sequence-to-sequence learning in machine learning natural language applications. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the plurality of task types in the combination to include a language translation task type, as disclosed in Luong. One would be motivated to do so, because “[m]ulti-task learning (MTL) is an important machine learning paradigm that aims at improving the generalization performance of a task using other related tasks” (Luong, p. 1, Section 1).

Regarding claim 7, Yin in view of Pentina teaches the method of claim 1.
Neither Yin nor Pentina further discloses the method, wherein the first training strategy is a modified sequential training strategy where the first set of training samples are selected according to a sequential training strategy with periodic intervals where the training samples are selected according to a joint training strategy.
However, Luong teaches the method, wherein the first training strategy is a modified sequential training strategy where the training samples are selected according to a sequential training strategy with periodic intervals where the training samples are selected according to a joint training strategy (Luong, p. 4, Section 3.5, “Each parameter update consists of training data from one task only. When switching between tasks, we select randomly a new task i with probability $\frac{\alpha_i}{\sum_j \alpha_j}$ [random selection between switching tasks disclosing a modified sequential training strategy where the training samples are selected according to a sequential training strategy with periodic intervals where the training samples are selected according to a joint training strategy]. Our convention is that the first task is the reference task with $\alpha_1 = 1.0$ and the number of training parameter updates for that task is prespecified to be $N$. A typical task $i$ will then be trained for $\frac{\alpha_i}{\alpha_1} \cdot N$ parameter updates.” Luong, p. 5, Section 4.2 and Table 1, “The choice of the reference task helps specify the number of training epochs [each training epoch disclosing a training strategy] and the finetune start/cycle values.”).
Both the combination of Yin and Pentina and the disclosure of Luong are directed to sequence-to-sequence learning in machine learning natural language applications. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the first training strategy in the combination to include a modified sequential training strategy, as disclosed in Luong. One would be motivated to do so, because “[s]uch convention makes it easier … to fairly compare the same reference task in a single-task setting which has also been trained for exactly N parameter updates” (Luong, p. 4, Section 3.5).
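As a worked illustration of the mixing-ratio convention quoted from Luong, Section 3.5, the selection probability $\alpha_i / \sum_j \alpha_j$ and the update count $(\alpha_i / \alpha_1) \cdot N$ can be computed as sketched below. The task names, the ratio values, and the value of N are assumptions chosen for the example and do not appear in Luong; the helper names are likewise hypothetical.

```python
from typing import Dict


def task_probabilities(alpha: Dict[str, float]) -> Dict[str, float]:
    """Probability of selecting task i when switching: alpha_i / sum_j alpha_j."""
    total = sum(alpha.values())
    return {task: a / total for task, a in alpha.items()}


def parameter_updates(alpha: Dict[str, float], reference_task: str,
                      n_reference: int) -> Dict[str, float]:
    """Updates for task i: (alpha_i / alpha_1) * N, with task 1 the reference task."""
    alpha_ref = alpha[reference_task]
    return {task: a / alpha_ref * n_reference for task, a in alpha.items()}


# Hypothetical mixing ratios; the reference task has alpha = 1.0 by convention.
alpha = {"translation": 1.0, "parsing": 0.5, "captioning": 0.25}
probs = task_probabilities(alpha)
updates = parameter_updates(alpha, reference_task="translation", n_reference=1000)
```

With these assumed ratios, the reference task is selected with probability 1.0 / 1.75 and receives N = 1000 updates, while the task with ratio 0.5 receives 500.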

Claims 4, 6, 14, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Yin in view of Pentina, further in view of Dong et al. (“Multi-Task Learning for Multiple Language Translation,” 2015, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pp. 1723-1732) (“Dong”).
Regarding claim 4, Yin in view of Pentina teaches the method of claim 3.
Dong further teaches the method, wherein the sequential training strategy includes reselecting training samples for the first task type after selecting training samples for each of the plurality of task types (Dong, pp. 1726-1727, Section 3.3 and Figure 3, “we learn several minibatches within a fixed language pair for several mini-batch iterations and then move onto the next language pair. Our optimization procedure is shown in Figure 3.” Figure 3 shows the mini batches of training data used, where each fixed language pair corresponds to different task types; after each fixed language pair mini batch is used for learning, they are reselected for learning in the same order.).
Both the combination of Yin and Pentina and the disclosure of Dong are directed to a sequence learning problem and use of an encoder-decoder framework for natural language applications. While Yin discloses training a neural model on a set of training samples from a plurality of tasks and Pentina discloses presentation of training samples in different orders, both are inexplicit in disclosing wherein the sequential training strategy includes reselecting training samples for the first task type after selecting training samples for each of the plurality of task types. However, Dong teaches this particular sequential training strategy. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the first training strategy in Yin to include the sequential training strategy, as disclosed in Dong, to yield the predictable result of training the neural model using a particular order of training samples. 
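For illustration only, the ordering quoted above from Dong (several mini-batches within one fixed language pair, then the next pair, then a return to the first pair) may be sketched as follows. The sketch is not taken from Dong or from any other cited reference; the function name `sequential_schedule`, its parameters, and the language-pair labels are hypothetical and chosen only to show the order in which task types would be visited.

```python
def sequential_schedule(task_names, batches_per_task, num_rounds):
    """Return the order in which task types supply mini-batches.

    Each task contributes `batches_per_task` consecutive mini-batches
    before the next task is selected; after the last task, selection
    returns to the first task (hypothetical illustration).
    """
    schedule = []
    for _ in range(num_rounds):
        for task in task_names:
            # Stay within one fixed task for several mini-batch iterations.
            schedule.extend([task] * batches_per_task)
    return schedule


# Hypothetical task types standing in for fixed language pairs.
schedule = sequential_schedule(["en-fr", "en-es", "en-nl"], 2, 2)
```

Under these assumptions, the first task type is reselected only after every other task type has supplied its mini-batches.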

Regarding claim 6, Yin in view of Pentina teaches the method of claim 1.
Neither Yin nor Pentina further disclose the method, wherein the second training strategy is a joint training strategy where each of the second set of training samples are selected so that consecutively selected small groups of training samples are selected from different ones of the plurality of task types. 
However, Dong teaches the method, wherein the second training strategy is a joint training strategy where each of the second set of training samples are selected so that consecutively selected small groups of training samples are selected from different ones of the plurality of task types (Dong, pp. 1726-1727, Section 3.3 and Figure 3, “we learn several minibatches within a fixed language pair for several mini-batch iterations and then move onto the next language pair. Our optimization procedure is shown in Figure 3.” Figure 3 shows the mini batches of training data used, where each mini batch of training data is of a different task type from the last one.).
Both the combination of Yin and Pentina and the disclosure of Dong are directed to a sequence learning problem and use of an encoder-decoder framework for natural language applications. While Yin discloses training a neural model on a set of training samples from a plurality of tasks and Pentina discloses presentation of training samples in different orders, both are inexplicit in disclosing a joint training strategy where each of the training samples are selected so that consecutively selected small groups of training samples are selected from different ones of the plurality of task types. However, Dong teaches this particular joint training strategy. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the second training strategy in the combination to include the joint training strategy, as disclosed in Dong, to yield the predictable result of training the neural model using a particular order of training samples. 
  
Regarding claim 14, claim 14 is directed to a non-transitory machine-readable medium comprising a plurality of machine-readable instructions which when executed by one or more processors associated with a computing device are adapted to cause the one or more processors to perform a method comprising steps similar to those recited in claim 6. Therefore, the rejection to claim 6 is applied to claim 14.
Yin further teaches a non-transitory machine-readable medium comprising a plurality of machine-readable instructions which, when executed by one or more processors associated with a computing device, are adapted to cause the one or more processors to perform the method: “Our models are trained on an NVIDIA Tesla K40 GPU using Theano, with the mini-batch size of 80. The training of each model takes about two or three days.” (Yin, p. 8, Section 4.1).
Regarding claim 19, claim 19 is directed to a system for deep learning, the system comprising a multi-layer neural network, wherein the system is configured to perform the method recited in claim 6. Therefore, the rejection to claim 6 is applied to claim 19.
In addition, Yin teaches a multi-layer neural network: “we propose an end-to-end neural network model for generative QA, named GENQA, which is illustrated in Figure 1.” (Yin, p. 4, Section 3 and Figure 1).

Claims 5, 13, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Yin in view of Pentina, further in view of Collobert et al. (“A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning,” 2008, Proceedings of the 25th International Conference on Machine Learning, pp. 160-167) (“Collobert”).
Regarding claim 5, Yin in view of Pentina teaches the method of claim 1.
Neither Yin nor Pentina further teach the method, wherein the second training strategy is a joint training strategy where each of the second set of training samples are selected so that consecutively selected training samples are selected from different ones of the plurality of task types.
However, Collobert teaches the method, wherein the second training strategy is a joint training strategy where each of the second set of training samples are selected so that consecutively selected training samples are selected from different ones of the plurality of task types (Collobert, p. 163, Section 4.1, “Training is achieved in a stochastic manner by looping over the tasks: 1. Select the next task. 2. Select a random training example for this task. 3. Update the NN for this task by taking a gradient step with respect to this example. 4. Go to 1. It is worth noticing that labeled data for training each task can come from completely different datasets.”).
Both the combination of Yin and Pentina and the disclosure of Collobert are directed to a neural network architecture for natural language processing applications. While Yin discloses training a neural model on a set of training samples from a plurality of tasks and Pentina discloses presentation of training samples in different orders, both are inexplicit in disclosing a joint training strategy where each of the training samples are selected so that consecutively selected training samples are selected from different ones of the plurality of task types. However, Collobert teaches this particular joint training strategy. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the second training strategy in the combination to include the joint training strategy, as disclosed in Collobert, to yield the predictable result of training the neural model using a particular order of training samples. Further, “learning tasks simultaneously can improve generalization performance” (Collobert, p. 167, Section 7). 
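For illustration only, the four-step loop quoted above from Collobert (select the next task, select a random training example for that task, update the network, repeat) may be sketched as follows. The sketch is not taken from Collobert or from any other cited reference; the function name `joint_round_robin`, the task names, and the sample identifiers are hypothetical, and the gradient update itself is omitted.

```python
import random
from itertools import cycle


def joint_round_robin(samples_by_task, num_steps, seed=0):
    """Yield (task, sample) pairs by looping over the tasks in turn.

    With two or more tasks, consecutively selected samples always come
    from different task types (hypothetical illustration).
    """
    rng = random.Random(seed)
    tasks = cycle(samples_by_task)  # iterates over task names in insertion order
    for _ in range(num_steps):
        task = next(tasks)                          # 1. select the next task
        sample = rng.choice(samples_by_task[task])  # 2. random example for this task
        yield task, sample                          # 3. an update step would use this pair


# Hypothetical training data grouped by task type.
data = {
    "question_answering": ["q1", "q2"],
    "translation": ["t1", "t2"],
    "summarization": ["s1"],
}
selected = list(joint_round_robin(data, 6))
```

Under these assumptions, no two consecutive selections share a task type, which is the ordering property at issue in the claim limitation.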

Regarding claim 13, claim 13 is directed to a non-transitory machine-readable medium comprising a plurality of machine-readable instructions which when executed by one or more processors associated with a computing device are adapted to cause the one or more processors to perform a method comprising steps similar to those recited in claim 5. Therefore, the rejection to claim 5 is applied to claim 13.
Yin further teaches a non-transitory machine-readable medium comprising a plurality of machine-readable instructions which, when executed by one or more processors associated with a computing device, are adapted to cause the one or more processors to perform the method: “Our models are trained on an NVIDIA Tesla K40 GPU using Theano, with the mini-batch size of 80. The training of each model takes about two or three days.” (Yin, p. 8, Section 4.1).
Regarding claim 18, claim 18 is directed to a system for deep learning, the system comprising a multi-layer neural network, wherein the system is configured to perform the method recited in claim 5. Therefore, the rejection to claim 5 is applied to claim 18.
In addition, Yin teaches a multi-layer neural network: “we propose an end-to-end neural network model for generative QA, named GENQA, which is illustrated in Figure 1.” (Yin, p. 4, Section 3 and Figure 1). 

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Yin in view of Pentina, further in view of Vinyals et al. (“Order Matters: Sequence to sequence for sets,” 23 February 2016, arXiv:1511.06391v4 [stat.ML], pp. 1-11) (“Vinyals”).
Regarding claim 9, Yin in view of Pentina teaches the method of claim 1.
Yin further teaches the method, further comprising … monitoring of performance metrics associated with each of the plurality of task types (Yin, p. 9, Section 4.3, “Among the two GENQA variants, GENQACNN achieves the best accuracy, getting over half of the questions right. An explanation for that is that the convolution layer helps to capture salient features in matching. The experiment results demonstrate the ability of GENQA models to find the right answers from the KB even with regard to new facts. For example, to the example question mentioned above, GENQA gives the correct answer ‘He plays for Spain’.”).
Neither Yin nor Pentina disclose the method, further comprising switching to selecting the second set of training samples using the second training strategy based on monitoring of performance metrics….
However, Vinyals teaches the method, further comprising switching to selecting the second set of training samples using the second training strategy based on monitoring of performance metrics (Vinyals, p. 9, Section 5.2, “We pretrain the model with a uniform prior over π(X) for 1000 steps, which amounts to replacing the maxπ(Xi) in eq. (9) by a ∑π(Xi). We then pick an ordering [switching to selecting the training samples using the second training strategy] by sampling π(X) according to a distribution proportional to p(Yπ(X)|X) [based on monitoring of performance metrics]. This costs O(1) model evaluations (vs. naive search which would be O(n!)).”).
Both the combination of Yin and Pentina and the disclosure of Vinyals are directed to a sequence to sequence and encoder-decoder framework in a recurrent neural network architecture for natural language applications. While Yin discloses training a neural model on a set of training samples from a plurality of tasks and monitoring of performance metrics associated with each of the plurality of task types and Pentina discloses presentation of training samples in different orders, both are inexplicit in disclosing switching to selecting the training samples using the second training strategy based on monitoring of performance metrics. However, Vinyals teaches this particular switching condition. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify switching of training strategies in the combination to include a switching condition, as disclosed in Vinyals, to yield the predictable result of training the neural model using two different orders of presenting training samples based on a switching condition. Further, doing so allows the model to “decide which is the best ordering” and “to explore the space of all orderings” (Vinyals, p. 8, Section 5.2).

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Yin in view of Pentina, further in view of Bahdanau et al. (“Neural Machine Translation by Jointly Learning to Align and Translate,” 24 April 2015, arXiv:1409.0473v6 [cs.CL], pp. 1-15) (“Bahdanau”).
Regarding claim 10, Yin in view of Pentina teaches the method of claim 1.
Yin further teaches the method, wherein the neural model comprises:
an input layer for encoding first words from the context and second words from the question (Yin, p. 4, Section 3.1, “Given the question represented as word sequence Q = (x1, . . . , xTQ), Interpreter encodes it to an array of vector representations.” Yin, pp. 5-6, Section 3.2, “In this work, we provide two implementations for Enquirer to calculate the matching scores between question and triples [the context],” both of the implementations taking “the average of the embeddings of the subject and predicate as the representation of the triple (denoted as u_T).”);
a self-attention based transformer comprising an encoder and a decoder (Yin, p. 6, Section 3.3 and Figure 3, “Answerer [a self-attention based transformer comprising an encoder and a decoder] uses an RNN to generate an answer based on the information of question saved in the short-term memory (represented as H_Q) and the relevant facts retrieved from the long-term memory (indexed by r_Q), as illustrated in Figure 3. The probability of generating the answer Y = (y_1, y_2, …, y_{T_y}) is defined as [see corresponding equation in Section 3.3],” and the “conditional probability in the RNN model (with hidden states s_1, …, s_{T_y}) is specified by [see corresponding equation in Section 3.3]. … In generating common words, Answerer acts in the same way as the decoder of RNN in [1] with information from H_Q selected by the attention model. Specifically, the hidden state at t step is computed as s_t = f_s(y_{t-1}, s_{t-1}, c_t) and p(y_t | y_{t-1}, s_t, H_Q, z_t = 0; θ) = f_y(y_{t-1}, s_t, c_t), where c_t is the context vector computed as a weighted sum of the hidden states stored in the short-term memory H_Q.”);
…
a vocabulary layer for generating a distribution over third words in a vocabulary based on the attention weights (Yin, p. 5, Section 3.2, “For question Q, the scores are represented in a KQ-dimensional vector rQ [corresponds to the attention weights, as taught below by Bahdanau] where the kth element of rQ is defined as the probability [see Equation for rQ].” Yin, p. 6, Section 3.3, “In generating the tth word y_t in the answer, the probability is given by the following mixture model p(y_t | y_{t-1}, s_t, H_Q, r_Q; θ) = p(z_t = 0 | s_t; θ) p(y_t | y_{t-1}, s_t, H_Q, z_t = 0; θ) + p(z_t = 1 | s_t; θ) p(y_t | r_Q, z_t = 1; θ) [p(y_t | y_{t-1}, s_t, H_Q, z_t = 0; θ) discloses a distribution over third words in a vocabulary], which sums the contributions from the “language” part [vocabulary] and the “knowledge” part, with the coefficient p(z_t | s_t; θ) being realized by a logistic regression model with s_t as input. Here the latent variable z_t indicates whether the tth word is generated from a common vocabulary (for z_t = 0) or a KB vocabulary (z_t = 1). … In generating common words, Answerer acts in the same way as the decoder of RNN in [1] with information from
                            
                                
                                    H
                                
                                
                                    Q
                                
                            
                        
                     selected by the attention model. Specifically, the hidden state at t step is computed as                         
                            
                                
                                    s
                                
                                
                                    t
                                
                            
                            =
                            
                                
                                    f
                                
                                
                                    s
                                
                            
                            (
                            
                                
                                    y
                                
                                
                                    t
                                    -
                                    1
                                
                            
                            ,
                            
                                
                                    s
                                
                                
                                    t
                                    -
                                    1
                                
                            
                            ,
                            
                                
                                    c
                                
                                
                                    t
                                
                            
                            )
                        
                     and                         
                            p
                            
                                
                                    
                                        
                                            y
                                        
                                        
                                            t
                                        
                                    
                                
                                
                                    
                                        
                                            y
                                        
                                        
                                            t
                                            -
                                            1
                                        
                                    
                                    ,
                                    
                                        
                                            s
                                        
                                        
                                            t
                                        
                                    
                                    ,
                                    
                                        
                                            H
                                        
                                        
                                            Q
                                        
                                    
                                    ,
                                    
                                        
                                            z
                                        
                                        
                                            t
                                        
                                    
                                    =
                                    0
                                    ;
                                    θ
                                
                            
                            =
                            
                                
                                    f
                                
                                
                                    y
                                
                            
                            (
                            
                                
                                    y
                                
                                
                                    t
                                    -
                                    1
                                
                            
                            ,
                            
                                
                                    s
                                
                                
                                    t
                                
                            
                            ,
                            
                                
                                    c
                                
                                
                                    t
                                
                            
                            )
                        
                    , where                         
                            
                                
                                    c
                                
                                
                                    t
                                
                            
                        
                     is the context vector computed as a weighted sum of the hidden states stored in the short-term memory                         
                            
                                
                                    H
                                
                                
                                    Q
                                
                            
                        
                    .“); 
a context layer for generating a distribution over the first words from the context based on the attention weights (Yin, p. 5, Section 3.2, “For question Q, the scores are represented in a $K_Q$-dimensional vector $r_Q$ [corresponds to the attention weights, as taught below by Bahdanau] where the kth element of $r_Q$ is defined as the probability [see Equation for $r_Q$].” Yin, p. 6, Section 3.3, “In generating the tth word $y_t$ in the answer [each subsequent generation depends on the attention weights], the probability is given by the following mixture model $p(y_t \mid y_{t-1}, s_t, H_Q, r_Q; \theta) = p(z_t = 0 \mid s_t; \theta)\, p(y_t \mid y_{t-1}, s_t, H_Q, z_t = 0; \theta) + p(z_t = 1 \mid s_t; \theta)\, p(y_t \mid r_Q, z_t = 1; \theta)$ [$p(y_t \mid r_Q, z_t = 1; \theta)$ discloses a distribution over the first words from the context], which sums the contributions from the “language” part [vocabulary] and the “knowledge” part, with the coefficient $p(z_t \mid s_t; \theta)$ being realized by a logistic regression model with $s_t$ as input. Here the latent variable $z_t$ indicates whether the tth word is generated from a common vocabulary (for $z_t = 0$) or a KB vocabulary ($z_t = 1$).”); and
a switch for:     
generating a weighting between the distribution over the third words from the vocabulary and the distribution over the first words from the context (Yin, p. 6, Section 3.3, “In generating the tth word $y_t$ in the answer, the probability is given by the following mixture model $p(y_t \mid y_{t-1}, s_t, H_Q, r_Q; \theta) = p(z_t = 0 \mid s_t; \theta)\, p(y_t \mid y_{t-1}, s_t, H_Q, z_t = 0; \theta) + p(z_t = 1 \mid s_t; \theta)\, p(y_t \mid r_Q, z_t = 1; \theta)$, which sums the contributions from the “language” part [vocabulary] and the “knowledge” part [context], with the coefficient $p(z_t \mid s_t; \theta)$ [a weighting] being realized by a logistic regression model with $s_t$ as input. Here the latent variable $z_t$ [a switch] indicates whether the tth word is generated from a common vocabulary (for $z_t = 0$) or a KB vocabulary ($z_t = 1$).”);
generating a composite distribution based on the weighting of the distribution over the third words from the vocabulary and the distribution over the first words from the context (Yin, p. 6, Section 3.3, “In generating the tth word $y_t$ in the answer, the probability is given by the following mixture model [a composite distribution based on the weighting]
                            p
                            (
                            
                                
                                    y
                                
                                
                                    t
                                
                            
                            |
                            
                                
                                    y
                                
                                
                                    t
                                    -
                                    1
                                
                            
                            ,
                            
                                
s_t, H_Q, r_Q; θ) = p(z_t = 0 | s_t; θ) p(y_t | y_{t-1}, s_t, H_Q, z_t = 0; θ) + p(z_t = 1 | s_t; θ) p(y_t | r_Q, z_t = 1; θ), which sums the contributions from the “language” part and the “knowledge” part, with the coefficient p(z_t | s_t; θ) [a weighting] being realized by a logistic regression model with s_t as input. Here the latent variable z_t indicates whether the tth word is generated from a common vocabulary (for z_t = 0) or a KB vocabulary (z_t = 1).”); and 
selecting a word for inclusion in an answer using the composite distribution (Yin, p. 4, Section 3, “The Answerer feeds on the question representation H_Q (through the Attention Model) as well as the vector r_Q and generates an answer with Generator.” Yin, p. 6, Section 3.3, “answer Y = (y_1, y_2, …, y_{T_y}). … In generating common words, Answerer acts in the same way as the decoder of RNN in [1] with information from H_Q selected by the attention model. Specifically, the hidden state at t step is computed as s_t = f_s(y_{t-1}, s_{t-1}, c_t) and p(y_t | y_{t-1}, s_t, H_Q, z_t = 0; θ) = f_y(y_{t-1}, s_t, c_t), where c_t is the context vector computed as a weighted sum of the hidden states stored in the short-term memory H_Q. In generating KB-words [knowledge-base words] via p(y_t | r_Q, z_t = 1; θ), Answerer simply employs the model p(y_t | r_Q, z_t = 1; θ) = r_{Q_k}.”). 
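For illustration only, the composite distribution quoted from Yin can be sketched as a gated mixture of two word distributions. This is a minimal sketch under stated assumptions, not Yin's implementation: the names `composite_distribution`, `select_word`, and `gate_logit` are hypothetical, the scalar `gate_logit` stands in for the logistic regression input computed from s_t, and the two input distributions are supplied directly rather than produced by a decoder or knowledge base.

```python
import math


def sigmoid(x):
    """Logistic function, used here as the gate p(z_t = 1 | s_t)."""
    return 1.0 / (1.0 + math.exp(-x))


def composite_distribution(p_common, p_kb, gate_logit):
    """Mix a common-vocabulary and a KB-vocabulary distribution.

    p_common   : dict word -> p(y_t | ..., z_t = 0)   (assumed given)
    p_kb       : dict word -> p(y_t | r_Q, z_t = 1)   (assumed given)
    gate_logit : hypothetical scalar standing in for the logistic
                 regression score computed from the hidden state s_t
    """
    p_z1 = sigmoid(gate_logit)   # weight of the "knowledge" part
    p_z0 = 1.0 - p_z1            # weight of the "language" part
    words = set(p_common) | set(p_kb)
    return {
        w: p_z0 * p_common.get(w, 0.0) + p_z1 * p_kb.get(w, 0.0)
        for w in words
    }


def select_word(dist):
    """Pick the most probable word under the composite distribution."""
    return max(dist, key=dist.get)


# Toy example: the gate weighs both parts equally (gate_logit = 0).
mixed = composite_distribution(
    {"the": 0.6, "is": 0.4},
    {"Paris": 0.9, "Lyon": 0.1},
    gate_logit=0.0,
)
chosen = select_word(mixed)
```

Because each input distribution sums to one and the two gate weights sum to one, the mixture is itself a probability distribution over the union of the two vocabularies.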
Neither Yin nor Pentina appears to disclose the method, wherein the neural model comprises:
…
a bi-directional long-term short-term memory (biLSTM) for further encoding an output of the encoder; 
a long-term short-term memory (LSTM) for generating a context-adjusted hidden state from an output of the decoder and a hidden state; [and]
an attention network for generating attention weights based on an output of the biLSTM and the attention weights 
….
However, Bahdanau teaches the method, wherein the neural model comprises:
a bi-directional long-term short-term memory (biLSTM) for further encoding an output of the encoder (Bahdanau, p. 4, Section 3.2, “we would like the annotation of each word to summarize not only the preceding words, but also the following words. Hence, we propose to use a bidirectional RNN. … This sequence of annotations is used by the decoder and the alignment model later to compute the context vector.” Bahdanau, p. 12, Section A.1.1, “It is therefore possible to use LSTM units instead of the gated hidden unit described here.”); 
a long-term short-term memory (LSTM) for generating a context-adjusted hidden state from an output of the decoder and a hidden state (Bahdanau, p. 3, Section 3.1, “we define each conditional probability in Eq. (2) as: [see Equation (4)], where s_i is an RNN hidden state for time i, computed by s_i = f(s_{i-1}, y_{i-1}, c_i). … The context vector c_i depends on a sequence of annotations … to which an encoder maps the input sentence.” Bahdanau, p. 12, Section A.1.1, “It is therefore possible to use LSTM units instead of the gated hidden unit described here.”); [and]
an attention network for generating attention weights based on an output of the biLSTM and the attention weights (Bahdanau, p. 3, Section 3.1, “The context vector c_i is, then, computed as a weighted sum of these annotations h_i: [see Equation (5)]. The weight … of each annotation … is computed by [see Equation (6)] where e_{ij} = a(s_{i-1}, h_j) is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i-1} (just before emitting y_i, Eq. (4)) and the j-th annotation h_j of the input sentence.”).
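For illustration only, the attention computation quoted from Bahdanau (alignment scores, normalized weights, and a context vector formed as a weighted sum of annotations) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the dot-product `dot_score` is a stand-in for the alignment model a(s_{i-1}, h_j), which Bahdanau describes as a learned model rather than a fixed dot product.

```python
import math


def softmax(scores):
    """Normalize alignment scores e_ij into attention weights."""
    m = max(scores)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_score(prev_state, annotation):
    """Stand-in alignment model (assumption): a plain dot product."""
    return sum(a * b for a, b in zip(prev_state, annotation))


def attention(prev_state, annotations, score=dot_score):
    """Return (weights, context) for one decoding step.

    prev_state  : decoder hidden state s_{i-1}
    annotations : encoder annotations h_1 ... h_T
    """
    energies = [score(prev_state, h) for h in annotations]
    weights = softmax(energies)
    dim = len(annotations[0])
    # Context vector c_i: weighted sum of the annotations.
    context = [
        sum(w * h[d] for w, h in zip(weights, annotations))
        for d in range(dim)
    ]
    return weights, context


# Toy example with two 2-dimensional annotations.
annotations = [[1.0, 0.0], [0.0, 1.0]]
uniform_weights, uniform_context = attention([0.0, 0.0], annotations)
skewed_weights, _ = attention([1.0, 0.0], annotations)
```

With a zero decoder state both scores are equal, so the weights are uniform; a decoder state aligned with the first annotation shifts weight toward it.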
Both the combination of Yin and Pentina and the disclosure of Bahdanau are directed to a recurrent neural network encoder-decoder for natural language applications. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the recurrent neural network in the combination to include LSTM units, as disclosed in Bahdanau. One would be motivated to do so, because the gated hidden unit in the RNN “is similar to a long short-term memory (LSTM) unit proposed earlier … sharing with it the ability to better model and learn long-term dependencies,” and these computation paths in the unfolded RNN “allow gradient to flow backward easily without suffering too much from the vanishing effect”; therefore, it is “possible to use LSTM units instead of the gated hidden unit described here” (Bahdanau, p. 12, Section A.1.1).
  

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Bengio et al. ("Curriculum Learning," 2009, Proceedings of the 26th International Conference on Machine Learning, 8 pages) compares a no curriculum setting (random ordering) with a curriculum setting in which examples are ordered by easiness (Bengio et al., p. 4, Section 4.2).
Graves et al. ("Automated Curriculum Learning for Neural Networks," 2017, Proceedings of the 34th International Conference on Machine Learning, 10 pages) teaches maximizing learning efficiency of a neural network through curriculum learning and discloses a multiple tasks setting (Graves et al.; p. 1, Abstract; p. 2, Section 2.1).

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CATHERINE F LEE whose telephone number is (571)270-7487. The examiner can normally be reached Monday thru Friday, 10:00AM-6:00PM PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/C.F.L./Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124