Detailed Action
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 09/20/2021 has been entered.

Status of Claims
The following claim(s) is/are pending in this office action: 1-6, 8-9, 11-16, and 18-20
The following claim(s) is/are amended: 1, 3-5, 11, 13-15, and 20
The following claim(s) is/are new: None
The following claim(s) is/are cancelled: 7, 10, and 17
Claim(s) rejected: 1-6, 8-9, 11-16, and 18-20

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5, 8-9, 11-15, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Ranzato et al. ("Sequence level training with recurrent neural networks", hereinafter "Ranzato" – IDS) in view of Bahdanau et al. ("An actor-critic algorithm for sequence prediction", hereinafter "Bahdanau" – IDS).

Regarding claim 1, Ranzato teaches a computer-implemented method (Section 1 Para 2: “This process is very brittle because the model was trained on a different distribution of inputs, namely, words drawn from the data distribution, as opposed to words drawn from the model distribution.” A computer-implemented method that requires training is used. Input data and a data distribution are used, which implies that a computer readable storage medium is also used.) comprising: selecting a word for use as a next word in a text stream (Section 3.2.1 Para 1: “In the sequence generation setting, an action refers to predicting the next word in the sequence at each time step”. Section 1 Para 3: “we build on the REINFORCE algorithm proposed by Williams.”) using natural language processing (NLP) techniques (Abstract: “Many natural language processing applications use language models to generate text. These models are typically trained to predict the next word in a sequence, given the previous words and some context such as an image.”) in the context of image captioning (Page 2 Para 2: “Mixed Incremental Cross-Entropy Reinforce (MIXER), which is our first major contribution of this work. MIXER is an easy-to-implement recipe to make REINFORCE work well for text generation applications.” Section 4.3 Para 1: “For the image captioning task, we use the MSCOCO dataset.” The REINFORCE algorithm and MIXER are used in the context of text generation for an image.);
determining, by an algorithm, an expected future reward value for the word using a test policy including a training policy and a test-time inference procedure (Abstract: “However, at test time the model is expected to generate the entire sequence from scratch. This discrepancy makes generation brittle, as errors may accumulate along the way. We address this issue by proposing a novel sequence level training algorithm that directly optimizes the metric used at test time, such as BLEU or ROUGE.” In Ranzato, the test policy refers to generating the entire sequence from scratch during testing, the training policy refers to the training algorithm, and the test-time inference refers to optimizing the BLEU or ROUGE metric.), with the training policy being used to minimize a negative expected future reward for the word (Section 3.2.1 Para 1: “The goal of training is to find the parameters of the agent that maximize the expected reward. We define our loss as the negative expected reward…” Maximizing the expected reward results in minimizing the negative expected reward. The negative expected reward is also called the loss. In addition, Section 3.1.1 Para 1 states that training involves minimizing the loss: “Cross-entropy loss (XENT) maximizes the probability of the observed sequence according to the model. If the target sequence is [w1, w2, …, wT], then XENT training involves minimizing L…”).
However, Ranzato does not explicitly teach normalizing, through the use of a self-critical sequence training (SCST) algorithm, a set of expected future reward estimate(s) using the expected future reward value for the sampled word using the test policy, with the normalization of the set of expected future reward estimate(s) utilizing the output of the test-time inference procedure in order to reduce variance.
Bahdanau, however, teaches normalizing, through the use of a self-critical sequence training (SCST) algorithm, a set of expected future reward estimate(s) (Page 5 Section 3: “REINFORCE uses the cumulative reward $\sum_{\tau=t}^{T} \gamma_{\tau}(\hat{y}_{\tau}; \hat{Y}_{1\ldots\tau-1})$ following the action $\hat{y}_{t}$, which again can be seen as a 1-sample estimate of Q. Due to these simplifications and the potential high variance in the cumulative reward, the REINFORCE gradient estimator has very high variance. In order to improve upon it, we consider the actor-critic estimate from Equation 8, which has a lower variance at the cost of significant bias, since the critic is not perfect and trained simultaneously with the actor. The success depends on our ability to control the bias by designing the critic network and using an appropriate training criterion for it.” The reward estimate shows high variance when a regular RNN is used. This reward is normalized by reducing variance using the actor-critic algorithm. Fig. 1 shows the working of the actor-critic algorithm for word sequence prediction. The actor receives a word sequence and produces samples. BLEU is used to score the expected reward. Per Spec Para 0063, the SCST algorithm is a form of the REINFORCE algorithm. Per Spec Para 0070, SCST is an actor-critic type algorithm.) using the expected future reward value for the sampled word using the test policy (Page 5 Last Para: “A crucial component of our approach is policy evaluation, that is the training of the critic to produce useful estimates of $\hat{Q}$. With a naïve Monte-Carlo method, one could use the future return $\sum_{\tau=t}^{T} \gamma_{\tau}(\hat{y}_{\tau}; \hat{Y}_{1\ldots\tau-1})$ as a target to $\hat{Q}(\hat{y}_{t}; \hat{Y}_{1\ldots\tau-1})$.” Page 2 Para 1: “In this work, we propose and study an alternative procedure for training sequence prediction networks that aims to directly improve their test time metrics.” Section 5.2 Para 2: “The return is defined as a smoothed and rescaled version of the BLEU score. Specifically, we start all n-gram counts from 1 instead of 0, and multiply the resulting score by the length of the ground-truth translation. Smoothing is a common practice when sentence-level BLEU score is considered.” The test policy, or policy evaluation, is also performed by training the algorithm to improve the reward value at test time. The future return is a BLEU score (or reward value) at the sentence level.),
with the normalization of the set of expected future reward estimate(s) utilizing the output of the test-time inference procedure in order to reduce variance (Section 5 Last Para: “Specifically, we use the $\hat{Q}$ estimates produced by the critic network as the baseline for the REINFORCE algorithm. The motivation behind this approach is that using the ground-truth output should produce a better baseline that lowers the variance of REINFORCE, resulting in higher task-specific scores. We refer to this method as REINFORCE-critic.” Page 2 Para 1: “In this work, we propose and study an alternative procedure for training sequence prediction networks that aims to directly improve their test time metrics.” Normalization (which is a process of determining a reward value) is discussed in the previous section. Bahdanau utilizes the ground-truth output in the critic algorithm to reduce variance at test time. Thus, besides normalization, Bahdanau’s actor-critic algorithm also minimizes the output variance.)
(In Section 6, Bahdanau discusses the test error obtained using the actor-critic algorithm. Section 6 Para 1 states: “we were able to significantly reduce the gap in the training speed and achieve a better test error using our critic network as the baseline for REINFORCE.” Further, in Section 6 Para 3, Bahdanau proposes suitable training criteria to obtain a stable or well-behaved gradient, i.e., to harmonize the gradient: “The critic would sometimes assign very high values to actions with a very low probability according to the actor. We were able to resolve this by penalizing the critic’s variance… For example, one can look for suitable training criteria that have a well-behaved gradient even when the policy has little or no stochasticity.”)
Before the effective filing date of the invention it would have been obvious to one of ordinary skill in the art to combine the method of Ranzato with the normalization process of Bahdanau in order to reduce the variance (Bahdanau, Section 3, Section 5).
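
For illustration only, the variance-reduction concept discussed above may be sketched as follows. This is a minimal, single-step sketch under stated assumptions and is not taken from Ranzato, Bahdanau, or the application: the three-word vocabulary, the `logits`, and the `rewards` are hypothetical values, and the baseline is assumed to be the reward of the greedily decoded word (i.e., the output of a test-time inference procedure), which is one way to read the normalization recited in claim 1. The sketch compares the empirical variance of the REINFORCE gradient estimate with and without that baseline.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution over words."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_samples(logits, rewards, baseline, n_samples, seed=0):
    """Draw single-sample REINFORCE gradient estimates w.r.t. the logits.

    Each estimate is (reward - baseline) * d/dlogits log p(sampled word).
    """
    rng = random.Random(seed)
    probs = softmax(logits)
    actions = list(range(len(probs)))
    samples = []
    for _ in range(n_samples):
        a = rng.choices(actions, weights=probs)[0]
        advantage = rewards[a] - baseline
        # Gradient of log softmax at the sampled word: onehot(a) - probs.
        grad_log_p = [(1.0 if i == a else 0.0) - p for i, p in enumerate(probs)]
        samples.append([advantage * g for g in grad_log_p])
    return samples


def total_variance(samples):
    """Sum of per-component empirical variances of the gradient estimates."""
    n = len(samples)
    dim = len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dim)]
    return sum((s[d] - means[d]) ** 2 for s in samples for d in range(dim)) / n


# Hypothetical toy values (assumptions, not from the cited references).
logits = [0.2, 0.0, -0.1]
rewards = [10.0, 11.0, 12.0]

# Assumed baseline: reward of the greedy (test-time) choice.
probs = softmax(logits)
greedy_word = max(range(len(probs)), key=lambda i: probs[i])
greedy_baseline = rewards[greedy_word]

var_plain = total_variance(gradient_samples(logits, rewards, 0.0, 2000))
var_baseline = total_variance(
    gradient_samples(logits, rewards, greedy_baseline, 2000)
)
print(var_plain, var_baseline)
```

Because the baseline does not depend on the sampled word, subtracting it leaves the expected gradient unchanged while shrinking the magnitude of each individual estimate, which is why the second variance is smaller in this toy setting.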

Regarding claim 2, Ranzato and Bahdanau teach the method of claim 1.
Bahdanau also teaches wherein the test-time inference procedure is utilized only in the normalization of the set of expected future reward estimate(s) (Page 5 Section 3: “REINFORCE uses the cumulative reward $\sum_{\tau=t}^{T} \gamma_{\tau}(\hat{y}_{\tau}; \hat{Y}_{1\ldots\tau-1})$ following the action $\hat{y}_{t}$, which again can be seen as a 1-sample estimate of Q. Due to these simplifications and the potential high variance in the cumulative reward, the REINFORCE gradient estimator has very high variance. In order to improve upon it, we consider the actor-critic estimate from Equation 8, which has a lower variance at the cost of significant bias, since the critic is not perfect and trained simultaneously with the actor.” The actor-critic algorithm is used to normalize the reward estimate by reducing the output variance. Section 1 Para 3: “In this work, we propose and study an alternative procedure for training sequence prediction networks that aims to directly improve their test time metrics…” The actor-critic method proposed in this study aims to improve test time metrics (during test-time inference). The actor-critic method also normalizes the estimate by determining a reward score or value at test time.).
Same motivation to combine the teaching of Ranzato and Bahdanau as claim 1.

Regarding claim 3, Ranzato and Bahdanau teach the method of claim 1.
Ranzato further teaches wherein the test-time inference procedure involves performing a beam search. (Section 6.3 first para: “At test time we can reduce the effect of search error by pursuing not only one but k next word candidates at each point, which is commonly known as beam search.”).
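
For illustration only, the beam search described in the quoted passage (pursuing k next word candidates at each point) may be sketched as follows. This is a minimal sketch under stated assumptions: `toy_model` is a hypothetical next-word distribution invented for this example, and neither it nor the function names come from Ranzato or the application.

```python
import math


def beam_search(step_log_probs, beam_width, max_len):
    """Keep the beam_width highest-scoring partial sequences at each step.

    step_log_probs(seq) must return a dict mapping each candidate next word
    to its log-probability given the partial sequence seq (a tuple).
    Returns a list of (score, sequence) pairs, best first.
    """
    beams = [(0.0, ())]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for word, logp in step_log_probs(seq).items():
                candidates.append((score + logp, seq + (word,)))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_model(seq):
    """Hypothetical two-step language model (an assumption for illustration)."""
    if len(seq) == 0:
        probs = {"a": 0.6, "b": 0.4}
    elif seq[-1] == "a":
        probs = {"x": 0.55, "y": 0.45}
    else:
        probs = {"x": 0.9, "y": 0.1}
    return {w: math.log(p) for w, p in probs.items()}


greedy = beam_search(toy_model, beam_width=1, max_len=2)
beam = beam_search(toy_model, beam_width=2, max_len=2)
print(greedy[0][1], beam[0][1])
```

In this toy setting, a beam width of 1 (greedy decoding) commits to the locally most probable first word and ends with a sequence of probability 0.33, whereas a beam width of 2 retains the second candidate and finds a sequence of probability 0.36. This is the search error that the quoted passage says beam search reduces.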

Regarding claim 4, Ranzato and Bahdanau teach the method of claim 1.
Ranzato further teaches wherein the training policy is defined by the parameters of a network (Section 3.2.1 Para 1: “Our generative model (the RNN) can be viewed as an agent, which interacts with the external environment (the words and the context vector it sees as input at every time step). The parameters of this agent defines a policy, whose execution results in the agent picking an action. In the sequence generation setting, an action refers to predicting the next word in the sequence at each time step.” Section 3.2.1 Para 1: “The goal of training is to find the parameters of the agent that maximize the expected reward.” In an RNN network, the parameters of an agent define a policy, which is a training policy when it involves training.) and the training policy is used to predict a second word for use as the next word (Section 3.1.1 Para 1: “When using an RNN, each term p(wt|w1,…,wt-1) is modeled as a parametric function as given in Equation (5). This loss function trains the model to be good at greedily predicting the next word at each time step without considering the whole sequence.” Fig. 2 Section 3.1.1: “Training proceeds similar to XENT, except that at each time step we choose with a certain probability whether to take the previous model prediction or the ground truth word. Notice how a) gradients are not backpropagated through the eventual model predictions $w^{g}_{t}$ and b) the XENT loss always uses as target the next word in the reference sequence, even when the input is $w^{g}_{t}$.” Section 3.1.1 Last Para: “One popular way to reduce the effect of search error is to pursue not only one but k next word candidates at each point.” Section 3.1.1 explains the process of predicting the next word using training. In cross-entropy training (XENT), k next word candidates are considered when predicting the next word.).
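
For illustration only, the cross-entropy (XENT) objective referred to in the quoted passages, i.e., the negative sum over time steps of log p(wt|w1,…,wt-1), may be sketched as follows. This is a minimal sketch under stated assumptions: the per-step distributions in `step_probs` are hypothetical numbers standing in for the output of a trained model conditioned on the ground-truth prefix, and the function name is not from Ranzato.

```python
import math


def xent_loss(step_probs, target):
    """Negative log-likelihood of the target word sequence.

    step_probs[t] is a dict giving the model's probability of each word at
    time step t, assumed to be conditioned on the ground-truth prefix.
    """
    return -sum(math.log(probs[word]) for probs, word in zip(step_probs, target))


# Hypothetical per-step distributions (assumptions for illustration).
step_probs = [{"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.75}]
target = ["a", "b"]

loss = xent_loss(step_probs, target)
print(loss)
```

Minimizing this quantity raises the probability the model assigns to each reference word given the preceding reference words, which corresponds to the word-by-word prediction described in Section 3.1.1 of Ranzato.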

Regarding claim 5, Ranzato and Bahdanau teach the method of claim 1. 
Ranzato further teaches wherein the selection of words involves sampling from the policy being learned, (prioritized) training or experienced data, or any other policy (Section 1 Para 3: “While sampling from the model during training is quite a natural step for the REINFORCE algorithm, optimizing directly for any test metric can also be achieved by it.”).

Regarding claim 8, Ranzato and Bahdanau teach the method of claim 1. 
Ranzato also teaches wherein the algorithm is a REINFORCE type algorithm (Section 3.2.1 para 1: “In order to apply the REINFORCE algorithm to the problem of sequence generation we cast our problem in the reinforcement learning (RL) framework.”).

Regarding claim 9, Ranzato and Bahdanau teach the method of claim 1.
Bahdanau also teaches wherein the algorithm is an actor-critic policy-gradient type algorithm (Page 6 Figure 1: The actor-critic algorithm is used, where the actor receives an input sequence and produces samples which are evaluated by the critic. The values computed by the critic are used to approximate the gradient of the expected returns with respect to the parameters of the actor.).
Same motivation to combine the teaching of Ranzato and Bahdanau as claim 1.

Regarding claims 11-15, they are substantially similar to claims 1-5 and are rejected in the same manner, the same art and reasoning applying.

Regarding claim 18, it is substantially similar to claim 8 and is rejected in the same manner, the same art and reasoning applying.


Regarding claim 19, it is substantially similar to claim 9 and is rejected in the same manner, the same art and reasoning applying.

Regarding claim 20, it is substantially similar to claim 1 and is rejected in the same
manner, the same art and reasoning applying.

Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Ranzato
(“Sequence level training with recurrent neural networks”) in view of Bahdanau (“An actor-critic algorithm for sequence prediction”) and further in view of Barnard et al. (“Matching words and pictures”, hereinafter “Barnard”).

Regarding claim 6, Ranzato and Bahdanau teach the method of claim 1.
Neither Ranzato nor Bahdanau teaches wherein the selection of words involves clustering words as a part of a distribution characterization procedure.
Barnard, however, teaches wherein the selection of words involves clustering words as a part of a distribution characterization procedure (Section 5.1 Para 2: “Notice that we have chosen to compute the distribution inherited by the words on a cluster by cluster basis… we can consider cluster dependent level distributions.” Page 1113 Para 2: “To stabilize training, we translate and scale the region feature data to have zero mean and unit variance…We also limit the variance of the Gaussian distribution to be at least 0.001 in the training data space. Similarly, the word frequency is forced to be at least a small value greater than zero (0.01 / vocabulary size)…” Page 1115 Para 2: “Using these parameters, we perform image based word-prediction by finding the corresponding distribution over words.” Page 1129 Para 2: “Of course, clustering is warranted for many applications such as browsing and search where characterization of the training set is important.” A cluster-dependent distribution of words is used in word selection. The distribution is characterized by its mean and variance.).
Before the effective filing date of the invention it would have been obvious to one of ordinary skill in the art to combine the method of Ranzato as modified by Bahdanau with the clustering method of Barnard to predict words given images (Barnard, Page 1110 Para 2).

Regarding claim 16, it is substantially similar to claim 6 and is rejected in the same manner, the same art and reasoning applying.

Response to Amendment
Applicant’s arguments filed on 09/20/2021 with respect to the 35 U.S.C. 103 rejections have been fully considered. Claims 1, 3-5, 11, 13-15, and 20 have been amended by the applicant. The newly amended limitations have been addressed in the 103 rejections above and relevant citations have been provided. All claims remain rejected.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to QAMAR IQBAL, whose telephone number is 571-272-2563. The examiner can normally be reached M-F, 10 am-6 pm (EST).

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 

/Q.I/ 
Examiner 
Art unit 2123
01/04/2022
/BRIAN M SMITH/Primary Examiner, Art Unit 2122