DETAILED ACTION
This action is in response to Applicant’s amendment regarding application number 15/811,728, filed November 14, 2017.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Response to Amendments
The amendment filed March 29, 2022 has been entered. Examiner acknowledges receipt of Amendments to Application 15/811,728, which include: Amendments to the Claims, and Remarks containing Applicant’s amendments. 
Regarding Applicant’s Remarks and Amendments to the Claims, Examiner acknowledges Claims 1, 3, 11, 13, and 20 have been amended, with Claims 7, 10, and 17 previously cancelled, and with Claim 21 newly added. Claims 1-6, 8-9, 11-16, and 18-21 remain pending in the application. 

Response to Arguments
Examiner acknowledges receipt of Arguments to Application 15/811,728, which include: Remarks containing Applicant’s arguments. 
Regarding Applicant's Remarks for Claims 1-5, 8-9, 11-15, and 18-20 under 35 U.S.C. §103 as being unpatentable over Ranzato et al., Sequence Level Training with Recurrent Neural Networks, May 6 2016 [henceforth referred to as Ranzato], in view of Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, May 19 2016 [henceforth referred to as Bahdanau], Examiner acknowledges Applicant’s arguments, has considered them, and has found them not persuasive. Examiner points out that all of Applicant’s arguments are directed towards the amended claim limitations, which were not previously entered. Examiner further notes that the amendments presented in the independent and dependent claims necessitate further examination and re-evaluation of the amended and related original claims. The updated claim mappings according to Applicant’s amended claims are provided in the relevant sections indicated below.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-5, 8-9, 11-15, and 18-21 are rejected under 35 U.S.C. 103 as being unpatentable over Ranzato et al., Sequence Level Training with Recurrent Neural Networks, May 6 2016 [hereafter referred to as Ranzato], in view of Bahdanau et al., An Actor-Critic Algorithm for Sequence Prediction, March 3 2017 [hereafter referred to as Bahdanau], in further view of Lu et al., Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning, June 6 2017 [hereafter referred to as Lu].
Regarding amended Claim 1,
 Ranzato teaches
(currently amended) A computer-implemented method comprising: 
… receiving a first image, with the first image being encoded with contextual information using a … convolutional neural network (CNN), and with the contextual information including a set of spatial features of the first image (Examiner’s note: Ranzato teaches extracting image features from each image in an image dataset by using a convolutional neural network, where the extracted image features by the convolutional neural network represent spatial features of an image, and are further represented by a context containing a sequence of words, and as such, each context for each image represents contextual information that includes a set of spatial features of a corresponding image (Ranzato p.9 Section 4.3: “For the image captioning task, we use the MSCOCO dataset … There are 5 different captions for each image. … The context is represented by 1024 features extracted by a Convolutional Neural Network (CNN) … The RNN is a single layer LSTM with 512 hidden units and the image features are provided to the generative model as the first word in the sequence. …”).) …
… with each spatial feature of the set of spatial features having a respective weight (Examiner’s note: Ranzato teaches using an attentive encoder to compute a context vector based on a context, where the attentive encoder takes the context (taught in Ranzato p.4 Section 4.3, as the source sentence composed of M words $s = [w_1, \ldots, w_M]$, with each word representing a spatial feature) and associates an aggregate embedding $z_j$ to each word $w_j$ in the source sentence, and computes a resulting context vector by applying each word with weights $\alpha_{j,t}$ based on the aggregate embeddings, where Ranzato p.15 equations (13) and (14) teaches that each word (representing a spatial feature) has an associated weight (Ranzato p.3 Section 3 1st paragraph: “… the RNN can also take as input an additional context vector $c_t$, which encodes the context to be used while generating the output. … $c_t$ is computed using an attentive [en]coder … the details of which are given in Section 6.2 …”; and p.15 Section 6.2: “… Let us denote by s the source sentence which is composed of a sequence of M words $s = [w_1, \ldots, w_M]$. … the full embedding for the i-th word in the input sentence is given by $a_i = w_i + l_i$ … we associate an aggregate embedding $z_i$ to each word in the source sentence … its aggregate embedding $z_i$ is computed by taking a window of q consecutive words centered at position i and averaging the embeddings of all the words in this window. … Given these aggregate vectors of words, we compute the context vector $c_t$ (the final output of the encoder) as: $c_t = \sum_{j=1}^{M} \alpha_{j,t} w_j$, where the weights $\alpha_{j,t} = \exp(z_j \cdot h_t) / \sum_{i=1}^{M} \exp(z_i \cdot h_t)$.”).) …
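For illustration only, the weight computation quoted above from Ranzato Section 6.2 can be sketched as follows. This is a minimal sketch and is not code from Ranzato or from the application; the function names (dot, attention_weights, context_vector) and the toy vectors in the usage note are hypothetical, and the subtraction of the maximum score is a standard numerical-stability step that is assumed here rather than taken from the reference.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_weights(z, h_t):
    """alpha_{j,t} = exp(z_j . h_t) / sum_i exp(z_i . h_t), one weight per word."""
    scores = [dot(z_j, h_t) for z_j in z]
    top = max(scores)  # assumed stability trick; does not change the ratios
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(w, z, h_t):
    """c_t = sum_j alpha_{j,t} * w_j, the weighted sum of the word vectors."""
    alpha = attention_weights(z, h_t)
    dim = len(w[0])
    return [sum(alpha[j] * w[j][k] for j in range(len(w))) for k in range(dim)]
```

Under these assumptions, calling attention_weights with two identical aggregate embeddings yields equal weights for the two words, and the weights always sum to one, which is the sense in which each word carries a respective weight.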
… selecting a word for use as a next word in a text stream using natural language processing (NLP) techniques to describe a first spatial feature (Examiner’s note: As indicated earlier, Ranzato teaches extracting image features from each image in an image dataset into a context, and encoding the context into a context vector. Ranzato further teaches using a reinforcement learning based method for generating text, where the method trains a RNN-based model to predict the next word in a sequence of words and applying n-gram/bi-gram test metrics (e.g., BLEU, ROUGE) used in NLP applications to maximize the reward to produce an optimal action sequence, where the optimal action sequence represents a description of the spatial features (Ranzato p.9 Section 4.3 Image Captioning; p.1 Abstract: “Many natural language processing applications use language models to generate text. These models are typically trained to predict the next word in a sequence, given the previous words and some context such as an image …”; p.1 Section 1 2nd paragraph: “… One such metric is called BLEU … which measures the n-gram overlap between the model generation and the reference text …”; p.2 1st-2nd paragraphs: “… we build on the REINFORCE algorithm proposed by Williams (1992) … we introduce Mixed Incremental Cross-Entropy Reinforce (MIXER), which is our first major contribution of this work. MIXER is an easy-to-implement recipe to make REINFORCE work well for text generation applications …”; and p.6 Section 3.2.1 1st paragraph: “… In the sequence generation setting, an action refers to predicting the next word in the sequence at each time step. After taking an action the agent updates its internal state … Once the agent has reached the end of a sequence, it observes a reward. We can choose any reward function … we use BLEU … and ROUGE-2 … since these are the metrics we use at test time … we have a training set of optimal sequences of actions. During training we choose actions according to the current policy and only observe a reward at the end of the sequence, by comparing the sequence of actions from the current policy against the optimal action sequence. The goal of training is to find the parameters of the agent that maximize the expected reward. …”).) …
… determining, by an algorithm, an expected future reward value for the word using a test policy including a training policy and a test-time inference procedure (Examiner’s note: As indicated earlier, Ranzato teaches using a reinforcement learning based method that uses RNN-based models to predict the next word in a sequence. Ranzato further teaches that the models are expected to generate the entire sequence from scratch (where this process of generating an entire sequence from scratch represents a test policy), and achieves this by using a sequence level training algorithm incorporating cross-entropy training (to generate the text sequence during test time), and applying a reinforcement learning algorithm (REINFORCE) to apply test metrics (e.g., BLEU and ROUGE) as reward functions to produce the optimal action sequence to maximize the expected reward, where the sequence level training algorithm incorporating cross-entropy training and reinforcement learning represents the training policy, and the process that uses these test metrics as reward functions to optimize an action sequence to maximize the expected reward represents a test-time inference procedure (with the combined training policy and test-time inference procedure representing a test policy) (Ranzato p.1 Abstract: “However, at test time the model is expected to generate the entire sequence from scratch. This discrepancy makes generation brittle, as errors may accumulate along the way. We address this issue by proposing a novel sequence level training algorithm that directly optimizes the metric used at test time, such as BLEU or ROUGE …”; p.4 Figure 1: “RNN training using XENT (top), and how it is used at test time for generation (bottom) …”; p.6 Section 3.2.1 1st paragraph; and p.8 Figure 4).) …
… with the training policy being used to minimize a negative expected future reward for the word (Examiner’s note: As indicated earlier, Ranzato teaches the sequence level training algorithm incorporating cross-entropy training and reinforcement learning, where this sequence level training algorithm represents the training policy. Ranzato further teaches the goal of the training algorithm is to maximize the expected reward, and defines a loss function based on a negative expected reward for each word chosen by the model at each n-th time step, such that maximizing the expected reward results in minimizing the loss function during training, and hence minimizing the negative expected reward (Ranzato p.4 Section 3.1.1 1st paragraph: “… Cross-entropy loss (XENT) maximizes the probability of the observed sequence according to the model. If the target sequence is $[w_1, w_2, \ldots, w_T]$, then XENT training involves minimizing L … <see p.4 equation (6)> …”; and p.6 Section 3.2.1 1st paragraph: “… The goal of training is to find the parameters of the agent that maximize the expected reward. We define our loss as the negative expected reward: $L_\theta$ … <see p.6 equation (9)> … where $w_n^g$ is the word chosen by our model at the n-th time step, and r is the reward associated with the generated sequence …”).) …
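For illustration only, the relationship relied upon above (maximizing the expected reward is the same as minimizing the negative expected reward) can be sketched on a toy policy. This is a minimal sketch and not Ranzato's implementation: the names sequence_prob and negative_expected_reward are hypothetical, the policy is assumed to choose each word independently of the earlier words (a real RNN policy conditions on them), and the reward function in the usage note is a hypothetical stand-in for BLEU or ROUGE.

```python
import itertools


def sequence_prob(step_probs, seq):
    """Probability of a whole word sequence under a per-step word distribution.

    Assumption: each step's distribution is independent of earlier words.
    """
    p = 1.0
    for t, word in enumerate(seq):
        p *= step_probs[t][word]
    return p


def negative_expected_reward(step_probs, reward_fn):
    """Loss = -(sum over all sequences of probability * reward)."""
    vocab = list(step_probs[0].keys())
    length = len(step_probs)
    loss = 0.0
    for seq in itertools.product(vocab, repeat=length):
        loss -= sequence_prob(step_probs, seq) * reward_fn(seq)
    return loss
```

As a usage example under these assumptions, take a two-word reference ("a", "cat") and a reward equal to the fraction of positions that match the reference. A policy that places probability 0.9 on the matching word at each step gives a lower (more negative) loss than a uniform policy, so lowering this loss corresponds to raising the expected reward.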
While Ranzato teaches minimizing a negative expected future reward through application of the REINFORCE algorithm, Ranzato does not explicitly teach
… normalizing, through the use of a self-critical sequence training (SCST) algorithm, a set of expected future reward estimate(s) …
… using the expected future reward value for the sampled word using the test policy …
… with the normalization of the set of expected future reward estimate(s) utilizing the output of the test-time inference procedure in order to reduce variance …
… with the reduced variance resulting in a stable normalization gradient produced through the use of the SCST algorithm.
Bahdanau teaches
… normalizing, through the use of a self-critical sequence training (SCST) algorithm, a set of expected future reward estimate(s) (Examiner’s note: Under its broadest reasonable interpretation in light of the Applicant’s specification paragraph [0063] and [0070], the term “self-critical sequence training (SCST)” broadly recites a variation of the REINFORCE algorithm for normalizing expected reward estimates, where the term “normalizing” broadly recites an action for conforming an item of interest (in this case the expected reward estimate). Bahdanau teaches an actor-critic algorithm that is based on the REINFORCE algorithm, which instead of using the cumulative reward following an action (which results in a gradient exhibiting high variance), implements an actor-critic architecture that produces a gradient exhibiting low variance for training sequence prediction networks to improve their test time metrics, where the actor RNN is the model that generates the word sequences (represented by output tokens $\hat{y}_t$), and these output tokens are received as inputs to a critic RNN to produce estimates $\hat{Q}(a; \hat{Y}_{1 \ldots t})$, which are then used to approximate the gradient of the returns to optimize the expected rewards (e.g., a BLEU score), such that this process of producing low variance gradients to optimize the expected rewards represents a method for normalizing the expected reward estimates (Bahdanau p.2 2nd paragraph: “… In this work, we propose and study an alternative procedure for training sequence prediction networks that aims to directly improve their test time metrics …”; p.5 Section 3 2nd-3rd paragraphs: “… Due to these simplifications and the potential high variance in the cumulative reward, the REINFORCE gradient estimator has very high variance. In order to improve upon it, we consider the actor-critic estimate from Equation 8, which has a lower variance at the cost of significant bias, since the critic is not perfect and trained simultaneously with the actor … To implement the critic, we propose to use a separate RNN parameterized by $\phi$. The critic RNN is run in parallel with the actor, consumes the tokens $\hat{y}_t$ that the actor outputs and produces the estimates $\hat{Q}(a; \hat{Y}_{1 \ldots t})$ for all a ∈ A … the return $R(\hat{Y}, Y)$ is a deterministic function of Y, and we argue that using Y to compute $\hat{Q}$ should be of great help … See Figure 1 for a visual representation of our actor-critic architecture.”; and p.6 Figure 1 caption: “…The actor receives an input sequence X and produces samples Y which are evaluated by the critic. The critic takes in the … actor’s prediction $\hat{y}_t$ as input at time step t … The values $Q_1, Q_2, \ldots, Q_T$ computed by the critic are used to approximate the gradient of the expected returns with respect to the parameters of the actor. This gradient is used to train the actor to optimize these expected task specific returns (e.g., BLEU score).”).) …
… using the expected future reward value for the sampled word using the test policy (Examiner’s note: As indicated earlier, Bahdanau teaches an actor-critic algorithm where the actor RNN is the model that generates the word sequences (represented by output tokens $\hat{y}_t$), and these output tokens are received as inputs to a critic RNN to produce estimates $\hat{Q}(a; \hat{Y}_{1 \ldots t})$, which are then used to approximate the gradient of the returns to optimize the expected rewards (e.g., a BLEU score), where the usage of test-time metrics (such as a BLEU score) as rewards to optimize the expected rewards represents a usage of the test policy as recited in an earlier claim limitation (Bahdanau p.2 2nd paragraph: “… In this work, we propose and study an alternative procedure for training sequence prediction networks that aims to directly improve their test time metrics …”; p.5 Section 3 last paragraph: “Temporal-difference learning: A crucial component of our approach is policy evaluation, that is the training of the critic to produce useful estimates of $\hat{Q}$. With a naïve Monte-Carlo method, one could use the future return $\sum_{\tau=t}^{T} \gamma_\tau(\hat{y}_\tau; \hat{Y}_{1 \ldots \tau-1})$ as a target to $\hat{Q}(\hat{y}_t; \hat{Y}_{1 \ldots \tau-1})$ …”; p.6 Figure 1 caption; p.9 Section 5.2 2nd paragraph: “… The return is defined as a smoothed and rescaled version of the BLEU score. Specifically, we start all n-gram counts from 1 instead of 0, and multiply the resulting score by the length of the ground-truth translation. Smoothing is a common practice when sentence-level BLEU score is considered, and it has been used to apply REINFORCE in similar settings (Ranzato et al., 2015).”).) …
… with the normalization of the set of expected future reward estimate(s) utilizing the output of the test-time inference procedure in order to reduce variance (Examiner’s note: As indicated earlier, Bahdanau teaches an actor-critic algorithm that is based on the REINFORCE algorithm that produces a gradient exhibiting low variance for training sequence prediction networks to improve their test time metric, where the output tokens $\hat{y}_t$ produced by the actor RNN model are received as inputs to a critic RNN to produce estimates $\hat{Q}(a; \hat{Y}_{1 \ldots t})$, which are then used to approximate the gradient of the returns to optimize the expected rewards (e.g., a BLEU score), and hence representing a process for normalizing the expected reward estimates. Bahdanau further teaches an extension of the REINFORCE algorithm that leverages extra information from the ground-truth output to further lower the variance of the REINFORCE algorithm, such that this extension in combination with the prior actor-critic gradient teaching represents a process for utilizing the output of the test-time inference in order to reduce variance (Bahdanau p.2 2nd paragraph; Bahdanau p.5 Section 3 2nd-3rd paragraphs; p.6 Figure 1 caption; and p.8 1st paragraph: “… We also propose a novel extension of REINFORCE that leverages the extra information available in the ground-truth output Y. Specifically, we use the $\hat{Q}$ estimates produced by the critic network as the baseline for the REINFORCE algorithm. The motivation behind this approach is that using the ground-truth output should produce a better baseline that lowers the variance of REINFORCE, resulting in higher task-specific scores. We refer to this method as REINFORCE-critic.”).) …
… with the reduced variance resulting in a stable normalization gradient produced through the use of the SCST algorithm (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites steps or techniques for maintaining the gradient that is being produced by the actor-critic based REINFORCE algorithm. As indicated earlier, Bahdanau teaches an actor-critic algorithm that is based on the REINFORCE algorithm that produces a gradient exhibiting low variance to optimize the expected rewards. Bahdanau further teaches additional optimization techniques including penalizing the critic’s variance, combining an RL training objective with log-likelihood to prevent vanishing gradients, and using suitable training criteria that have a well-behaved gradient, where these optimization techniques will reduce variance and stabilize the normalization gradients produced by the algorithm (Bahdanau p.11 Section 6 1st-3rd paragraphs: “… we were able to significantly reduce the gap in the training speed and achieve a better test error using our critic network as the baseline for REINFORCE. … We ran into several optimization issues. The critic would sometimes assign very high values to actions with a very low probability according to the actor. We were able to resolve this by penalizing the critic’s variance … We noticed that the action distribution tends to saturate and become deterministic, causing the gradient to vanish. We found that combining an RL training objective with log-likelihood can help … one can look for suitable training criteria that have a well-behaved gradient even when the policy has little or no stochasticity.”).).
Both Ranzato and Bahdanau are analogous art since both teach generating token/word sequences using a reinforcement learning (REINFORCE) algorithm for caption generation.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the REINFORCE algorithm taught in Ranzato and enhance it to include the actor-critic based REINFORCE algorithm taught in Bahdanau as a way to reduce the high variance exhibited in the REINFORCE algorithm. The motivation to combine is taught in Bahdanau, as a way to reduce the high variance observed during gradient estimation in the regular REINFORCE algorithms, which also results in faster fitting of the training method to the training data, thus making this algorithm more computationally efficient than the regular REINFORCE algorithm (Bahdanau p.5 Section 3 3rd paragraph: “… In order to improve upon it, we consider the actor-critic estimate from Equation 8, which has a lower variance at the cost of significant bias …”; p.8 1st paragraph; and p.11 Section 6 1st-2nd paragraphs: “… We showed that our method leads to significant improvements over maximum likelihood training on both a synthetic task and a machine translation benchmark … actor-critic fits the training data much faster …”).
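For illustration only, the variance-reduction rationale relied upon above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions and is not code from Ranzato or Bahdanau: the three-word vocabulary, the reward values, and the function names are hypothetical, and a constant baseline equal to the expected reward stands in for a learned critic estimate. The sketch computes the exact mean and variance of the one-sample REINFORCE gradient estimator with and without the baseline.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    # d/d logit_k of log p(action) = 1[k == action] - p_k for a softmax policy.
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def estimator_moments(logits, rewards, baseline):
    # Exact mean and total variance of (r(a) - baseline) * grad log p(a),
    # obtained by enumerating every action instead of sampling.
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second = [0.0] * n
    for a, p_a in enumerate(probs):
        g = grad_log_prob(probs, a)
        for k in range(n):
            est = (rewards[a] - baseline) * g[k]
            mean[k] += p_a * est
            second[k] += p_a * est * est
    variance = sum(s - m * m for s, m in zip(second, mean))
    return mean, variance

logits = [0.0, 0.0, 0.0]          # hypothetical policy parameters
rewards = [10.0, 11.0, 12.0]      # hypothetical sentence-level scores
probs = softmax(logits)
expected_reward = sum(p * r for p, r in zip(probs, rewards))

mean_plain, var_plain = estimator_moments(logits, rewards, baseline=0.0)
mean_base, var_base = estimator_moments(logits, rewards, baseline=expected_reward)
print(var_plain, var_base)
```

In this toy setting both estimators have the same mean, so subtracting the baseline leaves the expected gradient unchanged while the variance drops.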
While Ranzato in view of Bahdanau teaches using a convolutional neural network to extract image features from an image, Ranzato in view of Bahdanau does not explicitly teach
… using a residual convolutional neural network …
Lu teaches
… using a residual convolutional neural network (Examiner’s note: Under its broadest reasonable interpretation in light of Applicant’s specification [0085], this limitation broadly recites the use of a ResNet model as a type of residual convolutional neural network. Lu teaches image captioning using the spatial feature outputs of the last convolutional layer of a ResNet model to represent images, where the ResNet model represents a residual convolutional neural network (Lu p.4 Section 3 2nd paragraph: “… Encoder-CNN: The encoder uses a CNN to get the representation of images. Specifically, the spatial feature outputs of the last convolutional layer of ResNet [10] are used … We use $A = \{a_1, \ldots, a_k\},\ a_i \in \mathbb{R}^{2048}$ to represent the spatial CNN features at each of the k grid locations …”).) …
Both Ranzato in view of Bahdanau and Lu are analogous art since both teach image captioning using features extracted by a convolutional neural network.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the convolutional neural network taught in Ranzato in view of Bahdanau and substitute it with the residual convolutional neural network taught in Lu as a way to get the representation of images by extracting the spatial features from each image. The motivation to combine is taught in Lu, as residual networks exhibit superior performance, thus improving the performance of the system (Lu p.3 2nd paragraph: “… Our motivation stems from the superior performance of residual network [10] …”).
Regarding original Claim 2,
 Ranzato in view of Bahdanau, in further view of Lu teaches
(original) The method of claim 1, wherein the test-time inference procedure is utilized only in the normalization of the set of expected future reward estimate(s) (Examiner’s note: Under its broadest reasonable interpretation, this limitation is a re-phrasing of an earlier limitation already presented in the independent claim: “… with the normalization of the set of expected future reward estimate(s) utilizing the output of the test-time inference procedure in order to reduce variance”. As indicated earlier, Bahdanau teaches an actor-critic algorithm that is based on the REINFORCE algorithm that produces a gradient exhibiting low variance for training sequence prediction networks to improve their test time metric, where the output tokens $\hat{y}_t$ produced by the actor RNN model are received as inputs to a critic RNN to produce estimates $\hat{Q}(a; \hat{Y}_{1 \ldots t})$, which are then used to approximate the gradient of the returns to optimize the expected rewards (e.g., a BLEU score), and hence representing a process for normalizing the expected reward estimates. As indicated earlier, Bahdanau further teaches an extension of the REINFORCE algorithm that leverages extra information from the ground-truth output to further lower the variance of the REINFORCE algorithm, such that this extension in combination with the prior actor-critic gradient teaching further represents a process for utilizing the output of the test-time inference in order to reduce variance (Bahdanau p.2 2nd paragraph; Bahdanau p.5 Section 3 2nd-3rd paragraphs; p.6 Figure 1 caption; and p.8 1st paragraph: “… We also propose a novel extension of REINFORCE that leverages the extra information available in the ground-truth output Y. Specifically, we use the $\hat{Q}$ estimates produced by the critic network as the baseline for the REINFORCE algorithm. The motivation behind this approach is that using the ground-truth output should produce a better baseline that lowers the variance of REINFORCE, resulting in higher task-specific scores. We refer to this method as REINFORCE-critic.”).).
Regarding amended Claim 3, 
Ranzato in view of Bahdanau, in further view of Lu teaches
(currently amended) The method of claim 1, wherein an Attention Model ("Att2in") is used to dynamically re-weight the set of spatial features of the first image (Examiner’s note: Lu teaches an adaptive spatial attention model that computes a new context vector $c_t$ by learning weight parameters $W_x$ and $W_t$, where the weight parameter $W_x$ is applied to each input $x_t$ at time step t. Lu further teaches these weight parameters are used to adapt the attention weight $\alpha_t$ over the spatial features in V, such that this adaptation process represents a process in which an attention model is used to dynamically re-weight a set of spatial features weights in an image (Lu pp.2-3 Section 2.2 Spatial Attention Model, including equations (7) and (8) and col.1 1st paragraph: “… $\boldsymbol{\alpha} \in \mathbb{R}^{k}$ is the attention weight over features in V. Based on the attention distribution, the context vector $c_t$ can be obtained by <see equation (8)> where $c_t$ and $h_t$ are combined to predict next word $y_{t+1}$ as in Equation 3.”; pp.3-4 Section 2.3 Adaptive Attention Model, including equations (9) and (12), and p.12 col.1 3rd paragraph: “… we extend our spatial attention model, and propose an adaptive model that is able to determine whether it needs to attend the image to predict next word.”; and p.12 col.2 2nd paragraph: “… we propose an adaptive attention model to compute the context vector … our new adaptive context vector is defined as $\hat{c}_t$, which is modeled as a mixture of the spatially attended image features (i.e., context vector of spatial attention model) … we add an additional element to z, the vector containing attention scores … The addition of this extra element is summarized by converting Equation 7 to: $\hat{\alpha}_t$ … <see equation (12)> … $\hat{\alpha}_t \in \mathbb{R}^{k+1}$ is the attention distribution over both the spatial image feature as well as the visual sentinel vector …”).).
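For illustration only, the re-weighting of spatial features by an attention distribution, as described in the mapping above, can be sketched as follows. This is a simplified sketch and not Lu's implementation: the attention scores are supplied directly rather than computed from a hidden state, and the feature values, score values, and function names are hypothetical. The weights are obtained with a softmax over the scores, and the context vector is the weighted sum of the spatial features.

```python
import math

def softmax(scores):
    # Numerically stable softmax over per-location attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(features, scores):
    # features: k spatial feature vectors of equal length.
    # scores: k unnormalized attention scores (assumed given here).
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, context

# Three hypothetical grid locations with two-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give a uniform average of the locations.
weights_a, context_a = context_vector(features, [0.0, 0.0, 0.0])

# A high score on the first location shifts the context toward its features.
weights_b, context_b = context_vector(features, [5.0, 0.0, 0.0])
print(context_a, context_b)
```

Changing the scores at each time step changes the weights, which is the sense in which the same set of spatial features is dynamically re-weighted.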
Regarding previously presented Claim 4, 
Ranzato in view of Bahdanau, in further view of Lu teaches
(previously presented) The method of claim 1, wherein the training policy is defined by the parameters of a network (Examiner’s note: As indicated earlier, Ranzato further teaches using a reinforcement learning based method for generating text, where in the context of the reinforcement learning based method (REINFORCE), the RNN model is viewed as an agent, and the parameters of the RNN model used during training define a training policy (Ranzato Section 3.2.1 Para 1: “In order to apply the REINFORCE algorithm … Our generative model (the RNN) can be viewed as an agent, which interacts with the external environment (the words and the context vector it sees as input at every time step). The parameters of this agent defines a policy, whose execution results in the agent picking an action. In the sequence generation setting, an action refers to predicting the next word in the sequence at each time step … we have a training set of optimal sequences of actions … The goal of training is to find the parameters of the agent that maximize the expected reward.”).), and
the training policy is used to predict a second word for use as the next word (Examiner’s note: As indicated earlier, Ranzato teaches the sequence level training algorithm incorporating cross-entropy training and reinforcement learning, where the cross-entropy training loss function is used to train the model to greedily predict the next word at each time step, and where this next word is used to predict the subsequent next word in the sequence (as shown in Ranzato p.4 Figure 1) (Ranzato pp.4-5 Section 3.1.1 1st-5th paragraphs: “Cross-entropy loss (XENT) maximizes the probability of the observed sequence according to the model … When using an RNN, each term p(w_t | w_1, …, w_{t-1}) is modeled as a parametric function as given in Equation (5). This loss function trains the model to be good at greedily predicting the next word at each time step without considering the whole sequence. Training proceeds by truncated back-propagation through time … with gradient clipping … Once trained, one can use the model to generate an entire sequence … Let w_t^g denote the word generated by the model at the t-th time step. Then the next word is generated by: w_{t+1}^g … <see equation (7)> …”; and p.4 Figure 1 caption: “RNN training using XENT (top) … Predictions are produced by either taking the argmax or by sampling from the distribution over words”).).
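For illustration only, the cross-entropy training signal and the greedy next-word generation described in the passage quoted from Ranzato Section 3.1.1 can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Ranzato: the table TOY_MODEL, its probabilities, and the function names greedy_decode and xent_loss are hypothetical, and a lookup conditioned only on the previous word stands in for the RNN, which conditions on the full history through its hidden state.

```python
import math

# Hypothetical stand-in for a trained model: p(next word | previous word).
# A real RNN conditions on the whole history; this table is a simplification.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "a": {"dog": 0.9, "cat": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def xent_loss(model, target, start="<s>"):
    """Negative log-likelihood of the observed sequence (teacher forcing).

    Each term uses the ground-truth previous word, so the model is only
    trained to predict the next word one step at a time.
    """
    loss = 0.0
    prev = start
    for word in target:
        loss -= math.log(model[prev][word])
        prev = word
    return loss


def greedy_decode(model, start="<s>", end="</s>", max_len=10):
    """Generate a sequence by taking the argmax word at every time step.

    The word generated at step t is fed back in to predict the word at t+1.
    """
    words = []
    prev = start
    for _ in range(max_len):
        dist = model[prev]
        nxt = max(dist, key=dist.get)  # argmax over the word distribution
        if nxt == end:
            break
        words.append(nxt)
        prev = nxt
    return words


print(greedy_decode(TOY_MODEL))
print(xent_loss(TOY_MODEL, ["the", "cat", "</s>"]))
```

In this toy setting the loss is computed from ground-truth prefixes while generation feeds the model its own predictions, which is the train/test mismatch that the sequence level training discussed in the cited reference is directed at.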
Regarding previously presented Claim 5, 
Ranzato in view of Bahdanau, in further view of Lu teaches
 (previously presented) The method of claim 1, wherein the selection of words involves 
sampling from the policy being learned, (prioritized) training or experienced data, or any other policy (Examiner’s note: Ranzato p.8 Algorithm 1 teaches sampling from the REINFORCE algorithm during training, where the pseudo code indicates that during RNN model training, XENT loss is used for the first s steps, and REINFORCE (sampling from the model) is used in the remaining T-s steps, where this sampling from the model represents a sampling from the policy being learned (Ranzato p.2 1st paragraph: “… While sampling from the model during training is quite a natural step for the REINFORCE algorithm, optimizing directly for any test metric can also be achieved by it.”; p.8 Algorithm 1 MIXER pseudo-code; and p.8 1st paragraph: “… We call this algorithm Mixed Incremental Cross-Entropy Reinforce (MIXER) because we combine both XENT and REINFORCE, and we use incremental learning … By the end of training, the model can make effective use of its own predictions in-line with its use at test time.”).).
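For illustration only, the split described above, in which XENT is applied to the first s steps and REINFORCE (sampling from the model) to the remaining T-s steps, can be sketched as follows. This is a minimal sketch and not a reproduction of Ranzato Algorithm 1: the function names mixer_step_modes and annealed_schedule and the decrement parameter delta are hypothetical, and the assumption that s is reduced in fixed decrements is an illustrative reading of the “incremental learning” language quoted above.

```python
def mixer_step_modes(T, s):
    """Return the training mode used at each of the T time steps.

    The first s steps use the cross-entropy (XENT) loss on ground-truth
    words; the remaining T - s steps sample from the model (REINFORCE).
    """
    if not 0 <= s <= T:
        raise ValueError("s must satisfy 0 <= s <= T")
    return ["XENT"] * s + ["REINFORCE"] * (T - s)


def annealed_schedule(T, delta):
    """Hypothetical annealing of s from T down to 0 in steps of delta."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    schedule = []
    s = T
    while s > 0:
        schedule.append(s)
        s -= delta
    schedule.append(0)  # final stage: every step samples from the model
    return schedule


for s in annealed_schedule(T=5, delta=2):
    print(s, mixer_step_modes(5, s))
```

Under these assumptions, each successive stage hands more of the sequence to the model's own samples, so that by the last stage the whole sequence is generated by sampling from the policy being learned.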
Regarding original Claim 8, 
Ranzato in view of Bahdanau, in further view of Lu teaches
(original) The method of claim 1 wherein the algorithm is a REINFORCE type algorithm (Examiner’s note: As indicated earlier, Ranzato teaches the sequence level training algorithm incorporating cross-entropy training and reinforcement learning, where the reinforcement learning is based on the REINFORCE algorithm, and as such results in the sequence level training algorithm being a REINFORCE type algorithm (Ranzato p.2 1st paragraph: “… we build on the REINFORCE algorithm proposed by Williams (1992) …”; and Section 3.2.1 1st paragraph: “In order to apply the REINFORCE algorithm to the problem of sequence generation we cast our problem in the reinforcement learning (RL) framework …”).).
Regarding original Claim 9, 
Ranzato in view of Bahdanau, in further view of Lu teaches
(original) The method of claim 1 wherein the algorithm is an actor-critic policy-gradient type algorithm (Examiner’s note: Bahdanau teaches an actor-critic algorithm, where the values Q_1, Q_2, …, Q_T computed by the critic are used to approximate the gradient of the expected returns with respect to the parameters of the actor, where these calculated gradient approximations result in the actor-critic algorithm representing an actor-critic policy-gradient type algorithm (Bahdanau p.4 Algorithm 1 Actor-Critic Training for Sequence Prediction, step 6; p.5 Section 3 1st-2nd paragraphs; and p.6 Figure 1 caption: “… The values Q_1, Q_2, …, Q_T computed by the critic are used to approximate the gradient of the expected returns with respect to the parameters of the actor …”).).
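For illustration only, the way critic values can be used to approximate the gradient of the expected return with respect to the actor's parameters can be sketched for a single time step as follows. This is a minimal sketch under stated assumptions and is not the procedure of Bahdanau Algorithm 1: the actor is assumed to be a softmax over a three-word vocabulary parameterized directly by its logits, the critic values are fixed hypothetical numbers rather than the output of a trained critic, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expected_q(logits, q_values):
    """Expected critic value under the actor: sum_a p(a) * Q(a)."""
    return sum(p * q for p, q in zip(softmax(logits), q_values))


def actor_gradient(logits, q_values):
    """Gradient of expected_q with respect to the logits, Q held fixed.

    For a softmax actor, d/d(logit_j) sum_a p(a) Q(a) = p_j * (Q_j - V),
    where V = sum_a p(a) Q(a).
    """
    p = softmax(logits)
    v = sum(pa * qa for pa, qa in zip(p, q_values))
    return [pa * (qa - v) for pa, qa in zip(p, q_values)]


logits = [0.5, -1.0, 2.0]   # hypothetical actor parameters
q_values = [1.0, 0.0, 3.0]  # hypothetical critic outputs for each word
print(actor_gradient(logits, q_values))
```

Taking a step along this gradient raises the probability of words whose critic value exceeds the current expected value and lowers the probability of the others, which is the sense in which the critic's values drive a policy-gradient update of the actor.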
Regarding amended Claim 11, 
Claim 11 recites a computer program product (CPP) comprising a computer readable storage medium, and computer code stored on the computer readable storage medium for causing a processor(s) set to perform operations comprising claim limitations that are similar in scope to corresponding claim limitations in Claim 1, and hence is rejected under similar rationale and motivations provided by Ranzato, Bahdanau, and Lu as indicated in Claim 1. In addition, Ranzato teaches interactive AI systems that generate text for various applications, such as machine translation, video/text summarization, and question answering, where these systems include text generation models based on recurrent neural networks (Ranzato p.1 Section 1 1st-2nd paragraphs). A person having ordinary skill in the art would understand that an interactive AI system that includes text generation models represents a computer system that contains at least one processor and an associated computer readable storage medium storing computer code for executing the instructions to train and deploy the text generation models to implement the various mentioned applications.
Regarding original Claim 12,
Claim 12 recites the CPP of claim 11, further comprising claim limitations that are similar in scope to corresponding claim limitations in Claim 2, and hence is rejected under similar rationale provided by Ranzato in view of Bahdanau, in further view of Lu as indicated in Claim 2, in view of rejections from Claim 11.
Regarding amended Claim 13,
Claim 13 recites the CPP of claim 11, further comprising claim limitations that are similar in scope to corresponding claim limitations in Claim 3, and hence is rejected under similar rationale provided by Ranzato in view of Bahdanau, in further view of Lu as indicated in Claim 3, in view of rejections from Claim 11.
Regarding previously presented Claim 14,
Claim 14 recites the CPP of claim 11, further comprising claim limitations that are similar in scope to corresponding claim limitations in Claim 4, and hence is rejected under similar rationale provided by Ranzato in view of Bahdanau, in further view of Lu as indicated in Claim 4, in view of rejections from Claim 11.
Regarding previously presented Claim 15,
Claim 15 recites the CPP of claim 11, further comprising claim limitations that are similar in scope to corresponding claim limitations in Claim 5, and hence is rejected under similar rationale provided by Ranzato in view of Bahdanau, in further view of Lu as indicated in Claim 5, in view of rejections from Claim 11.
Regarding original Claim 18,
Claim 18 recites the CPP of claim 11, further comprising claim limitations that are similar in scope to corresponding claim limitations in Claim 8, and hence is rejected under similar rationale provided by Ranzato in view of Bahdanau, in further view of Lu as indicated in Claim 8, in view of rejections from Claim 11.
Regarding original Claim 19,
Claim 19 recites the CPP of claim 11, further comprising claim limitations that are similar in scope to corresponding claim limitations in Claim 9, and hence is rejected under similar rationale provided by Ranzato in view of Bahdanau, in further view of Lu as indicated in Claim 9, in view of rejections from Claim 11.
Regarding amended Claim 20,
Claim 20 recites a computer system comprising a processor(s) set, a computer readable storage medium, and computer code stored on the computer readable storage medium for causing a processor(s) set to perform operations comprising claim limitations that are similar in scope to corresponding claim limitations in Claim 1, and hence is rejected under similar rationale and motivations provided by Ranzato, Bahdanau, and Lu as indicated in Claim 1. In addition, Ranzato teaches interactive AI systems that generate text for various applications, such as machine translation, video/text summarization, and question answering, where these systems include text generation models based on recurrent neural networks (Ranzato p.1 Section 1 1st-2nd paragraphs). A person having ordinary skill in the art would understand that an interactive AI system that includes text generation models represents a computer system that contains at least one processor and an associated computer readable storage medium storing computer code for executing the instructions to train and deploy the text generation models to implement the various mentioned applications.
Regarding new Claim 21,
Claim 21 recites the computer system of claim 20, further comprising claim limitations that are similar in scope to corresponding claim limitations in Claim 3, and hence is rejected under similar rationale provided by Ranzato in view of Bahdanau, in further view of Lu as indicated in Claim 3, in view of rejections from Claim 20.
Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Ranzato et al., Sequence Level Training with Recurrent Neural Networks, May 6 2016 [hereafter referred to as Ranzato], in view of Bahdanau et al., An Actor-Critic Algorithm for Sequence Prediction, March 3 2017 [hereafter referred to as Bahdanau], in further view of Lu et al., Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning, June 6 2017 [hereafter referred to as Lu] as applied to Claims 1 and 11; in even further view of Barnard et al., Matching Words and Pictures, Journal of Machine Learning Research 3 (2003), published 2/03 [hereafter referred to as Barnard].
Regarding previously presented Claim 6, 
Ranzato in view of Bahdanau, in further view of Lu as applied to Claim 1 teaches
(previously presented) The method of claim 1.
While Ranzato in view of Bahdanau, in further view of Lu teaches identifying a distribution of top k scoring words predicted at a previous time step (Ranzato pp.5-6 Section 3.1.3), Ranzato in view of Bahdanau, in further view of Lu does not explicitly teach
wherein the selection of words involves clustering words as a part of a distribution characterization procedure.
Barnard teaches
wherein the selection of words involves clustering words as a part of a distribution characterization procedure (Examiner’s note: Under its broadest reasonable interpretation in light of Applicant’s specification [0115], this limitation broadly recites clustering a distribution of words, where this clustering procedure is defined as a distribution characterization procedure. Barnard teaches constructing hierarchical models for associating text with images, where the image regions and associated text are generated by nodes arranged in a tree structure (with the nodes generating the image regions using a Gaussian distribution, and generating the words using a multinomial distribution). Barnard further teaches that the image region (“blob”) distributions are characterized by the mean and variance of the image region features, and that image-based word prediction is performed by finding the corresponding distribution over words given an image using these collections of blobs, where these collections of blobs represent a cluster (and hence a clustering of words associated with an image). Barnard further teaches extending the hierarchical model by encoding to some extent the correspondence between specific image regions and words through co-occurrence, where this correspondence between specific image regions and words through co-occurrence further represents a second level of clustering of “topics” at the nodes (Barnard pp.1111-1112 Section 3.1: “… As shown in Figure 1, images and co-occurring text are generated by nodes arranged in a tree structure. The nodes generate … words using a multinomial distribution. Each cluster is associated with a path from a leaf to the root …” and p.1112 Figure 1; p.1113 2nd paragraph: “… To stabilize training, we translate and scale the region feature data to have zero mean and unit variance … We also limit the variance of the Gaussian distribution to be at least 0.001 in the training data space. Similarly, the word frequency is forced to be at least a small value greater than zero (0.01 / vocabulary size) …”; p.1115 2nd paragraph: “… Using these parameters, we perform image based word-prediction by finding the corresponding distribution over words … The distribution over words given an image (that is, a collection of blobs) is p(w|b) … <see equation on p.1115> …”; p.1116 Section 4.2 1st-2nd paragraphs: “Our hierarchical clustering models … do encode this correspondence to some extent through co-occurrence because there is a advantage to having “topics” collect at the nodes …”; p.1117 Section 5 1st paragraph: “… there is a relationship between clustering and correspondence … This suggest building explicit correspondence information into our existing hierarchical clustering models. Building correspondence models involves strengthening the relationship between words and image regions.”; p.1117 Section 5.1 1st-2nd paragraphs: “… Notice that we have chosen to compute the distribution inherited by the words on a cluster by cluster basis … we can consider cluster dependent level distributions …”; and p.1129 2nd paragraph: “… Of course, clustering is warranted for many applications such as browsing and search where characterization of the training set is important …”).).
Both Ranzato in view of Bahdanau, in further view of Lu and Barnard are analogous art since both teach generating models that predict a distribution of words for images.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the output word distributions generated by the REINFORCE-based actor-critic model taught in Ranzato in view of Bahdanau, in further view of Lu and further identify and associate clusters of words based on computing the mean and variance taught in Barnard as a way to improve the prediction of words given a set of images. The motivation to combine is taught in Barnard, as a way to improve the association between images and text by strengthening the relationship between words and image regions by avoiding choosing outlier words that have no associated relationship, thus improving the accuracy and performance of the model (Barnard p.1110 2nd paragraph; p.1117 Section 5 1st paragraph; and p.1129 Section 7.2 2nd paragraph).
Regarding previously presented Claim 16,
Claim 16 recites the CPP of claim 11, further comprising claim limitations that are similar in scope to corresponding claim limitations in Claim 6, and hence is rejected under similar rationale and motivations provided by Ranzato in view of Bahdanau, in further view of Lu and Barnard as indicated in Claim 6, in view of rejections from Claim 11.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM WAI YIN KWAN whose telephone number is 303-297-4332. The examiner can normally be reached Monday-Friday 8:00am - 4:30pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on 571-272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/WILLIAM WAI YIN KWAN/Examiner, Art Unit 2121

/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121