DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments with respect to the claim(s) have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Prior art references Oura et al., “Multimodal Deep Neural Network with Image Sequence Features for Video Captioning” (Oura), Gu et al., “An Empirical Study of Language CNN for Image Captioning” (Gu), and Devlin et al., “Exploring Nearest Neighbor Approaches for Image Captioning” (Devlin) have been newly added to teach the newly added claim limitations. The previously applied prior art reference Mao et al., “Deep Captioning with Multimodal Recurrent Neural Networks (M-RNN)” (Mao) is no longer relied upon in this Office Action.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-5 and 7-15 are rejected under 35 U.S.C. 103 as being unpatentable over Oura et al., “Multimodal Deep Neural Network with Image Sequence Features for Video Captioning” (Oura) in view of Gu et al., “An Empirical Study of Language CNN for Image Captioning” (Gu), and further in view of Jin et al., US 2017/0206435 A1 (Jin).
Regarding claim 1, Oura teaches a method, the method comprising: for each image of a plurality of images (for each image in a sequence of images of a video clip) (pages 2-3; Section III., 1st paragraph and Fig. 3):
processing pixels of the image (processing image fragments; wherein image fragments are made of pixels) (p. 2; right column, 1st paragraph and Fig. 2) by a textual generator model (NeuralTalk2, which uses a multimodal RNN model) (p. 2; right column, 1st paragraph and Fig. 2) to obtain a set of phrases that are descriptive of the content of the image (assigning text fragments to image fragments which can be the 19 most important regions that it detects as well as the whole image; which includes words) (p. 2; right column, 1st paragraph and Fig. 2), wherein each phrase is one or more terms (wherein the text is one or more words) (p. 2; Fig. 2); 
training (training the multimodal RNN) (pages 2-3; Section III and Fig. 3) a multimodal image classifier (multimodal deep neural network with image sequence features (MDNNiSF) for generating a sentence description of a given video clip) (p. 1; Abstract and p. 3; Fig. 3) on the predicted text for the images and the image pixels for the images (based on the text fragments and image fragments) (pages 2-3; Section III and Fig. 3) to produce, as output, labels of an output taxonomy to classify an image based on the image as input (outputting, after the remaining training fine-tunes the video caption data, a sentence description for a given video clip) (p. 1; Abstract, p. 3; right column, 2nd paragraph, and p. 6; Fig. 4).
Oura teaches using a model for generating a sentence description for a given video clip (p. 1; Abstract and p. 6; Fig. 4). However, Oura does not explicitly teach a method performed by “one or more data processing apparatus”, “processing the set of phrases by a textual embedding model to obtain an embedding of predicted text for the image; and processing the image using an image embedding model to obtain an embedding of image pixels of the image”, or training based on the “embeddings” of predicted text and image pixels for the images.
Gu teaches a method, the method comprising: for each image of a plurality of images (for each image of a plurality of images) (p. 8; Figure 4): processing a set of phrases (processing a word) (p. 3; Section 3.2, 1st through 3rd paragraphs and Figure 1) by a textual embedding model (by CNNL) (p. 3; Section 3.2, 1st through 3rd paragraphs and Figure 1) to obtain an embedding of predicted text for the image (a word embedding layer) (p. 3; Section 3.2, 1st through 3rd paragraphs and Figure 1); and processing the image using an image model (using CNNI) to obtain image pixels of the image (extracting image features V with CNNI) (p. 2; Section 3.1, 1st paragraph and p. 5; Section 3.6.1, 1st paragraph); training (the model being trained by recursively applying equations 2-5 to predict the word/sentence) (pages 2-3; Section 3.1 and Figure 1) a multimodal image classifier (the model including a multimodal fusion layer that inputs into the recurrent neural network (RNN)) (page 3; Figure 1 and p. 4; Section 3.3) on the embeddings of predicted text for the images and the image pixels for the images (wherein the multimodal fusion layer fuses words representation and image features from the CNNL and the CNNI respectively) (page 3; Figure 1 and p. 4; Section 3.3) to produce, as output, labels of an output taxonomy to classify an image based on the image as input (outputting a label to the image to classify the image as what it depicts) (pages 2-3; Section 3.1, Figure 1 and pages 7-8; Section 4.4, Figure 3).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Oura to include using both the text embedding and image fragments for training since it leads to a faster training process (Gu; p. 4, Section 3.3), which leads to performance improvements (Gu; p. 8, Section 5).
Gu teaches a model that obtains an embedding of predicted text for the image (a word embedding layer) (p. 3; Section 3.2, 1st through 3rd paragraphs and Figure 1) and obtains image pixels of the image (extracting image features V with CNNI) (p. 2; Section 3.1, 1st paragraph and p. 5; Section 3.6.1, 1st paragraph). Gu and Oura both teach a model that would obviously have to be used on a processing device; however, neither explicitly teaches a method performed by “one or more data processing apparatus”, “processing the image using an image embedding model to obtain an embedding of image pixels of the image”, or training based on the “embeddings” of image pixels for the images.
Jin teaches a method performed by one or more data processing apparatus (a computing device 102 having a processing system 104 that includes one or more processing devices (e.g., processors)) (Fig. 1; [0027]), the method comprising: for each image of a plurality of images (for each image) (Fig. 3; Abstract, [0034], and [0044]): processing the set of phrases (processing the set of text labels) ([0032-0033]) by a textual embedding model (multi-instance embedding (MIE) module 114) (Fig. 1; [0032-0033]) to obtain an embedding of predicted text for the image (embedding of predicted text labels for a given region of an image) ([0045] and [0067]); and processing the image using an image embedding model (MIE module 114) (Fig. 1; [0035]) to obtain an embedding of image pixels of the image (based on the proposed candidate region having a large enough pixel size for the image-text embedding space 302, for embedding regions of the images) (Fig. 3; [0032], [0035], and [0055-0057]); training an image classifier (training the multi-instance embedding model 112) (Fig. 1; [0033]) on the embeddings of predicted text for the images (embedding of predicted text labels for a given region of an image) ([0045] and [0067]) and the embeddings of image pixels for the images (for embedding regions of the images) (Fig. 3; [0032], [0035], and [0055-0057]) (training MIE model 112 based on the outputs of MIE module 114) (Fig. 1; [0033] and [0046-0048]) to produce, as output, labels of an output taxonomy to classify an image based on the image as input (annotating image regions with text labels which describe/classify the image) ([0004], [0024], [0037-0038], and [0059]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of prior art references to include the embedding space for images as in Jin since it more accurately represents the relationships between semantic concepts than do embedding spaces that model semantic concepts as single points (Jin; [0081]).

Regarding claim 2, Oura teaches wherein the textual generator model (NeuralTalk2, which uses a multimodal RNN model) (p. 2; right column, 1st paragraph and Fig. 2) is a textual query based model trained on textual query-image pairs (given a set of video clip-sentence pairs, training the S2VT model using its learning procedures and then training the multimodal RNN) (p. 3; right column, 2nd paragraph and Fig. 3). Jin also teaches a textual query based model trained on textual query-image pairs (trained using an image-label training data pair) ([0017] and [0060]).

Regarding claim 3, Jin teaches wherein processing the image using an image embedding model (MIE module 114) (Fig. 1; [0035]) to obtain an embedding of image pixels of the image (based on the proposed candidate region having a large enough pixel size for the image-text embedding space 302, for embedding regions of the images) (Fig. 3; [0032], [0035], and [0055-0057]) comprises obtaining image features (computing d-dimensional feature vectors for the regions; image features) ([0057]) from a final fully connected layer of a pre-trained convolutional network (from a pre-trained convolutional neural network (CNN)) ([0057]).  

Regarding claim 4, Gu teaches wherein training (the model being trained by recursively applying equations 2-5 to predict the word/sentence) (pages 2-3; Section 3.1 and Figure 1) the multimodal image classifier (the model including a multimodal fusion layer that inputs into the recurrent neural network (RNN)) (page 3; Figure 1 and p. 4; Section 3.3) comprises: concatenating (combining) (p. 4; Section 3.3) an N-dimensional textual feature vector (words representation y[t], which is a vector, extracted from CNNL) (p. 4; Section 3.3) with M-dimensional visual feature vector (and the image representation V, which is a vector, extracted from CNNI) (p. 4; Section 3.3) into a singular feature vector (combining the two into a single vector m[t]) (p. 4; Section 3.3); and providing, as input to the multi-modal classifier, the singular feature vector (inputting the single vector from the multimodal fusion layer that inputs into the recurrent neural network (RNN)) (page 3; Figure 1 and p. 4; Section 3.3).  
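For illustration only, the concatenation recited in claim 4 can be sketched as follows. This is a minimal sketch of joining an N-dimensional textual vector and an M-dimensional visual vector into a single vector; the function name `concatenate_features` and the toy values are hypothetical, and the sketch is not a reproduction of the fusion layer described in Gu.

```python
from typing import List


def concatenate_features(text_vec: List[float], image_vec: List[float]) -> List[float]:
    """Join an N-dimensional textual vector and an M-dimensional visual
    vector into one (N + M)-dimensional feature vector."""
    return list(text_vec) + list(image_vec)


# Hypothetical toy values: N = 3 textual features, M = 2 visual features.
text_features = [0.1, 0.7, 0.2]
image_features = [0.9, 0.4]

# The single fused vector is what would be provided as input to a classifier.
fused = concatenate_features(text_features, image_features)
print(len(fused))  # 5, i.e. N + M
```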

Regarding claim 5, Oura teaches wherein the textual generator model comprises a softmax layer that produces a probability distribution across each possible predicted phrase (including a softmax function layer to determine probability distribution over the words in the dictionary) (p. 3; right column, 1st paragraph). Gu also teaches wherein the textual generator model comprises a softmax layer that produces a probability distribution across each possible predicted phrase (wherein the softmax layer produces a probability for a word) (pages 2-3, Section 3.1, 2nd paragraph, p. 4; Section 3.4, 1st paragraph).  
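For illustration only, a softmax layer that produces a probability distribution across candidate phrases can be sketched as follows. The candidate phrases and raw scores are hypothetical, and the sketch shows the general softmax function rather than the specific layers of Oura or Gu.

```python
import math
from typing import Dict


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {phrase: math.exp(score - peak) for phrase, score in scores.items()}
    total = sum(exps.values())
    return {phrase: value / total for phrase, value in exps.items()}


# Hypothetical raw scores for three candidate phrases.
phrase_scores = {"dog": 2.0, "cat": 1.0, "car": 0.1}
probs = softmax(phrase_scores)

# The phrase with the highest raw score also receives the highest probability.
best_phrase = max(probs, key=probs.get)
```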

Regarding claim 7, see the rejection made to claim 1, as well as Jin for a system (processing system 104) (Fig. 1; [0027]), comprising: a data processing apparatus (one or more processing devices) ([0027]); a memory in data communication with the data processing apparatus and storing instructions (one or more computer-readable storage media 106 operable via the processing system 104 having processor executable instructions) (Fig. 1; [0027] and [0109]), for they teach all the limitations within this claim.
Regarding claim 8, see the rejection made to claim 2, as well as Jin for a system (processing system 104) (Fig. 1; [0027]), comprising: a data processing apparatus (one or more processing devices) ([0027]); a memory in data communication with the data processing apparatus and storing instructions (one or more computer-readable storage media 106 operable via the processing system 104 having processor executable instructions) (Fig. 1; [0027] and [0109]), for they teach all the limitations within this claim.
Regarding claim 9, see the rejection made to claim 3, as well as Jin for a system (processing system 104) (Fig. 1; [0027]), comprising: a data processing apparatus (one or more processing devices) ([0027]); a memory in data communication with the data processing apparatus and storing instructions (one or more computer-readable storage media 106 operable via the processing system 104 having processor executable instructions) (Fig. 1; [0027] and [0109]), for they teach all the limitations within this claim.
Regarding claim 10, see the rejection made to claim 4, as well as Jin for a system (processing system 104) (Fig. 1; [0027]), comprising: a data processing apparatus (one or more processing devices) ([0027]); a memory in data communication with the data processing apparatus and storing instructions (one or more computer-readable storage media 106 operable via the processing system 104 having processor executable instructions) (Fig. 1; [0027] and [0109]), for they teach all the limitations within this claim.
Regarding claim 11, see the rejection made to claim 5, as well as Jin for a system (processing system 104) (Fig. 1; [0027]), comprising: a data processing apparatus (one or more processing devices) ([0027]); a memory in data communication with the data processing apparatus and storing instructions (one or more computer-readable storage media 106 operable via the processing system 104 having processor executable instructions) (Fig. 1; [0027] and [0109]), for they teach all the limitations within this claim.

Regarding claim 12, see the rejection made to claim 1, as well as Jin for one or more non-transitory computer storage media storing (computer-readable storage media) ([0027] and [0110]) instructions (storing instructions) ([0114]) that when executed by one or more computers cause the one or more computers to perform operations (executed by the computing device) ([0027-0028]), for they teach all the limitations within this claim.
Regarding claim 13, see the rejection made to claim 2, as well as Jin for one or more non-transitory computer storage media storing (computer-readable storage media) ([0027] and [0110]) instructions (storing instructions) ([0114]) that when executed by one or more computers cause the one or more computers to perform operations (executed by the computing device) ([0027-0028]), for they teach all the limitations within this claim.
Regarding claim 14, see the rejection made to claim 3, as well as Jin for one or more non-transitory computer storage media storing (computer-readable storage media) ([0027] and [0110]) instructions (storing instructions) ([0114]) that when executed by one or more computers cause the one or more computers to perform operations (executed by the computing device) ([0027-0028]), for they teach all the limitations within this claim.
Regarding claim 15, see the rejection made to claim 4, as well as Jin for one or more non-transitory computer storage media storing (computer-readable storage media) ([0027] and [0110]) instructions (storing instructions) ([0114]) that when executed by one or more computers cause the one or more computers to perform operations (executed by the computing device) ([0027-0028]), for they teach all the limitations within this claim.

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Oura et al., “Multimodal Deep Neural Network with Image Sequence Features for Video Captioning” (Oura) in view of Gu et al., “An Empirical Study of Language CNN for Image Captioning” (Gu) and Jin et al., US 2017/0206435 A1 (Jin), and further in view of Devlin et al., “Exploring Nearest Neighbor Approaches for Image Captioning” (Devlin).
Regarding claim 6, Oura teaches wherein the textual generator model (NeuralTalk2, which uses a multimodal RNN model) (p. 2; right column, 1st paragraph and Fig. 2) obtains a set of phrases for a given image (assigning text fragments to image fragments which can be the 19 most important regions that it detects as well as the whole image; which includes words) (p. 2; right column, 1st paragraph and Fig. 2). Gu teaches obtaining a set of phrases for a given image (outputting a label to the image to classify the image as what it depicts) (pages 2-3; Section 3.1, Figure 1 and pages 7-8; Section 4.4, Figure 3). Jin teaches obtaining a set of phrases for a given image (annotating image regions with text labels which describe/classify the image) ([0004], [0024], [0037-0038], and [0059]).
However, none of the above prior art references explicitly teaches wherein the textual generator model obtains a set of phrases for a given image “using of a nearest-neighbor process”.
Devlin teaches a variety of nearest neighbor baseline approaches for image captioning (p. 1; Abstract); and wherein the textual generator model (language models, recurrent neural networks, and LSTMs and their ability to generate novel captions) (p. 1; Section 1, 1st paragraph) obtains a set of phrases for a given image using a nearest-neighbor process (using a nearest-neighbor process for image captioning by borrowing a caption from a set of nearest neighbor images) (p. 1; Abstract, Section 1, 3rd and 4th paragraphs, and p. 2; Sections 3.1 and 3.2).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of prior art references to include using a nearest-neighbor process since the simple NN approach can outperform many novel caption generation approaches (Devlin; p. 1, Section 1, 5th paragraph) while also being effective at finding images from which high-scoring captions may be borrowed (Devlin; p. 1, Section 1, 5th paragraph).
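For illustration only, borrowing a caption from a nearest-neighbor image can be sketched as follows. This is a minimal sketch assuming each image is already represented by a feature vector and that similarity is measured by cosine similarity; the function names, feature values, and captions are hypothetical, and Devlin may select among candidate captions differently.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def borrow_caption(query: Sequence[float],
                   corpus: List[Tuple[Sequence[float], str]]) -> str:
    """Return the caption of the stored image most similar to the query image."""
    _, caption = max(corpus, key=lambda item: cosine_similarity(query, item[0]))
    return caption


# Hypothetical captioned images, each reduced to a 2-dimensional feature vector.
captioned_images = [
    ([1.0, 0.0], "a dog running on grass"),
    ([0.0, 1.0], "a car parked on a street"),
]

# The query vector lies closest to the first stored image.
borrowed = borrow_caption([0.9, 0.1], captioned_images)
```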

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.

Contact
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL J VANCHY JR whose telephone number is (571)270-1193. The examiner can normally be reached Monday - Friday 9am - 5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Emily Terrell, can be reached at (571) 270-3717. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MICHAEL J VANCHY JR/
Primary Examiner, Art Unit 2666
Michael.Vanchy@uspto.gov