DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 01/07/2021 has been entered.
 
Status of Claims
The present application is being examined under the claims filed on 12/07/2020.
Claims 1-7 and 9-20 are amended.
Claims 1-20 are rejected.
Claims 1-20 are pending.

Drawings
The Drawings filed on 12/30/2016 are acceptable for examination purposes.

Specification
The Specification filed on 12/30/2016 is acceptable for examination purposes.

Response to Arguments
In reference to Claim interpretation and Rejections under 35 USC § 112
The rejections under 35 USC § 112 have been withdrawn in view of amendments.

In reference to rejections under 35 USC § 101
Applicant asserts that “The Claims Do Not Recite An Abstract Idea That Falls Within the Enumerated Groupings of Abstract Ideas” see pg. 10 and pg. 11.
Examiner respectfully disagrees. The claim recites “determining a content item”, “determining a set of captions for the content item based at least in part on a machine learning sequence model”, and “generating at least one caption based on the machine learning sequence model, wherein the machine learning sequence model generates the at least one caption based on visual features associated with the content item and a partially entered caption for the content item”, which is directed to an abstract idea of a mental process; a claim that encompasses a human performing the step(s) mentally with the aid of a pen and paper recites a mental process. This judicial exception is not integrated into a practical application because the claim is directed to an abstract idea with additional generic computer elements; the generically recited computer elements do not add a meaningful limitation to the abstract idea because they amount to simply implementing the abstract idea on a computer. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because, when considered separately and in combination, they do not add significantly more (also known as an “inventive concept”) to the exception. The claim recites the additional limitation of “providing the set of captions as suggested captions for the content item”. The additional limitation is directed to receiving or transmitting data over a network; these are well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d). Examiner notes that during the interview on 12/02/2020 examiner suggested that amending the claims to include a training step would overcome the rejections under 35 U.S.C. 101.
Applicant's arguments filed 12/07/2020 have been fully considered but they are not persuasive.

Applicant asserts that “The Claims Integrate Any Alleged Abstract Idea Into a Practical Application” (see pgs. 12-14): “the combination of claim elements integrates the alleged abstract idea into a practical application because the combination provides improvements to a technical field, such as machine learning and prediction based on visual and textual content”.
Examiner respectfully disagrees. The claim is not directed to improving the technical field of machine learning and prediction based on visual and textual content. Examiner notes that the claim is directed to the improvement of the abstract idea of caption generation with the use of a computer as a tool to perform the abstract idea. Examiner notes that during the interview on 12/02/2020 examiner suggested that amending the claims to include a training step would overcome the rejections under 35 U.S.C. 101. Examiner notes that claims 6-9 are not rejected because the claims include the training step.
Applicant's arguments filed 12/07/2020 have been fully considered but they are not persuasive.

In reference to Claim interpretation and Rejections under 35 USC § 103
Applicant asserts that The Cited References Do Not Disclose: "generating at least one caption based on the machine learning sequence model, wherein the machine learning sequence model generates the at least one caption based on visual features associated with the content item and a partially entered caption for the content item".
Examiner respectfully disagrees. Upon further review, the references cited do disclose the proposed amendments. Examiner would like to clarify that during the interview, based on applicant’s description, examiner understood the limitation of “generating at least one caption based on the […]
Applicant's arguments filed 12/07/2020 have been fully considered but they are not persuasive.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


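Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim recites “determining […] a content item”, “determining […] a set of captions for the content item based at least in part on a machine learning sequence model”, and “generating […] at least one caption based on the machine learning sequence model, wherein the machine learning sequence model generates the at least one caption based on visual features associated with the content item and a partially entered caption for the content item”, which is directed to an abstract idea of a mental process; a claim that encompasses a human performing the step(s) mentally with the aid of a pen and paper recites a mental process. This judicial exception is not integrated into a practical application because the claim is directed to an abstract idea with additional generic computer elements; the generically recited computer elements do not add a meaningful limitation to the abstract idea because they amount to simply implementing the abstract idea on a computer. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because, when considered separately and in combination, they do not add significantly more (also known as an “inventive concept”) to the exception. The claim recites the additional limitation of “providing […] the set of captions as suggested captions for the content item”. The additional limitation is directed to receiving or transmitting data over a network; these are well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d).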
Claim 2 recites additional steps of “determining […] a set of features that describe subject matter captured in the content item” and “generating […] the set of captions based at least in part on provision of the set of features to the machine learning sequence model”. The additional steps do not amount to significantly more because they are directed to an abstract idea of a mental process; a claim that encompasses a human performing the step(s) mentally with the aid of a pen and paper recites a mental process.

Claim 4 recites an additional step of “filtering […] the set of captions based at least in part on a user-specific language model to restrict the set to captions that include terms or phrases determined to be preferred by the user”. The additional step does not amount to significantly more because it is directed to an abstract idea of a mental process; a claim that encompasses a human performing the step(s) mentally with the aid of a pen and paper recites a mental process.
Claim 5 recites an additional step of “determining […] that the user has selected one of the suggested captions in the set”. The additional step does not amount to significantly more because it is directed to an abstract idea of a mental process; a claim that encompasses a human performing the step(s) mentally with the aid of a pen and paper recites a mental process. The claim recites the additional limitation of “providing […] the content item for publication with the caption describing the content item, wherein at least a portion of the caption describing the content item includes the selected caption”, which is directed to receiving or transmitting data over a network; these are well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d).
Claim 10 recites the additional limitation that “a caption provided as a suggestion includes at least a predicted character, word, term, phrase, or sentence”. The additional limitation does not amount to significantly more because it is directed to receiving or transmitting data over a network; these are well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d).

Claim 11 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim recites “determining a content item”, “determining a set of captions for the content item based at least in part on a machine learning sequence model”, and “generating at least one caption based on the machine learning sequence model, wherein the machine learning sequence model generates the at least one caption based on visual features associated with the content item and a partially entered caption for the content item”, which is directed to an abstract idea of a mental process; a claim that encompasses a human performing the step(s) mentally with the aid of a pen and paper recites a mental process. This judicial exception is not integrated into a practical application because the claim is directed to an abstract idea with additional generic computer elements; the generically recited computer elements do not add a meaningful limitation to the abstract idea because they amount to simply implementing the abstract idea on a computer. The generic computer elements are the “at least one processor” and the “memory storing instructions that, when executed by the at least one processor”. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because, when considered separately and in combination, they do not add significantly more (also known as an “inventive concept”) to the exception. The claim recites the additional limitation of “providing the set of captions as suggested captions for the content item”. The additional limitation is directed to receiving or transmitting data over a network; these are well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d).
Examiner notes that during the interview on 12/02/2020 examiner suggested that amending the claims to include a training step would overcome the rejections under 35 U.S.C. 101. Examiner notes that claims 6-9 are not rejected because the claims include the training step.

Claim 13 recites additional steps of “determining a set of features that describe subject matter captured in the content item”, “determining user input that corresponds to the partially entered caption”, and “generating the set of captions based at least in part on provision of the set of features and the user input to the machine learning sequence model”. The additional steps do not amount to significantly more because they are directed to an abstract idea of a mental process; a claim that encompasses a human performing the step(s) mentally with the aid of a pen and paper recites a mental process.
Claim 14 recites an additional step of “filtering the set of captions based at least in part on a user-specific language model to restrict the set to captions that include terms or phrases determined to be preferred by the user”. The additional step does not amount to significantly more because it is directed to an abstract idea of a mental process; a claim that encompasses a human performing the step(s) mentally with the aid of a pen and paper recites a mental process.
Claim 15 recites an additional step of “determining that the user has selected one of the suggested captions in the set”. The additional step does not amount to significantly more because it is directed to an abstract idea of a mental process; a claim that encompasses a human performing the step(s) mentally with the aid of a pen and paper recites a mental process. The claim recites the additional limitation of “providing the content item for publication with the caption describing the content item, wherein at least a portion of the caption describing the content item includes the selected caption”, which is directed to receiving or transmitting data over a network; these are well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d).

Claim 16 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim recites “determining a content item”, “determining a set of captions for the content item based at least in part on a machine learning sequence model”, and “generating at least one caption based on the machine learning sequence model, wherein the machine learning sequence model generates the at least one caption based on visual features associated with the content item and a partially entered caption for the content item”, which is directed to an abstract idea of a mental process; a claim that encompasses a human performing the step(s) mentally with the aid of a pen and paper recites a mental process. This judicial exception is not integrated into a practical application because the claim is directed to an abstract idea with additional generic computer elements; the generically recited computer elements do not add a meaningful limitation to the abstract idea because they amount to simply implementing the abstract idea on a computer. The generic computer element is the “non-transitory computer-readable storage medium including instructions that, when executed by at least one processor of a computing system”. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception because, when considered separately and in combination, they do not add significantly more (also known as an “inventive concept”) to the exception. The claim recites the additional limitation of “providing the set of captions as suggested captions for the content item”. The additional limitation is directed to receiving or transmitting data over a network; these are well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d).
Examiner notes that during the interview on 12/02/2020 examiner suggested that amending the claims to include a training step would overcome the rejections under 35 U.S.C. 101. Examiner notes that claims 6-9 are not rejected because the claims include the training step.
Claim 17 recites additional steps of “determining a set of features that describe subject matter captured in the content item” and “generating the set of captions based at least in part on provision of the set of features to the machine learning sequence model”. The additional steps do not amount to significantly more because they are directed to an abstract idea of a mental process; a claim that encompasses a human performing the step(s) mentally with the aid of a pen and paper recites a mental process.
Claim 18 recites additional steps of “determining a set of features that describe subject matter captured in the content item”, “determining user input that corresponds to the partially entered caption”, and “generating the set of captions based at least in part on provision of the set of features and the user input to the machine learning sequence model”. The additional steps do not amount to significantly more because they are directed to an abstract idea of a mental process; a claim that encompasses a human performing the step(s) mentally with the aid of a pen and paper recites a mental process.
Claim 19 recites an additional step of “filtering the set of captions based at least in part on a user-specific language model to restrict the set to captions that include terms or phrases determined to be preferred by the user”. The additional step does not amount to significantly more because it is directed to an abstract idea of a mental process; a claim that encompasses a human performing the step(s) mentally with the aid of a pen and paper recites a mental process.
Claim 20 recites an additional step of “determining that the user has selected one of the suggested captions in the set”. The additional step does not amount to significantly more because it is directed to an abstract idea of a mental process; a claim that encompasses a human performing the step(s) mentally with the aid of a pen and paper recites a mental process. The claim recites the additional limitation of “providing the content item for publication with the caption describing the content item, wherein at least a portion of the caption describing the content item includes the selected caption”, which is directed to receiving or transmitting data over a network; these are well-understood, routine, conventional computer functions as recognized by the court decisions listed in MPEP § 2106.05(d).

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-4, 6, 7, 10-14, and 16-19 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Wang et al. (hereinafter Wang) US 20170200066 A1.
In reference to claim 1. Wang teaches a computer-implemented method comprising:
“determining, by a computing system, a content item” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], and ¶ [0038]-[0045] discloses the image (i.e. content item) for image captioning);
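“determining, by the computing system, a set of captions for the content item based at least in part on a machine learning sequence model” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “the RNN 406 outputs descriptions 408 in the form of captions, tags, sentences and other text that is associated with the image 316”),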
wherein the determining further comprises:
“generating, by the computing system, at least one caption based on the machine learning sequence model, wherein the machine learning sequence model generates the at least one caption based on visual features associated with the content item and a partially entered caption for the content item” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses the image (i.e. content item) for image captioning. Particularly ¶ [0038]-[0045] describe the “Image Captioning Framework” which “employs a machine learning approach to generate a captioned image […] the training data 302 includes images 304 and associated text 306, such as captions or metadata associated with the images 304 304. An extractor module 308 is then used to extract structured semantic knowledge 310, e.g., "<Subject,Attribute>, Image" and "<Subject,Predicate,Object>, Image", using natural language processing. Extraction may also include localization of the structured semantic to objects or regions within the image. Structured semantic knowledge 310 may be used to match images to data associated with visually similar images (e.g., captioning), and also to find images that match a particular caption of set of metadata (e.g., searching) […] The model 316 is trained to define a relationship (e.g., visual feature vector) between text features included in the structured semantic knowledge 310 with image features in the images. The image analysis model 202 is then used by a caption generator to process an input image 316 and generate a captioned image 318”);
“providing, by the computing system, the set of captions as suggested captions for the content item” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “the RNN 406 outputs descriptions 408 in the form of captions, tags, sentences and other text that is associated with the image 316”).

In reference to claim 2. Wang teaches the computer-implemented method of claim 1 (as mentioned above), wherein determining the set of captions further comprises:
Wang further discloses:
“determining, by the computing system, a set of features that describe subject matter captured in the content item” (Wang in at least Figs. 2-11 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], ¶ [0054], and ¶ [0075] discloses “the caption generator 130 may be configured to apply a semantic attention model to select different keywords for different nodes in the RNN based on context, as discussed in relation to FIGS. 9-11”);
“generating, by the computing system, the set of captions based at least in part on provision of the set of features to the machine learning sequence model” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “the RNN 406 outputs descriptions 408 in the form of captions, tags, sentences and other text that is associated with the image 316”).

In reference to claim 3. Wang teaches the computer-implemented method of claim 1 (as mentioned above), wherein determining the set of captions further comprises:
Wang further discloses:
“determining, by the computing system, a set of features that describe subject matter captured in the content item” (Wang in at least Figs. 2-11 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], ¶ [0054], and ¶ [0075] discloses “the caption generator 130 may be configured to apply a semantic attention model to select different keywords for different nodes in the RNN based on context, as discussed in relation to FIGS. 9-11”);
“determining, by the computing system, user input that corresponds to the partially entered caption” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses the image (i.e. content item) for image captioning. Particularly ¶ [0038]-[0045] describe the “Image Captioning Framework” which “employs a machine learning approach to generate a captioned image […] the training data 302 includes images 304 and associated text 306, such as captions or metadata associated with the images 304 304. An extractor module 308 is then used to extract structured semantic knowledge 310, e.g., "<Subject,Attribute>, Image" and "<Subject,Predicate,Object>, Image", using natural language processing. Extraction may also include localization of the structured semantic to objects or regions within the image. Structured semantic knowledge 310 may be used to match images to data associated with visually similar images (e.g., captioning), and also to find images that match a particular caption of set of metadata (e.g., searching) […] The model 316 is trained to define a relationship (e.g., visual feature vector) between text features included in the structured semantic knowledge 310 with image features in the images. The image analysis model 202 is then used by a caption generator to process an input image 316 and generate a captioned image 318”. Wang in at least ¶ [0047]-[0054] discloses “a filtered list of keywords derived from weakly annotated images is supplied to the RNN. The list may be generated by scoring and ranking the keyword collection according to relevance criteria, and selecting a number of top ranking keywords to include in the filtered list. The filtered list may be filtered based on frequency, probability scores, weight factors or other relevance criteria. In implementations, the entire collection of keywords may be supplied for use in the RNN (e.g., an unfiltered list)”.
Examiner notes that weakly annotated data is data that is typically associated with tags, descriptions, and other text data added by users; see at least ¶ [0047] “One or multiple sources may be relied upon for image captioning in different scenarios. Images uploaded to such sources are typically associated with tags, descriptions, and other text data added by users”. Examiner notes that the broadest reasonable interpretation for this limitation is that the machine learning sequence model has been trained with images (i.e. visual features) and captions (i.e. structured semantic knowledge; text describing the image); once trained, the model can be used to generate captions. Examiner notes that this interpretation is supported by the Instant Specification in ¶ [0024] “the model can be trained to suggest captions, or portions of a captions, for a content item, based, in part, on any portions of the caption that have been already been inputted by the user for the content item (e.g., one or more characters and/or one or more words) and also with respect to the visual features (e.g., image features) corresponding to the content item”. Examiner notes that the partially entered caption is provided prior to training the model);
“generating, by the computing system, the set of captions based at least in part on provision of the set of features and the user input to the machine learning sequence model” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “the RNN 406 outputs descriptions 408 in the form of captions, tags, sentences and other text that is associated with the image 316”).

In reference to claim 4. Wang teaches the computer-implemented method of claim 3 (as mentioned above), the method further comprising:
Wang further discloses:
“filtering, by the computing system, the set of captions based at least in part on a user-specific language model to restrict the set to captions that include terms or phrases determined to be preferred by the user” (Wang in at least Figs. 2-6 (see also corresponding sections), particularly ¶ [0047]-[0054] discloses “a filtered list of keywords derived from weakly annotated images is supplied to the RNN. The list may be generated by scoring and ranking the keyword collection according to relevance criteria, and selecting a number of top ranking keywords to include in the filtered list. The filtered list may be filtered based on frequency, probability scores, weight factors or other relevance criteria. In implementations, the entire collection of keywords may be supplied for use in the RNN (e.g., an unfiltered list)”. Examiner notes that weakly annotated data is data that is typically associated with tags, descriptions, and other text data added by users; see at least ¶ [0047] “One or multiple sources may be relied upon for image captioning in different scenarios. Images uploaded to such sources are typically associated with tags, descriptions, and other text data added by users”).

In reference to claim 6. Wang teaches the computer-implemented method of claim 1 (as mentioned above), further comprising:
Wang further discloses:
“generating, by the computing system, a set of training examples that each include at least a corresponding set of features that describe a given content item and a respective caption that was provided for the content item” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], and ¶ [0038]-[0045] discloses “Techniques that are used to train models in similar scenarios (e.g., image understanding problems) may rely on users to manually tag the images to form the training data 302. The model may also be trained using machine learning using techniques that are performable automatically and without user intervention.” and “A collection of training images used to train the image captioning framework (e.g., train the caption generator) may provide an additional or alternative source of weak supervision data 204. In this approach, the training data includes a database of images having corresponding captions used to train classifiers for the captioning model. The training image database may be relied upon as a source to discover related images that are similar to each other. Next, the captions for related images are aggregated as the weak supervised text for image captioning. When are target image is matched to a collection of related images, the captions for related images are relied upon as weak supervision data 204 for captioning of the target image”); and
“training, by the computing system, the machine learning sequence model based on the set of training examples” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], and ¶ [0038]-[0045] discloses “[…] The model 316 is trained to define a relationship (e.g., visual feature vector) between text features included in the structured semantic knowledge 310 with image features in the images. The image analysis model 202 is then used by a caption generator to process an input image 316 and generate a captioned image 318”. Examiner notes that the broadest reasonable interpretation for this limitation is that the machine learning sequence model has been trained with images (i.e. visual features) and captions (i.e. structured semantic knowledge; text describing the image); once trained, the model can be used to generate captions. Examiner notes that this interpretation is supported by the Instant Specification in ¶ [0024] “the model can be trained to suggest captions, or portions of a captions, for a content item, based, in part, on any portions of the caption that have been already been inputted by the user for the content item (e.g., one or more characters and/or one or more words) and also with respect to the visual features (e.g., image features) corresponding to the content item”. Examiner notes that the partially entered caption is provided prior to training the model).

In reference to claim 7. Wang teaches the computer-implemented method of claim 6 (as mentioned above), wherein:
Wang further discloses:
“a training example also includes metadata information for the content item referenced in the training example” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “The image analysis model 202 represents functionality to process image in various ways including but not limited to feature extraction, metadata parsing, patch analysis, object detection, and so forth” and “the training data 302 includes images 304 and associated text 306, such as captions or metadata associated with the images”).

In reference to claim 10. Wang teaches the computer-implemented method of claim 1 (as mentioned above), wherein:
Wang further discloses:
“a caption provided as a suggestion includes at least a predicted character, word, term, phrase, or sentence” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “the RNN 406 outputs descriptions 408 in the form of captions, tags, sentences and other text that is associated with the image 316”).

In reference to claim 11. Wang teaches a system comprising:
“at least one processor” (Wang in at least ¶ [0028] and ¶ [0091]-[0093]);
“a memory storing instructions that, when executed by the at least one processor” (Wang in at least ¶ [0028] and ¶ [0091]-[0093]),
cause the system to perform:
“determining a content item” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], and ¶ [0038]-[0045] discloses the image (i.e. content item) for image captioning);
“determining a set of captions for the content item based at least in part on a machine learning sequence model” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “the RNN 406 outputs descriptions 408 in the form of captions, tags, sentences and other text that is associated with the image 316”),
wherein the determining further comprises:
“generating at least one caption based on the machine learning sequence model, wherein the machine learning sequence model generates the at least one caption based on visual features associated with the content item and a partially entered caption for the content item” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses the image (i.e. content item) for image captioning. Particularly ¶ [0038]-[0045] describe the “Image Captioning Framework” which “employs a machine learning approach to generate a captioned image […] the training data 302 includes images 304 and associated text 306, such as captions or metadata associated with the images 304 304. An extractor module 308 is then used to extract structured semantic knowledge 310, e.g., "<Subject,Attribute>, Image" and "<Subject,Predicate,Object>, Image", using natural language processing. Extraction may also include localization of the structured semantic to objects or regions within the image. Structured semantic knowledge 310 may be used to match images to data associated with visually similar images (e.g., captioning), and also to find images that match a particular caption of set of metadata (e.g., searching) […] The model 316 is trained to define a relationship (e.g., visual feature vector) between text features included in the structured semantic knowledge 310 with image features in the images. The image analysis model 202 is then used by a caption generator to process an input image 316 and generate a captioned image 318”. Examiner notes that the broadest reasonable interpretation for this limitation is that the machine learning sequence model has been trained with images (i.e. visual features) and captions (i.e. structured semantic knowledge; text describing the image); once trained, the model can be used to generate captions.
Examiner notes that this interpretation is supported by the Instant Specification in ¶ [0024] “the model can be trained to suggest captions, or portions of a captions, for a content item, based, in part, on any portions of the caption that have been already been inputted by the user for the content item (e.g., one or more characters and/or one or more words) and also with respect to the visual features (e.g., image features) corresponding to the content item”. Examiner notes that the partially entered caption is provided prior to training the model); and
“providing the set of captions as suggested captions for the content item” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “the RNN 406 outputs descriptions 408 in the form of captions, tags, sentences and other text that is associated with the image 316”).

In reference to claim 12. Wang teaches the system of claim 11 (as mentioned above), wherein determining the set of captions further causes the system to perform:
Wang further discloses:
“determining a set of features that describe subject matter captured in the content item” (Wang in at least Figs. 2-11 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], ¶ [0054], and ¶ [0075] discloses “the caption generator 130 may be configured to apply a semantic attention model to select different keywords for different nodes in the RNN based on context, as discussed in relation to FIGS. 9-11”); and
“generating the set of captions based at least in part on provision of the set of features to the machine learning sequence model” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “the RNN 406 outputs descriptions 408 in the form of captions, tags, sentences and other text that is associated with the image 316”).

In reference to claim 13. Wang teaches the system of claim 11 (as mentioned above), wherein determining the set of captions further causes the system to perform:
Wang further discloses:
“determining a set of features that describe subject matter captured in the content item” (Wang in at least Figs. 2-11 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], ¶ [0054], and ¶ [0075] discloses “the caption generator 130 may be configured to apply a semantic attention model to select different keywords for different nodes in the RNN based on context, as discussed in relation to FIGS. 9-11”);
“determining user input that corresponds to the partially entered caption” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses the image (i.e. content item) for image captioning. Particularly, ¶ [0038]-[0045] describe the “Image Captioning Framework” which “employs a machine learning approach to generate a captioned image […] the training data 302 includes images 304 and associated text 306, such as captions or metadata associated with the images 304. An extractor module 308 is then used to extract structured semantic knowledge 310, e.g., "<Subject,Attribute>, Image" and "<Subject,Predicate,Object>, Image", using natural language processing. Extraction may also include localization of the structured semantic to objects or regions within the image. Structured semantic knowledge 310 may be used to match images to data associated with visually similar images (e.g., captioning), and also to find images that match a particular caption of set of metadata (e.g., searching) […] The model 316 is trained to define a relationship (e.g., visual feature vector) between text features included in the structured semantic knowledge 310 with image features in the images. The image analysis model 202 is then used by a caption generator to process an input image 316 and generate a captioned image 318”. Wang in at least ¶ [0047]-[0054] further discloses “a filtered list of keywords derived from weakly annotated images is supplied to the RNN. The list may be generated by scoring and ranking the keyword collection according to relevance criteria, and selecting a number of top ranking keywords to include in the filtered list. The filtered list may be filtered based on frequency, probability scores, weight factors or other relevance criteria. In implementations, the entire collection of keywords may be supplied for use in the RNN (e.g., an unfiltered list)”. 
Examiner notes that weakly annotated data is data that is typically associated with tags, descriptions, and other text data added by users; see at least ¶ [0047] “One or multiple sources may be relied upon for image captioning in different scenarios. Images uploaded to such sources are typically associated with tags, descriptions, and other text data added by users”. Examiner notes that the broadest reasonable interpretation for this limitation is that the machine learning sequence model has been trained with images (i.e. visual features) and captions (i.e. structured semantic knowledge; text describing the image); once trained, the model can be used to generate captions. Examiner notes that this interpretation is supported by the instant Specification in ¶ [0024] “the model can be trained to suggest captions, or portions of a captions, for a content item, based, in part, on any portions of the caption that have been already been inputted by the user for the content item (e.g., one or more characters and/or one or more words) and also with respect to the visual features (e.g., image features) corresponding to the content item”. Examiner notes that the partially entered caption is provided prior to training the model); and
“generating the set of captions based at least in part on provision of the set of features and the user input to the machine learning sequence model” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “the RNN 406 outputs descriptions 408 in the form of captions, tags, sentences and other text that is associated with the image 316”).

In reference to claim 14. Wang teaches the system of claim 13 (as mentioned above), wherein the system further performs:
Wang further discloses:
“filtering the set of captions based at least in part on a user-specific language model to restrict the set to captions that include terms or phrases determined to be preferred by the user” (Wang in at least Figs. 2-6 (see also corresponding sections), particularly ¶ [0047]-[0054] discloses “a filtered list of keywords derived from weakly annotated images is supplied to the RNN. The list may be generated by scoring and ranking the keyword collection according to relevance criteria, and selecting a number of top ranking keywords to include in the filtered list. The filtered list may be filtered based on frequency, probability scores, weight factors or other relevance criteria. In implementations, the entire collection of keywords may be supplied for use in the RNN (e.g., an unfiltered list)”. Examiner notes that weakly annotated data is data that is typically associated with tags, descriptions, and other text data added by users; see at least ¶ [0047] “One or multiple sources may be relied upon for image captioning in different scenarios. Images uploaded to such sources are typically associated with tags, descriptions, and other text data added by users”).

In reference to claim 16. Wang teaches a non-transitory computer-readable storage medium including instructions that, when executed by at least one processor of a computing system (Wang in at least ¶ [0028] and ¶ [0091]-[0093]), cause the computing system to perform a method comprising:
“determining a content item” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], and ¶ [0038]-[0045] discloses the image (i.e. content item) for image captioning);
“determining a set of captions for the content item based at least in part on a machine learning sequence model” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “the RNN 406 outputs descriptions 408 in the form of captions, tags, sentences and other text that is associated with the image 316”),
wherein the determining further comprises:
“generating at least one caption based on the machine learning sequence model, wherein the machine learning sequence model generates the at least one caption based on visual features associated with the content item and a partially entered caption for the content item” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses the image (i.e. content item) for image captioning. Particularly, ¶ [0038]-[0045] describe the “Image Captioning Framework” which “employs a machine learning approach to generate a captioned image […] the training data 302 includes images 304 and associated text 306, such as captions or metadata associated with the images 304. An extractor module 308 is then used to extract structured semantic knowledge 310, e.g., "<Subject,Attribute>, Image" and "<Subject,Predicate,Object>, Image", using natural language processing. Extraction may also include localization of the structured semantic to objects or regions within the image. Structured semantic knowledge 310 may be used to match images to data associated with visually similar images (e.g., captioning), and also to find images that match a particular caption of set of metadata (e.g., searching) […] The model 316 is trained to define a relationship (e.g., visual feature vector) between text features included in the structured semantic knowledge 310 with image features in the images. The image analysis model 202 is then used by a caption generator to process an input image 316 and generate a captioned image 318”. Examiner notes that the broadest reasonable interpretation for this limitation is that the machine learning sequence model has been trained with images (i.e. visual features) and captions (i.e. structured semantic knowledge; text describing the image); once trained, the model can be used to generate captions. Examiner notes that this interpretation is supported by the instant Specification in ¶ [0024] “the model can be trained to suggest captions, or portions of a captions, for a content item, based, in part, on any portions of the caption that have been already been inputted by the user for the content item (e.g., one or more characters and/or one or more words) and also with respect to the visual features (e.g., image features) corresponding to the content item”. Examiner notes that the partially entered caption is provided prior to training the model); and
“providing the set of captions as suggested captions for the content item” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “the RNN 406 outputs descriptions 408 in the form of captions, tags, sentences and other text that is associated with the image 316”).

In reference to claim 17. Wang teaches the non-transitory computer-readable storage medium of claim 16 (as mentioned above), wherein determining the set of captions further causes the computing system to perform:
Wang further discloses:
“determining a set of features that describe subject matter captured in the content item” (Wang in at least Figs. 2-11 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], ¶ [0054], and ¶ [0075] discloses “the caption generator 130 may be configured to apply a semantic attention model to select different keywords for different nodes in the RNN based on context, as discussed in relation to FIGS. 9-11”); and
“generating the set of captions based at least in part on provision of the set of features to the machine learning sequence model” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “the RNN 406 outputs descriptions 408 in the form of captions, tags, sentences and other text that is associated with the image 316”).

In reference to claim 18. Wang teaches the non-transitory computer-readable storage medium of claim 16 (as mentioned above), wherein determining the set of captions further causes the computing system to perform:
Wang further discloses:
“determining a set of features that describe subject matter captured in the content item” (Wang in at least Figs. 2-11 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], ¶ [0054], and ¶ [0075] discloses “the caption generator 130 may be configured to apply a semantic attention model to select different keywords for different nodes in the RNN based on context, as discussed in relation to FIGS. 9-11”);
“determining user input that corresponds to the partially entered caption” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses the image (i.e. content item) for image captioning. Particularly, ¶ [0038]-[0045] describe the “Image Captioning Framework” which “employs a machine learning approach to generate a captioned image […] the training data 302 includes images 304 and associated text 306, such as captions or metadata associated with the images 304. An extractor module 308 is then used to extract structured semantic knowledge 310, e.g., "<Subject,Attribute>, Image" and "<Subject,Predicate,Object>, Image", using natural language processing. Extraction may also include localization of the structured semantic to objects or regions within the image. Structured semantic knowledge 310 may be used to match images to data associated with visually similar images (e.g., captioning), and also to find images that match a particular caption of set of metadata (e.g., searching) […] The model 316 is trained to define a relationship (e.g., visual feature vector) between text features included in the structured semantic knowledge 310 with image features in the images. The image analysis model 202 is then used by a caption generator to process an input image 316 and generate a captioned image 318”. Wang in at least ¶ [0047]-[0054] further discloses “a filtered list of keywords derived from weakly annotated images is supplied to the RNN. The list may be generated by scoring and ranking the keyword collection according to relevance criteria, and selecting a number of top ranking keywords to include in the filtered list. The filtered list may be filtered based on frequency, probability scores, weight factors or other relevance criteria. In implementations, the entire collection of keywords may be supplied for use in the RNN (e.g., an unfiltered list)”. 
Examiner notes that weakly annotated data is data that is typically associated with tags, descriptions, and other text data added by users; see at least ¶ [0047] “One or multiple sources may be relied upon for image captioning in different scenarios. Images uploaded to such sources are typically associated with tags, descriptions, and other text data added by users”); and
“generating the set of captions based at least in part on provision of the set of features and the user input to the machine learning sequence model” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “the RNN 406 outputs descriptions 408 in the form of captions, tags, sentences and other text that is associated with the image 316”).

In reference to claim 19. Wang teaches the non-transitory computer-readable storage medium of claim 18 (as mentioned above), wherein the computing system further performs:
Wang further discloses:
“filtering the set of captions based at least in part on a user-specific language model to restrict the set to captions that include terms or phrases determined to be preferred by the user” (Wang in at least Figs. 2-6 (see also corresponding sections), particularly ¶ [0047]-[0054] discloses “a filtered list of keywords derived from weakly annotated images is supplied to the RNN. The list may be generated by scoring and ranking the keyword collection according to relevance criteria, and selecting a number of top ranking keywords to include in the filtered list. The filtered list may be filtered based on frequency, probability scores, weight factors or other relevance criteria. In implementations, the entire collection of keywords may be supplied for use in the RNN (e.g., an unfiltered list)”. Examiner notes that weakly annotated data is data that is typically associated with tags, descriptions, and other text data added by users; see at least ¶ [0047] “One or multiple sources may be relied upon for image captioning in different scenarios. Images uploaded to such sources are typically associated with tags, descriptions, and other text data added by users”).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5, 8, 9, 15, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (hereinafter Wang) US 20170200066 A1 in view of Valliani et al. (hereinafter Valliani) US 20170132821 A1.
In reference to claim 5. Wang teaches the computer-implemented method of claim 1 (as mentioned above), the method further comprising:
Wang does not explicitly disclose:
“determining, by the computing system, that the user has selected one of the suggested captions in the set”;
“providing, by the computing system, the content item for publication with the caption describing the content item, wherein at least a portion of the caption describing the content item includes the selected caption”.
However, Valliani discloses:
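“determining, by the computing system, that the user has selected one of the suggested captions in the set” (Valliani in at least Figs. 3-5, ¶ [0025], ¶ [0026], ¶ [0030]-[0032], ¶ [0035]-[0037], ¶ [0039], ¶ [0040], ¶ [0063] and ¶ [0093] “The user may adopt or edit the caption. The user can use a text editor to modify the caption prior to saving. If adopted, the caption can be associated with the image by forming an embedded overlay or as metadata associated with the image. The image, along with the overlay information, can then be communicated to one or more recipients designated by the user. For example, the user may choose to post the image and associated caption on one or more social networks. Alternatively, the user could communicate the image to a designated group of persons via text, email, or through some other communication mechanism. Finally, the user could choose to save the picture for later use in their photo album along with the associated caption”);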
“providing, by the computing system, the content item for publication with the caption describing the content item, wherein at least a portion of the caption describing the content item includes the selected caption” (Valliani in at least Figs. 3-5, ¶ [0025], ¶ [0026], ¶ [0030]-[0032], ¶ [0035]-[0037], ¶ [0039], ¶ [0040], and ¶ [0093] “The user may adopt or edit the caption. The user can use a text editor to modify the caption prior to saving. If adopted, the caption can be associated with the image by forming an embedded overlay or as metadata associated with the image. The image, along with the overlay information, can then be communicated to one or more recipients designated by the user. For example, the user may choose to post the image and associated caption on one or more social networks. Alternatively, the user could communicate the image to a designated group of persons via text, email, or through some other communication mechanism. Finally, the user could choose to save the picture for later use in their photo album along with the associated caption”).


In reference to claim 8. Wang teaches the computer-implemented method of claim 6 (as mentioned above), wherein generating the set of training examples further comprises:
Wang does not explicitly disclose:
“determining, by the computing system, at least one set of content items, wherein a content item included in the at least one set captures subject matter that has a threshold similarity to other content items in the set”;
However, Valliani discloses:
“determining, by the computing system, at least one set of content items, wherein a content item included in the at least one set captures subject matter that has a threshold similarity to other content items in the set” (Valliani in at least the sections previously mentioned, Fig. 5, ¶ [0031], and ¶ [0128] “A similarity analysis between a current picture and previously posted pictures could be used to help generate a caption” and “a threshold 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wang and Valliani. Wang teaches techniques for image captioning with word vector representations. Valliani teaches automatically generating captions for visual media, such as a photograph or video. One of ordinary skill would have been motivated to combine Wang and Valliani because MPEP § 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) "Obvious to try" choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.

Wang further discloses:
“determining, by the computing system, one or more captions that are associated with all content items included in the at least one set” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “the RNN 406 outputs descriptions 408 in the form of captions, tags, sentences and other text that is associated with the image 316”. Examiner notes that, under the broadest reasonable interpretation, the set could consist of a single content item);
“associating, by the computing system, each of the one or more captions with each content item included in the at least one set, wherein each training example references data describing a content item in the at least one set and a caption associated with the content item” (Wang in at least Figs. 2-6 (see also corresponding sections), ¶ [0032], ¶ [0033], ¶ [0038]-[0045], and ¶ [0054] discloses “[…] The model 316 is trained to define a relationship (e.g., visual feature vector) between text features included in the structured semantic knowledge 310 with image features in the images. The image analysis model 202 is then used by a caption generator to process an input image 316 and generate a captioned image 318”. Examiner notes that the broadest reasonable interpretation for this limitation is that the machine learning sequence model has been trained with images (i.e. visual features) and captions (i.e. structured semantic knowledge; text describing the image); once trained, the model can be used to generate captions. Examiner notes that this interpretation is supported by the instant Specification in ¶ [0024] “the model can be trained to suggest captions, or portions of a captions, for a content item, based, in part, on any portions of the caption that have been already been inputted by the user for the content item (e.g., one or more characters and/or one or more words) and also with respect to the visual features (e.g., image features) corresponding to the content item”. Examiner notes that the partially entered caption is provided prior to training the model).

In reference to claim 9. Wang and Valliani teach the computer-implemented method of claim 8 (as mentioned above), wherein determining the at least one set of content items further comprises:
Valliani further discloses:
“determining, by the computing system, a respective set of features that describe subject matter captured for each of a plurality of content items published through the computing system” (Valliani in at least Figs. 3-5, ¶ [0025], ¶ [0026], ¶ [0030]-[0032], ¶ [0035]-[0037], ¶ [0039], ¶ [0040], ¶ [0063] and ¶ [0093] discloses “The user may adopt or edit the caption. The user can use a text editor to modify the caption prior to saving. If adopted, the caption can be associated with the image by forming an embedded overlay or as metadata associated with the image. The image, along with the overlay information, can then be communicated to one or more recipients designated by the user. For example, the user may choose to post the image and associated caption on one or more social networks. Alternatively, the user could communicate the image to a designated group of persons via text, email, or through some other communication mechanism. Finally, the user could choose to save the picture for later use in their photo album along with the associated caption” and “[…] context features or variables associated with events and user-related activity, such as caption generation and media sharing. Contextual information may be determined from the user data of one or more users provided by user-data collection component […]”);

Wang further discloses:
“clustering, by the computing system, the plurality of content items into one or more clusters, wherein the content items are clustered based at least in part on their respective 

In reference to claim 15. Wang teaches the system of claim 11 (as mentioned above), wherein the system further performs:
Wang does not explicitly disclose:
“determining that the user has selected one of the suggested captions in the set”;
“providing the content item for publication with the caption describing the content item, wherein at least a portion of the caption describing the content item includes the selected caption”.
However, Valliani discloses:
“determining that the user has selected one of the suggested captions in the set” (Valliani in at least Figs. 3-5, ¶ [0025], ¶ [0026], ¶ [0030]-[0032], ¶ [0035]-[0037], ¶ [0039], ¶ [0040], ¶ [0063] and ¶ [0093] “The user may adopt or edit the caption. The user can use a text editor to modify the caption prior to saving. If adopted, the caption can be associated with the image by forming an embedded overlay or as metadata associated with the image. The image, along with the overlay information, can then be communicated to one or more recipients designated by the user. For example, the user may choose to post the image and associated caption on one or more social networks. Alternatively, the user could communicate the image to a designated group of persons via text, email, or through some other communication mechanism. Finally, the user could choose to save the picture for later use in their photo album along with the associated caption”);
“providing the content item for publication with the caption describing the content item, wherein at least a portion of the caption describing the content item includes the selected caption” (Valliani in at least Figs. 3-5, ¶ [0025], ¶ [0026], ¶ [0030]-[0032], ¶ [0035]-[0037], ¶ [0039], ¶ [0040], and ¶ [0093] “The user may adopt or edit the caption. The user can use a text editor to modify the caption prior to saving. If adopted, the caption can be associated with the image by forming an embedded overlay or as metadata associated with the image. The image, along with the overlay information, can then be communicated to one or more recipients designated by the user. For example, the user may choose to post the image and associated caption on one or more social networks. Alternatively, the user could communicate the image to a designated group of persons via text, email, or through some other communication mechanism. Finally, the user could choose to save the picture for later use in their photo album along with the associated caption”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wang and Valliani. Wang teaches techniques for image captioning with word vector representations. Valliani teaches automatically generating captions for visual media, such as a photograph or video. One of ordinary skill would have been motivated to combine Wang and Valliani because MPEP § 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) "Obvious to try" choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.

In reference to claim 20. Wang teaches the non-transitory computer-readable storage medium of claim 16 (as mentioned above), wherein the computing system further performs:
Wang does not explicitly disclose:
“determining that the user has selected one of the suggested captions in the set”;
“providing the content item for publication with the caption describing the content item, wherein at least a portion of the caption describing the content item includes the selected caption”.
However, Valliani discloses:
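“determining that the user has selected one of the suggested captions in the set” (Valliani in at least Figs. 3-5, ¶ [0025], ¶ [0026], ¶ [0030]-[0032], ¶ [0035]-[0037], ¶ [0039], ¶ [0040], ¶ [0063] and ¶ [0093] “The user may adopt or edit the caption. The user can use a text editor to modify the caption prior to saving. If adopted, the caption can be associated with the image by forming an embedded overlay or as metadata associated with the image. The image, along with the overlay information, can then be communicated to one or more recipients designated by the user. For example, the user may choose to post the image and associated caption on one or more social networks. Alternatively, the user could communicate the image to a designated group of persons via text, email, or through some other communication mechanism. Finally, the user could choose to save the picture for later use in their photo album along with the associated caption”);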
“providing the content item for publication with the caption describing the content item, wherein at least a portion of the caption describing the content item includes the selected caption” (Valliani in at least Figs. 3-5, ¶ [0025], ¶ [0026], ¶ [0030]-[0032], ¶ [0035]-[0037], ¶ [0039], ¶ [0040], and ¶ [0093] “The user may adopt or edit the caption. The user can use a text editor to modify the caption prior to saving. If adopted, the caption can be associated with the image by forming an embedded overlay or as metadata associated with the image. The image, along with the overlay information, can then be communicated to one or more recipients designated by the user. For example, the user may choose to post the image and associated caption on one or more social networks. Alternatively, the user could communicate the image to a designated group of persons via text, email, or through some other communication mechanism. Finally, the user could choose to save the picture for later use in their photo album along with the associated caption”).
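It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine Wang and Valliani. Wang teaches techniques for image captioning with word vector representations. Valliani teaches automatically generating captions for visual media, such as a photograph or video. One of ordinary skill would have been motivated to combine Wang and Valliani because MPEP § 2143 sets forth the Supreme Court rationales for obviousness including: (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results; (E) "Obvious to try" choosing from a finite number of identified, predictable solutions, with a reasonable expectation of success; (F) Known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art.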

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Viker A. Lamardo whose telephone number is (571)270-5871.  The examiner can normally be reached on Mon. - Fri. 9 AM - 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann J. Lo, can be reached on (571)272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 

/VIKER A LAMARDO/Examiner, Art Unit 2126