DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The thirteen information disclosure statements (IDS) submitted on 10/29/2020 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Election/Restriction
Restriction to one of the following inventions is required under 35 U.S.C. 121:
I. Claims 1-7 and 14-20, drawn to determining a description for an image, classified in G06N 3/0454.
II. Claims 8-13, drawn to selecting an image, classified in G06T 9/002.


The inventions are independent or distinct, each from the other because:
Inventions I and II are directed to related processes. The related inventions are distinct if: (1) the inventions as claimed are either not capable of use together or can have a materially different design, mode of operation, function, or effect; (2) the inventions do not overlap in scope, i.e., are mutually exclusive; and (3) the inventions as claimed are not obvious variants. See MPEP § 806.05(j). In the instant case, Invention I generates a descriptive sentence based on an image, whereas Invention II is directed to selecting an image with a probability value above a threshold. Furthermore, the inventions as claimed do not encompass overlapping subject matter and there is nothing of record to show them to be obvious variants.
Restriction for examination purposes as indicated is proper because all the inventions listed in this action are independent or distinct for the reasons given above and there would be a serious search and/or examination burden if restriction were not required because one or more of the following reasons apply:
The inventions are distinct and would require searching in separate areas.
Applicant is advised that the reply to this requirement to be complete must include (i) an election of an invention to be examined even though the requirement may be traversed (37 CFR 1.143) and (ii) identification of the claims encompassing the elected invention. 
The election of an invention may be made with or without traverse. To reserve a right to petition, the election must be made with traverse. If the reply does not distinctly and specifically point out supposed errors in the restriction requirement, the election shall be treated as an election without traverse. Traversal must be presented at the time of election in order to be considered timely. Failure to timely traverse the requirement will result in the loss of right to petition under 37 CFR 1.144. If claims are added after the election, applicant must indicate which of these claims are readable upon the elected invention.
Should applicant traverse on the ground that the inventions are not patentably distinct, applicant should submit evidence or identify such evidence now of record showing the inventions to be obvious variants or clearly admit on the record that this is the case. In either instance, if the examiner finds one of the inventions unpatentable over the prior art, the evidence or admission may be used in a rejection under 35 U.S.C. 103 or pre-AIA  35 U.S.C. 103(a) of the other invention.
During a telephone conversation with Bradley Baugh on 06/17/2022, a provisional election was made without traverse to prosecute Invention I, claims 1-7 and 14-20. Affirmation of this election must be made by applicant in replying to this Office action. Claims 8-13 are withdrawn from further consideration by the examiner under 37 CFR 1.142(b), as being drawn to a non-elected invention.

Drawings
The drawings submitted on 08/19/2019 are deemed acceptable for examination.
Allowable Subject Matter
Claim 5 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter: The closest prior art, Mao et al. (Mao, Junhua, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. "Explain images with multimodal recurrent neural networks." arXiv preprint arXiv:1410.1090 (2014)), teaches a multimodal recurrent neural network for sentence retrieval and generation based on an image. However, the prior art does not teach, alone or in combination, each and every claimed limitation, including “wherein the at least one word embedding component also receives as an input an image representation of the input image from the convolution neural network layer component”. Therefore, the claim distinguishes over the prior art.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-4, 6-7, and 14-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Mao et al. (Mao, Junhua, Wei Xu, Yi Yang, Jiang Wang, and Alan L. Yuille. "Explain images with multimodal recurrent neural networks." arXiv preprint arXiv:1410.1090 (2014)), hereinafter Mao.

Regarding claim 1, Mao teaches a computer-implemented method for generating a sentence-level description of an input image (p. 1, “the task of generating novel sentences descriptions for images”), the method comprising:
	inputting the input image into a multimodal recurrent neural network (m-RNN) (p. 3, Fig. 2 “The input of our model is an image”), the m-RNN comprising:
	a convolution neural network layer component that generates an image representation of the input image (p. 3, Fig. 2 “Deep CNN”);
	at least one word embedding component that embeds a word into a word representation (p. 3, Fig. 2 and “Embedding 1”);
	a recurrent layer component that maps a recurrent layer activation of a prior time frame into a same vector space as a word representation at a current time frame and combines them (p. 4, ¶1, “Instead of concatenating the word representation at time t (denoted as w(t)) and the recurrent layer activation at time t−1 (denoted as r(t−1)), we first map r(t−1) into the same vector space as w(t)”);
	a multimodal component that is distinct from the recurrent layer component and that receives a first input from the recurrent layer component and a second input from the convolution neural network layer component and combines them (p. 3, Fig. 2b and p. 4, ¶3 “After the recurrent layer, we set up a 512 dimensional multimodal layer that connect the language model part and the image part of the m-RNN model (see Figure 2(b))”);
	a softmax layer component that uses an output of the multimodal component to generate a probability distribution of a next word in the sentence-level description (p. 4, ¶5 “our m-RNN model has a softmax layer that will generate the probability distribution of the next word”); and

	outputting the sentence-level description of the input image (p. 5 “We can use the trained m-RNN model for three tasks: 1) Sentences generation”).
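
For illustration only, the multimodal and softmax components mapped above can be sketched as follows. This is a minimal sketch and not the implementation of either the claims or Mao: the parameter names (`V_w`, `V_r`, `V_i`, `W_out`), the element-wise sum, and the plain `tanh` activation are assumptions chosen to keep the example short. It fuses three inputs; the word-representation input corresponds to the optional third input of claim 2.

```python
import math


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multimodal_step(word_vec, recurrent_vec, image_vec, params):
    """Fuse the three inputs, then score every vocabulary word.

    params is a hypothetical dict of projection matrices: V_w, V_r and V_i
    project each input into a shared multimodal space, and W_out maps the
    multimodal activation to one score per vocabulary word.
    """
    projected = [
        matvec(params["V_w"], word_vec),
        matvec(params["V_r"], recurrent_vec),
        matvec(params["V_i"], image_vec),
    ]
    # Assumption: element-wise sum of the projections followed by tanh.
    fused = [math.tanh(sum(parts)) for parts in zip(*projected)]
    # Probability distribution of the next word.
    return softmax(matvec(params["W_out"], fused))
```

The returned list has one probability per vocabulary entry and sums to 1, which is the form of output that the softmax limitation of claim 1 recites.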
	
		 
Regarding claim 2, Mao teaches the computer-implemented method of Claim 1 wherein the multimodal component further receives a third input from the at least one word embedding component (p. 3 “The two word embedding layers embed the one-hot input into a dense word representation”).

Regarding claim 3, Mao teaches the computer-implemented method of Claim 2 wherein the at least one word embedding layer comprises two word embedding layers, in which an output of a first word embedding layer is provided as an input to a second word embedding layer and an output of the second word embedding layer is provided as the third input to the multimodal component (p. 3, Fig. 2 and p. 3 “The two word embedding layers embed the one-hot input into a dense word representation”).

[Image: media_image1.png]


Regarding claim 4, Mao teaches the computer-implemented method of Claim 2 wherein the at least one word embedding layer comprises two word embedding layers, in which an output of the first word embedding layer is provided as:
	an input to a second word embedding layer, which provides its output to the recurrent layer component; and
	the third input to the multimodal component (Fig. 2; [Image: media_image1.png]).

Regarding claim 5, Mao does not teach the computer-implemented method of Claim 1 wherein the at least one word embedding component also receives as an input an image representation of the input image from the convolution neural network layer component. Claim 5 is therefore objected to rather than rejected, as set forth under Allowable Subject Matter.

Regarding claim 6, Mao teaches the computer-implemented method of Claim 1 further comprising:
determining whether an end sign has been generated as a next word based upon the probability distribution; responsive to the next word not being the end sign: adding the next word to the sentence-level description; and for a next time frame: setting the next word as the current word; and
inputting the current word into the at least one word embedding component of the m-RNN to obtain a next word; and returning to the step of determining whether an end sign has been generated as a next word; and responsive to the next word being the end sign, outputting the sentence-level description (p. 5 “Then we can sample from this probability distribution to pick the next word. In practice, we find that selecting the word with the maximum probability performs slightly better than sampling. After that, we input the picked word to the model and continue the process until the model outputs the end sign “##END##””).
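
The generation loop described in the passage cited for claim 6 can be sketched as below. This is a hedged illustration, not Mao's code: `next_word_distribution` is a hypothetical callable standing in for the trained m-RNN, the start token `"##START##"` and the `max_words` safety cap are assumptions, and only the end sign `"##END##"` is taken from the quoted text.

```python
END_SIGN = "##END##"


def generate_sentence(next_word_distribution, start_word="##START##", max_words=30):
    """Greedy decoding: feed each picked word back in until the end sign appears.

    next_word_distribution is a hypothetical callable taking the current word
    and the words generated so far, and returning a dict that maps each
    candidate next word to its probability.
    """
    sentence = []
    current = start_word
    for _ in range(max_words):  # assumed cap so a bad model cannot loop forever
        distribution = next_word_distribution(current, sentence)
        # Pick the maximum-probability word instead of sampling.
        next_word = max(distribution, key=distribution.get)
        if next_word == END_SIGN:
            break
        sentence.append(next_word)
        current = next_word  # the picked word becomes the next input
    return sentence
```

Choosing the maximum-probability word mirrors the quoted observation that it "performs slightly better than sampling"; replacing the `max` call with a random draw would give the sampling variant.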

Regarding claim 7, Mao teaches the computer-implemented method of Claim 1 wherein the step of combining the recurrent layer activation of a prior time frame with a word representation at a current time frame comprises using a rectified linear unit function (p. 4, [Image: media_image2.png]).
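
As a rough sketch of the recurrent update at issue in claim 7: the previous activation r(t−1) is mapped into the word-representation space and combined with w(t) through a rectified linear unit. The cited equation is not reproduced in this action, so the element-wise addition and the matrix name `U_r` below are assumptions made for illustration.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def recurrent_update(prev_activation, word_vec, U_r):
    """Compute r(t) = ReLU(U_r * r(t-1) + w(t)).

    U_r is a hypothetical row-major matrix that maps r(t-1) into the same
    vector space as the word representation w(t).
    """
    mapped = [sum(u * r for u, r in zip(row, prev_activation)) for row in U_r]
    # Assumption: the mapped activation and w(t) are combined by addition.
    return relu([m + w for m, w in zip(mapped, word_vec)])
```

For example, with `U_r = [[1.0, 0.0], [0.0, -1.0]]`, `prev_activation = [1.0, 2.0]` and `word_vec = [0.5, 0.5]`, the pre-activation is `[1.5, -1.5]` and the function returns `[1.5, 0.0]`.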


Regarding claim 14, Mao teaches a computer-implemented method for retrieving a caption related to an input image, the method comprising:
	for each candidate caption from a set of candidate captions, using a multimodal recurrent neural network (m-RNN) to obtain a probability value of generating the candidate caption given the input image, the m-RNN comprising:
	a convolution neural network layer component that generates an image representation of the input image (p. 3 Fig. 2 “Deep CNN”);
	at least one word embedding component that encodes a word or words from the candidate caption into a word representation or representations (p. 3, Fig. 2 and “Embedding 1”);
	a recurrent layer component that maps a recurrent layer activation of a prior time frame into a same vector space as a word representation at a current time frame and combines them (p. 4, ¶1, “Instead of concatenating the word representation at time t (denoted as w(t)) and the recurrent layer activation at time t−1 (denoted as r(t−1)), we first map r(t−1) into the same vector space as w(t)”);
	a multimodal component that is distinct from the recurrent layer component and that receives a first input from the recurrent layer component and a second input from the convolution neural network layer component and combines them (p. 3, Fig. 2b and p. 4, ¶3 “After the recurrent layer, we set up a 512 dimensional multimodal layer that connect the language model part and the image part of the m-RNN model (see Figure 2(b))”); and
	a softmax layer component that uses an output of the multimodal component to generate one or more probability distributions of words in the candidate caption related to the input image, the one or more probability distributions being used to obtain the probability value for the candidate caption (p. 4, ¶5 “our m-RNN model has a softmax layer that will generate the probability distribution of the next word”); and
	selecting one or more captions that have a probability value above a threshold level (p. 5 “we use the normalized probability for each sentence: [Image: media_image3.png] [Image: media_image4.png]”).


Regarding claim 15, Mao teaches the computer-implemented method of Claim 14 wherein the probability value represents a normalized probability obtained using a normalization factor that represents a marginal probability of the candidate caption (p. 5 “we use the normalized probability for each sentence: [Image: media_image3.png] [Image: media_image4.png]”).

Regarding claim 16, Mao teaches the computer-implemented method of Claim 15 in which the normalized probability is approximately equivalent to a probability obtaining the input image in an image retrieval given the candidate caption (p. 5 “we use the normalized probability for each sentence: [Image: media_image3.png] [Image: media_image4.png]”).
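
The equivalence recited in claim 16 can be checked with Bayes' rule. The symbols below are assumptions of this sketch rather than notation recovered from the cited images: I denotes the input image and w the candidate caption.

```latex
\frac{P(w \mid I)}{P(w)} \;=\; \frac{P(I \mid w)\,P(w)}{P(I)\,P(w)} \;=\; \frac{P(I \mid w)}{P(I)}
```

For a fixed input image, P(I) is the same constant for every candidate caption, so ordering captions by the normalized probability P(w | I) / P(w) gives the same order as P(I | w), the probability of obtaining the input image given the caption.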


Regarding claim 17, Mao teaches the computer-implemented method of Claim 14 further comprising the step of:
	ranking the candidate captions from the set of candidate captions based upon each candidate caption's respective probability value (p. 5 “For the image retrieval task, we rank the images based on their perplexity with the query sentence and output the top ranked ones… Instead of looking at the perplexity or the probability of generating the sentences given the query image, we use the normalized probability for each sentence”).
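
The retrieval steps addressed for claims 14, 15, and 17 (normalizing, ranking, and selecting above a threshold) might be sketched as follows. This is an illustrative sketch under stated assumptions: the marginal probability of a caption is estimated as the mean of P(caption | image) over a sample of images, which assumes a uniform image prior, and the function names and input layout are hypothetical.

```python
def normalized_probability(caption_given_query, caption_given_samples):
    """Divide P(caption | query image) by an estimate of P(caption).

    Assumption: the marginal P(caption) is estimated as the mean of
    P(caption | image) over sampled images (uniform image prior).
    """
    marginal = sum(caption_given_samples) / len(caption_given_samples)
    return caption_given_query / marginal


def retrieve_captions(candidates, threshold):
    """Rank captions by normalized probability and keep those above threshold.

    candidates maps caption text to a pair:
    (P(caption | query image), [P(caption | sampled image), ...]).
    """
    scored = [
        (normalized_probability(query_prob, sample_probs), caption)
        for caption, (query_prob, sample_probs) in candidates.items()
    ]
    scored.sort(reverse=True)  # highest normalized probability first
    return [caption for score, caption in scored if score > threshold]
```

Normalizing matters because a generic caption can have a high raw probability for every image. In a toy case where "a photo" scores 0.05 for all images and "a dog runs" scores 0.04 for the query but 0.01 elsewhere, the generic caption normalizes to about 1.0 while the specific one normalizes to about 2.0 and ranks first.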

Regarding claim 18, Mao teaches the computer-implemented method of Claim 14 wherein the multimodal component further receives a third input from the at least one word embedding component (p. 3 “The two word embedding layers embed the one-hot input into a dense word representation”).

Regarding claim 19, Mao teaches the computer-implemented method of Claim 18 wherein the at least one word embedding layer comprises two word embedding layers, in which an output of a first word embedding layer is provided as an input to a second word embedding layer and an output of the second word embedding layer is provided as the third input to the multimodal component (p. 3, Fig. 2 and p. 3 “The two word embedding layers embed the one-hot input into a dense word representation”; [Image: media_image1.png]).

Regarding claim 20, Mao teaches the computer-implemented method of Claim 14 wherein the first input from the recurrent layer component is related to one or more word representations and the second input from the convolution neural network component is related to the image representation of the input image (Fig. 2; [Image: media_image1.png]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Ma et al. (Ma, Lin, Zhengdong Lu, Lifeng Shang, and Hang Li. "Multimodal convolutional neural networks for matching image and sentence." In Proceedings of the IEEE international conference on computer vision, pp. 2623-2631. 2015) discloses using a multimodal network for generating descriptions of images.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHON G FOLEY whose telephone number is (469)295-9092. The examiner can normally be reached 10AM-6PM CT M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, James Lee, can be reached at (571) 270-5965. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SHON G FOLEY/Examiner, Art Unit 3668