DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments, see Remarks, filed 06/27/2022, with respect to the rejection(s) of claim(s) 1-20 under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground of rejection is made in view of Lev-Tov and Lin. The examiner contends that the newly added subject matter of “executing a first encoder neural network that generates a first encoding of the first phrase and a second encoding of the second phrase” and “determining that the first phrase matches a first region of a source image based on the first encoding and the second encoding” has changed the scope of the claim and prompted the new ground of rejection.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Lev-Tov et al. (US Patent 10445431; hereinafter “Lev-Tov”) in view of Lin (US PG Pub 20180260698).
	
	As per claims 1, 11 and 20, Lev-Tov discloses:
	A computer implemented method, one or more non-transitory computer readable media and system, comprising:
	one or more memories storing instructions (Lev-Tov; Fig. 8, item 804; Col. 16, lines 11-43); and
	one or more processors that are coupled to the one or more memories and, when executing the instructions (Lev-Tov; Fig. 8, item 804; Col. 16, lines 11-43), are configured to:
	extract a first phrase and a second phrase from a source sentence (Lev-Tov; Fig. 6, items 602 and 606; Col. 13, lines 58-61 - receiving a first text string in a source language from a user, via a client device; Col. 14, lines 3-22 - identifying a second text string vector that is closer to the first text string vector than a pre-selected threshold in the embedded set, the second text string vector associated with a second text string in a target language);
	executing a first encoder neural network that generates a first encoding of the first phrase and a second encoding of the second phrase (Lev-Tov; Fig. 6, items 604 and 606; Col. 13, lines 62-67 - associating the first text string in the source language to a first text string vector in the source language; Col. 14, lines 3-22 - identifying a second text string vector that is closer to the first text string vector than a pre-selected threshold in the embedded set, the second text string vector associated with a second text string in a target language; also see Col. 7, lines 38-63 - NN 244 includes a neural network configured as a language model. Accordingly, a DNN language model as disclosed herein is trained to map variable length sentences (character strings) into fixed length query vectors in the embedded set (where the length is fixed to the pre-selected vector dimension of the embedded set));
	determine that the first phrase matches a first region of a source image based on the first encoding and the second encoding (Lev-Tov; Fig. 6, item 608; Col. 14, lines 23-42 - associating the first text string in the source language with the second text string in the target language, wherein the embedded set includes a first image vector for a first image associated with the first text string and a second image vector for a second image associated with the second text string in the target language. In some embodiments, the second image vector is the same as the first image vector);
	generate a first matched pair that specifies the first phrase and the first region (Lev-Tov; Fig. 6, item 608; Col. 14, lines 23-42 - associating the first text string in the source language with the second text string in the target language, wherein the embedded set includes a first image vector for a first image associated with the first text string and a second image vector for a second image associated with the second text string in the target language. In some embodiments, the second image vector is the same as the first image vector).
	Lev-Tov, however, fails to disclose wherein one or more annotation operations are subsequently performed on the source image based on the first matched pair.
	Lin does teach wherein one or more annotation operations are subsequently performed on the source image based on the first matched pair (Lin; p. 0026 - A fourth technique uses image-to-text embedding in a phrase-providing process. Image-to-text embedding modifies a CNN to produce feature maps that better represent the features of the image, thus improving accuracy of a provided phrase. In this technique, user-provided image tag data is retrieved from input image data. The tag data is likely to accurately identify at least one feature in the image, and thus can be accounted for by the CNN in determining features).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method, non-transitory computer readable media and system of Lev-Tov to include wherein one or more annotation operations are subsequently performed on the source image based on the first matched pair, as taught by Lin, in order to improve accuracy of a provided phrase (Lin; p. 0026).

	As per claims 2 and 12, Lev-Tov in view of Lin discloses:
	The computer implemented method and system of claims 1 and 11, further comprising: determining that the first phrase matches a second region of the source image based on the first phrase and the at least the second phrase; and generating a second matched pair that specifies the first phrase and the second region (Lev-Tov; Fig. 6, item 608; Col. 14, lines 23-42 - associating the first text string in the source language with the second text string in the target language, wherein the embedded set includes a first image vector for a first image associated with the first text string and a second image vector for a second image associated with the second text string in the target language. In some embodiments, the second image vector is the same as the first image vector).
	Lev-Tov, however, fails to disclose wherein one or more annotation operations are subsequently performed on the source image based on the second matched pair.
	Lin does teach wherein one or more annotation operations are subsequently performed on the source image based on the second matched pair (Lin; p. 0026 - A fourth technique uses image-to-text embedding in a phrase-providing process. Image-to-text embedding modifies a CNN to produce feature maps that better represent the features of the image, thus improving accuracy of a provided phrase. In this technique, user-provided image tag data is retrieved from input image data. The tag data is likely to accurately identify at least one feature in the image, and thus can be accounted for by the CNN in determining features).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method and non-transitory computer readable media of Lev-Tov to include wherein one or more annotation operations are subsequently performed on the source image based on the second matched pair, as taught by Lin, in order to improve accuracy of a provided phrase (Lin; p. 0026).

	As per claims 3 and 13, Lev-Tov in view of Lin discloses:
	The computer implemented method and system of claims 1 and 11, further comprising: determining that the second phrase matches the first region based on the second phrase and the first matched pair; and generating a second matched pair that specifies the second phrase and the first region (Lev-Tov; Fig. 6, item 608; Col. 14, lines 23-42 - associating the first text string in the source language with the second text string in the target language, wherein the embedded set includes a first image vector for a first image associated with the first text string and a second image vector for a second image associated with the second text string in the target language. In some embodiments, the second image vector is the same as the first image vector).
	Lev-Tov, however, fails to disclose wherein one or more annotation operations are subsequently performed on the source image based on the second matched pair.
	Lin does teach wherein one or more annotation operations are subsequently performed on the source image based on the second matched pair (Lin; p. 0026 - A fourth technique uses image-to-text embedding in a phrase-providing process. Image-to-text embedding modifies a CNN to produce feature maps that better represent the features of the image, thus improving accuracy of a provided phrase. In this technique, user-provided image tag data is retrieved from input image data. The tag data is likely to accurately identify at least one feature in the image, and thus can be accounted for by the CNN in determining features).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method and non-transitory computer readable media of Lev-Tov to include wherein one or more annotation operations are subsequently performed on the source image based on the second matched pair, as taught by Lin, in order to improve accuracy of a provided phrase (Lin; p. 0026).

	As per claims 4 and 14, Lev-Tov in view of Lin discloses: 	The computer-implemented method and system of claims 1 and 11, wherein determining that the first phrase matches the first region comprises: generating a first plurality of grounding decisions based on the source sentence, the source image, and a trained phrase grounding model that sequentially maps each phrase included in a sentence to a plurality of grounding decisions based on any unmapped phrases included in the sentence; and performing one or more comparison operations on the plurality of grounding decisions to determine that a first grounding decision included in the plurality of grounding decisions indicates that the first phrase matches the first region (Lev-Tov; Col. 12, lines 22-43 - Each cluster 440 may be associated with images (e.g., image 310) belonging in a class of images. For example, cluster 440 includes image vector 435-1. Further, each cluster 440 may be associated with a conceptual representation of the images in the cluster (e.g., image 435-1), included in text string vectors 435-2a, 435-2b, and 435-2c (collectively referred to, hereinafter, as “text string vectors 435-2”). The conceptual representation of images in cluster 440 may be expressed in multiple languages. For example, text string vector 435-2a may be associated with the conceptual representation of image vector 435-1, in English (e.g., “Apple”). Further, text string vector 435-2b may be associated with the conceptual representation of image vector 435-1 in German (e.g., “Apfel”). And text string vector 435-2c may be associated with the conceptual representation of image vector 435-1 in French (e.g., “Pomme”). In some embodiments, text string vectors 435-2 may include, in addition to values in the multiple dimensions (e.g., X_1 and X_2), an indicator to determine the language of text string associated with the text string vector).
	As per claims 5 and 15, Lev-Tov in view of Lin discloses: 	The computer-implemented method and system of claims 4 and 14, further comprising performing one or more machine learning operations on an untrained phrase grounding model to generate the trained phrase grounding model (Lev-Tov; Col. 7, lines 38-63 - In some embodiments, NN 244 includes a neural network configured as a language model. Accordingly, a DNN language model as disclosed herein is trained to map variable length sentences (character strings) into fixed length query vectors in the embedded set (where the length is fixed to the pre-selected vector dimension of the embedded set). The DNN language model is trained using a dataset of pairs, including an image from the image database and a text associated with the image (e.g., an image descriptor, or a comment posted by a user in an image file). For each image we generate its image embedding. The DNN language model may include a deep long short term memory (LSTM) network (also known as RNN) or a CNN and takes a variable length text (e.g., an input text string) in any language and maps it into a text string vector associated with the language. The text string vector associated with the language has the same dimensionality as the pre-selected vector dimension of the embedded set. The system trains the DNN language model, forming a text string vector (e.g., in the same manner as it would form a text string vector from a user input query in a search engine) and minimizing a distance in embedded set 230 between the text string vector and the image vector from the associated image. In some embodiments, the system further trains the DNN language model by maximizing a distance between the image vector from the associated image and text string vectors associated with other images).

	As per claims 6 and 16, Lev-Tov in view of Lin discloses:	The computer-implemented method and system of claims 4 and 14, further comprising: performing one or more pre-training operations on an untrained phrase encoder and an untrained visual encoder to generate a pre-trained phrase encoder and a pre-trained visual encoder; and performing one or more training operations on an untrained phrase grounding model that includes both the pre-trained phrase encoder and the pre-trained visual encoder to generate the trained phrase grounding model (Lev-Tov; Col. 7, lines 16-37 - In some embodiments, translation tool 242 is configured to execute commands and instructions from a neural network (NN) 244. NN 244 may include a language neural network (LNN), a deep neural network (DNN), or a convolutional neural network (CNN). In some embodiments, NN 242 may include a neural network configured as a vision model. In a DNN vision model as disclosed herein is trained as a feature extractor which maps variable sized images in the image database into image vectors in the embedded set, having a predetermined vector dimension. The DNN vision model is trained in a supervised manner in which the DNN vision model may be a classifier. In some embodiments, the DNN vision model could be trained purely unsupervised. The DNN vision model may also be trained using semi-supervised techniques in which each image has possibly multiple soft labels. Accordingly, the DNN vision model is trained to form an image vector in embedded set 230 by selecting a fixed-length subset of network activations such that there is a fixed mapping from images to the image vector in the embedded set (e.g., the fixed-length subset of network activations has a length equal to the pre-selected vector dimension of the embedded set)).

	As per claims 7 and 17, Lev-Tov in view of Lin discloses: 	The computer-implemented method and system of claims 1 and 11, wherein determining that the first phrase matches the first region comprises: performing one or more object detection operations on the source image to generate a plurality of bounding boxes, wherein a first bounding box included in the plurality of bounding boxes defines the first region; and determining that the first phrase matches the first bounding box based on the first phrase, the at least the second phrase, and the sequence of bounding boxes (Lev-Tov; Fig. 3, item 311; Col. 9, lines 55-61 - Image embedder 322 may include a domain-specific DNN classifier, which classifies image 310 into one of multiple classes. For example, image embedder 322 may select a feature 311 (e.g., red shiny skin) to follow through a CNN classification with stages 323-1 through 323-4 (collectively referred to hereinafter as stages 323) to obtain image vector 335-1. The CNN classification includes a number of classes that may be derived from prior image searches stored in interaction history 254 and may increase as image database 252 increases in size).

	As per claims 8 and 18, Lev-Tov in view of Lin discloses:	The computer-implemented method and system of claims 1 and 11, wherein determining that the first phrase matches the first region comprises applying a state of a recurrent neural network (RNN) to a neural network (NN) to generate a grounding decision (Lev-Tov; Col. 7, lines 38-63 - In some embodiments, NN 244 includes a neural network configured as a language model. Accordingly, a DNN language model as disclosed herein is trained to map variable length sentences (character strings) into fixed length query vectors in the embedded set (where the length is fixed to the pre-selected vector dimension of the embedded set). The DNN language model is trained using a dataset of pairs, including an image from the image database and a text associated with the image (e.g., an image descriptor, or a comment posted by a user in an image file). For each image we generate its image embedding. The DNN language model may include a deep long short term memory (LSTM) network (also known as RNN) or a CNN and takes a variable length text (e.g., an input text string) in any language and maps it into a text string vector associated with the language. The text string vector associated with the language has the same dimensionality as the pre-selected vector dimension of the embedded set. The system trains the DNN language model, forming a text string vector (e.g., in the same manner as it would form a text string vector from a user input query in a search engine) and minimizing a distance in embedded set 230 between the text string vector and the image vector from the associated image. In some embodiments, the system further trains the DNN language model by maximizing a distance between the image vector from the associated image and text string vectors associated with other images).

	As per claims 9 and 19, Lev-Tov in view of Lin discloses:	The computer-implemented method and system of claims 1 and 11, wherein determining that the first phrase matches the first region comprises applying a first state of a first recurrent neural network (RNN) and a first state of a second RNN to a first neural network (NN) to generate a grounding decision (Lev-Tov; Col. 7, lines 38-63 - In some embodiments, NN 244 includes a neural network configured as a language model. Accordingly, a DNN language model as disclosed herein is trained to map variable length sentences (character strings) into fixed length query vectors in the embedded set (where the length is fixed to the pre-selected vector dimension of the embedded set). The DNN language model is trained using a dataset of pairs, including an image from the image database and a text associated with the image (e.g., an image descriptor, or a comment posted by a user in an image file). For each image we generate its image embedding. The DNN language model may include a deep long short term memory (LSTM) network (also known as RNN) or a CNN and takes a variable length text (e.g., an input text string) in any language and maps it into a text string vector associated with the language. The text string vector associated with the language has the same dimensionality as the pre-selected vector dimension of the embedded set. The system trains the DNN language model, forming a text string vector (e.g., in the same manner as it would form a text string vector from a user input query in a search engine) and minimizing a distance in embedded set 230 between the text string vector and the image vector from the associated image. In some embodiments, the system further trains the DNN language model by maximizing a distance between the image vector from the associated image and text string vectors associated with other images).

	As per claim 10, Lev-Tov in view of Lin discloses:	The computer-implemented method of claim 1, wherein determining that the first phrase matches the first region comprises applying a first state of a first recurrent neural network (RNN), a first state of a second RNN, and a first state of a bi-directional RNN to a first neural network (NN) to generate a grounding decision (Lev-Tov; Col. 7, lines 38-63 - In some embodiments, NN 244 includes a neural network configured as a language model. Accordingly, a DNN language model as disclosed herein is trained to map variable length sentences (character strings) into fixed length query vectors in the embedded set (where the length is fixed to the pre-selected vector dimension of the embedded set). The DNN language model is trained using a dataset of pairs, including an image from the image database and a text associated with the image (e.g., an image descriptor, or a comment posted by a user in an image file). For each image we generate its image embedding. The DNN language model may include a deep long short term memory (LSTM) network (also known as RNN) or a CNN and takes a variable length text (e.g., an input text string) in any language and maps it into a text string vector associated with the language. The text string vector associated with the language has the same dimensionality as the pre-selected vector dimension of the embedded set. The system trains the DNN language model, forming a text string vector (e.g., in the same manner as it would form a text string vector from a user input query in a search engine) and minimizing a distance in embedded set 230 between the text string vector and the image vector from the associated image. In some embodiments, the system further trains the DNN language model by maximizing a distance between the image vector from the associated image and text string vectors associated with other images).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The prior art made of record and not relied upon includes:
  Saito (US PG Pub 20200042592) which provides for determination of correspondence between each of the phrases of the article body text and the images by calculating a correlation between the caption and each of the phrases of the article body text on a basis of the result of the morphological analysis performed by the morphological analysis unit (Saito; Abstract).
  Lee (US PG Pub 20200097604) which provides for concepts that relate to matching data of two different modalities using two stages of attention (Lee; Abstract).
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Rodrigo A Chavez whose telephone number is (571)270-0139.  The examiner can normally be reached on Monday - Friday 9-6 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/RODRIGO A CHAVEZ/Examiner, Art Unit 2658

/RICHEMOND DORVIL/Supervisory Patent Examiner, Art Unit 2658