DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 02/11/2022 has been entered.

Response to Arguments
Applicant’s arguments, see Remarks, filed 02/11/2022, with respect to the rejection of claims 1-20 under 35 U.S.C. 103 have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground of rejection is made in view of Lin and Dutta.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
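A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.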

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Lin (US PG Pub 20180260698) in view of Dutta (US PG Pub 20170139955).
	
	As per claims 1, 11 and 20, Lin discloses:
	A computer-implemented method, one or more non-transitory computer readable media, and a system, comprising:
	one or more memories storing instructions (Lin; Fig. 6, item 604 and 606; p. 0122-0123); and
	one or more processors that are coupled to the one or more memories and, when executing the instructions (Lin; Fig. 6, item 602; p. 0122-0123), are configured to:
	generate a first matched pair that specifies the first phrase and the first region, wherein one or more annotation operations are subsequently performed on the source image based on the first matched pair (Lin; p. 0018-0022 - The first LSTM can also provide, based on training, a respective attention for each image feature, and can generate an attention map associating skeletal words with respective locations of image features).
	Lin, however, fails to disclose extracting a first phrase and a second phrase from a source sentence and determining that the first phrase matches a first region of a source image based on an interrelationship between the first phrase and the second phrase.
	Dutta does teach extracting a first phrase and a second phrase from a source sentence (Dutta; p. 0036 – the text sentence is split 152 into a number of text sentence fragments) and determining that the first phrase matches a first region of a source image based on an interrelationship between the first phrase and the second phrase (Dutta; The image database is queried and determines 156 whether the semantic role of each text sentence fragment is captured by one image corresponding to each fragment… For unrepresented text sentence fragments, the fragments are split into smaller fragments that are identified with the verb of the fragment as well as the roles of the adjuncts associated with the verb. That is, for each verb in the sentence fragment, the actors, recipients (sometimes referred to as “patients” in the literature), and instruments are identified along with the causation, location, and direction adjuncts modifying the verb (interrelationships), all of which are then used to query the image database for an image associated with tags that correspond to the same or similar semantics of the sentence fragment).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method, non-transitory computer readable media, and system of Lin to include extracting a first phrase and a second phrase from a source sentence and determining that the first phrase matches a first region of a source image based on an interrelationship between the first phrase and the second phrase, as taught by Dutta, in order to provide for representing a text sentence as one or more images, in which each semantic role of the text sentence is represented by the one or more images (Dutta; p. 0001).
	As per claims 2 and 12, Lin in view of Dutta discloses: 	The computer-implemented method and system of claims 1 and 11, further comprising: determining that the first phrase matches a second region of the source image based on the [...] (Lin; p. 0018-0022 - The second LSTM can be trained to determine the attribute words based on a second set of ground truth phrases comprising words describing attributes in a second set of ground truth images); and generating a second matched pair that specifies the first phrase and the second region, wherein one or more annotation operations are subsequently performed on the source image based on the second matched pair (Lin; p. 0018-0022 – Finally, the skeletal words in the skeletal phrase and their respective attributes are merged to form an output phrase).

	As per claims 3 and 13, Lin in view of Dutta discloses:	The computer-implemented method and system of claims 1 and 11, further comprising: determining that the second phrase matches the first region based on the second phrase and the first matched pair (Lin; p. 0018-0022 - The second LSTM can be trained to determine the attribute words based on a second set of ground truth phrases comprising words describing attributes in a second set of ground truth images); and generating a second matched pair that specifies the second phrase and the first region, wherein one or more annotation operations are subsequently performed on the source image based on the second matched pair (Lin; p. 0018-0022 – Finally, the skeletal words in the skeletal phrase and their respective attributes are merged to form an output phrase).

	As per claims 4 and 14, Lin in view of Dutta discloses: 	The computer-implemented method and system of claims 1 and 11, wherein determining that the first phrase matches the first region comprises: generating a first plurality [...] (Lin; The first set of ground truth phrases includes words describing image features in a first set of ground truth images and relationships of the image features in the first set of ground truth images); and performing one or more comparison operations on the plurality of grounding decisions to determine that a first grounding decision included in the plurality of grounding decisions indicates that the first phrase matches the first region (Lin; p. 0018-0022 - the first LSTM is trained to receive the image data as input and provide a skeletal phrase that describes the objects and relationships of objects in the image, without describing the attributes of the object).
	As per claims 5 and 15, Lin in view of Dutta discloses: 	The computer-implemented method and system of claims 4 and 14, further comprising performing one or more machine learning operations on an untrained phrase grounding model to generate the trained phrase grounding model (Lin; p. 0033 - An LSTM is a type of recurrent feed-forward neural network architecture which can be trained to classify input data, such as to identify a word describing feature data. An LSTM is trained on a training data set, such as a training data set having images and known-accurate respective phrases describing each of the images. Thus, the LSTM can receive input feature data and provide words which describe, with a high level of probability, the salient features of the input feature data).

	As per claims 6 and 16, Lin in view of Dutta discloses:	The computer-implemented method and system of claims 4 and 14, further comprising: performing one or more pre-training operations on an untrained phrase encoder and an untrained visual encoder to generate a pre-trained phrase encoder and a pre-trained visual encoder (Lin; p. 0033 - An LSTM is a type of recurrent feed-forward neural network architecture which can be trained to classify input data, such as to identify a word describing feature data. An LSTM is trained on a training data set, such as a training data set having images and known-accurate respective phrases describing each of the images. Thus, the LSTM can receive input feature data and provide words which describe, with a high level of probability, the salient features of the input feature data; p. 0018 - The CNN is trained to extract the image features based on pixel values (e.g., color, grayscale value) of pixels within locations of the image); and performing one or more training operations on an untrained phrase grounding model that includes both the pre-trained phrase encoder and the pre-trained visual encoder to generate the trained phrase grounding model (Lin; p. 0033 - An LSTM is a type of recurrent feed-forward neural network architecture which can be trained to classify input data, such as to identify a word describing feature data. An LSTM is trained on a training data set, such as a training data set having images and known-accurate respective phrases describing each of the images. Thus, the LSTM can receive input feature data and provide words which describe, with a high level of probability, the salient features of the input feature data; p. 0018 - The CNN is trained to extract the image features based on pixel values (e.g., color, grayscale value) of pixels within locations of the image).

	As per claims 7 and 17, Lin in view of Dutta discloses: 	The computer-implemented method and system of claims 1 and 11, wherein determining that the first phrase matches the first region comprises: performing one or more object detection operations on the source image to generate a plurality of bounding boxes, wherein a first bounding box included in the plurality of bounding boxes defines the first region (Lin; p. 0065 - The feature maps are created by processing the image with a convolutional neural network (CNN) which is trained to extract the image features (e.g., data describing objects) based on pixel values (e.g., color, grayscale value) of pixels (bounding boxes) within locations of the image. The CNN produces feature maps from the extracted image features); and determining that the first phrase matches the first bounding box based on the first phrase, the at least the second phrase, and the sequence of bounding boxes (Lin; p. 0066 - The first LSTM neural network is trained to determine the skeletal phrase based on the first set of ground truth phrases. The first LSTM analyzes the feature maps for objects and relationships between the objects, and provides skeletal words describing objects in the image data. A combination of skeletal words forms a skeletal phrase).

	As per claims 8 and 18, Lin in view of Dutta discloses:	The computer-implemented method and system of claims 1 and 11, wherein determining that the first phrase matches the first region comprises applying a state of a recurrent neural network (RNN) to a neural network (NN) to generate a grounding decision (Lin; p. 0033 - An LSTM is a type of recurrent feed-forward neural network architecture which can be trained to classify input data, such as to identify a word describing feature data. An LSTM is trained on a training data set, such as a training data set having images and known-accurate respective phrases describing each of the images. Thus, the LSTM can receive input feature data and provide words which describe, with a high level of probability, the salient features of the input feature data).

	As per claims 9 and 19, Lin in view of Dutta discloses:	The computer-implemented method and system of claims 1 and 11, wherein determining that the first phrase matches the first region comprises applying a first state of a first recurrent neural network (RNN) and a first state of a second RNN to a first neural network (NN) to generate a grounding decision (Lin; p. 0016-0021 - The second LSTM provides, for each word in the skeletal phrase, associated attributes which further describe the respective skeletal word. The inputs to the second LSTM can include the feature maps and information from the first LSTM, such as the skeletal words, hidden states of the first LSTM that identify potential skeletal words, and/or attention maps that identify the portions of the image having a high probability of having a significant feature).

	As per claim 10, Lin in view of Dutta discloses:	The computer-implemented method of claim 1, wherein determining that the first phrase matches the first region comprises applying a first state of a first recurrent neural network (RNN), a first state of a second RNN, and a first state of a bi-directional RNN to a first neural network (NN) to generate a grounding decision (Lin; p. 0016-0021 - The second LSTM provides, for each word in the skeletal phrase, associated attributes which further describe the respective skeletal word. The inputs to the second LSTM can include the feature maps and information from the first LSTM, such as the skeletal words, hidden states of the first LSTM that identify potential skeletal words, and/or attention maps that identify the portions of the image having a high probability of having a significant feature).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The prior art made of record and not relied upon includes:
  Saito (US PG Pub 20200042592) which provides for determination of correspondence between each of the phrases of the article body text and the images by calculating a correlation between the caption and each of the phrases of the article body text on a basis of the result of the morphological analysis performed by the morphological analysis unit (Saito; Abstract).
  Lee (US PG Pub 20200097604) which provides for concepts that relate to matching data of two different modalities using two stages of attention (Lee; Abstract).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Rodrigo A Chavez whose telephone number is (571)270-0139.  The examiner can normally be reached on Monday - Friday 9-6 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/RODRIGO A CHAVEZ/
Examiner, Art Unit 2658

/RICHEMOND DORVIL/
Supervisory Patent Examiner, Art Unit 2658