DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Allowable Subject Matter

Regarding claim 6, the prior art made of record neither anticipates nor renders obvious the combination of claimed elements as recited in claim 6. Therefore, claim 6 is objected to as being directly or indirectly dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claims 7-9 are objected to for their dependency on claim 6.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2 are rejected under 35 U.S.C. 103 as being unpatentable over Trott, Alexander Richard et al (PGPUB Document No. 20190130206), hereafter referred to as “Trott”, in view of Battach, Barak et al (PGPUB Document No. 20190266236), hereafter referred to as “Battach”.

Regarding claim 1, Trott teaches a method for text-image retrieval including text and image branches, said method comprising:
receiving as input a text query and an image (Trott, Figs. 1-2 disclose a method for querying an image with a text query, using an input image (element 110) and a query in text (element 120));
locating visual object candidates in the input image (Trott, Fig. 2 discloses that objects in an image are located by bounding boxes, element 240);
scoring correspondences between the entity embeddings and visual object candidates (Trott, paras. 0031-0032 disclose scoring that relates the question embedding to the objects in the image: “Question embedding q from language processing module 440 and object embeddings v from image processing module 420 are provided to a scorer 450. Scorer 450 generates a scoring vector s, which includes a score si∈  for each of the object embedding vi in object embeddings v.”);
providing, visualized in a bounding box (Trott, Fig. 2 discloses that objects in an image are located by bounding boxes, element 240), the object corresponding to the query text entity with the highest probability score, to a user of the system,
wherein no specific embedding or object feature extraction is used in the method (Trott, paras. 0054-0056 disclose that objects in the image are selected based on the scoring (logit values): “Each of the initial logit values corresponds to how well a corresponding object matches the criteria in the question. The initial logit values are generated based on a scoring vector, such as the scoring vector generated during process 850.”; here, the object embeddings are considered not to be extracted features).
Trott teaches identifying objects in an image but does not explicitly teach parsing the input text query into tokens and converting them to entity embedding vectors.
However, in the same field of endeavor of converting data into embedding vectors, Battach teaches parsing the input text query into tokens (Battach, para. 0020 discloses parsing and segmentation (tokenization) of text: “NLP model 140 is arranged to perform any number and variety of NLP operations (e.g., translation, speech recognition, text-to-speech, speech segmentation, summarization, coreference resolution, grammar induction, optical character recognition, word segmentation, sentence breaking, parsing, etc.) for lexicon”) and converting them to entity embedding vectors (Battach, Fig. 4 and paras. 0038-0039 disclose converting tokens into embedding vectors: “encoders 411 can receive an input vector (or matrix) of tokens and process the tokens to generate an output vector (or matrix) of tokens… the softmax layer of the classifiers 420 includes nodes for each of the possible outputs and is scaled based in the embedding size of the encoder”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the identification of objects in an image of Trott with the transformation of text into embedding vectors of Battach to produce the expected result of comparing the query text, in the form of embedding vectors, with objects in the image. The modification would be obvious because one of ordinary skill in the art would be motivated to convert the query text into a format that can reliably be compared with candidate objects in an image.
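For illustration only, the scoring and selection limitations of claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce the implementation of Trott, Battach, or the application: cosine similarity as the correspondence score, a softmax to obtain probability scores, and the names cosine, softmax, and best_box are all hypothetical choices made for the example.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors (assumed score function).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(xs):
    # Numerically stable softmax that turns raw scores into probabilities.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def best_box(query_embedding, candidates):
    # candidates: list of (bounding_box, object_embedding) pairs.
    # Returns the bounding box with the highest probability score and that score.
    scores = [cosine(query_embedding, emb) for _, emb in candidates]
    probs = softmax(scores)
    i = max(range(len(probs)), key=probs.__getitem__)
    return candidates[i][0], probs[i]

# Hypothetical data: two candidate boxes (x1, y1, x2, y2) with toy embeddings.
candidates = [
    ((10, 20, 110, 220), [0.9, 0.1, 0.0]),
    ((200, 40, 320, 180), [0.1, 0.8, 0.3]),
]
box, prob = best_box([1.0, 0.0, 0.0], candidates)
```

In this sketch the returned box is the one that would be drawn for the user; any other score function could be substituted without changing the selection step.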

Regarding claim 2, Trott and Battach teach all the limitations of claim 1, and Battach further teaches pre-training the text branch utilizing a BERT (Bidirectional Encoder Representations from Transformers) base model (Battach, para. 0021 discloses using a BERT model as the pre-trained model: “NLP model 140 can be a bidirectional encoder representations from transformers (BERT) model, embeddings from language models (ELMo) model, generative pre-training (GPT) model, or the like”).

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Trott, Alexander Richard et al (PGPUB Document No. 20190130206), hereafter referred to as “Trott”, in view of Battach, Barak et al (PGPUB Document No. 20190266236), hereafter referred to as “Battach”, in further view of Vo, Nhat et al (PGPUB Document No. 20200302168), hereafter referred to as “Vo”.

Regarding claim 3, Trott and Battach teach all the limitations of claim 2 but do not explicitly teach receiving, by the image branch, region of interest (RoI) features as input objects from an object detector.
However, in the same field of endeavor of identifying objects, Vo teaches receiving, by the image branch, region of interest (RoI) features as input objects from an object detector (Vo, para. 0110 discloses inputting regions of interest (ROI) for object detection: “Object Detection Networks 930 and 950 accept as input the identified region of interests and present as outputs probabilities of presence of a class of objects in the region of interest”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the identification of objects in an image of Trott and Battach with the inputting of regions of interest (ROI) for object detection of Vo to produce the expected result of searching for objects within a specified region. The modification would be obvious because one of ordinary skill in the art would be motivated to search for objects within a probable region instead of searching for objects in the whole image.

Claims 4-5 are rejected under 35 U.S.C. 103 as being unpatentable over Trott, Alexander Richard et al (PGPUB Document No. 20190130206), hereafter referred to as “Trott”, in view of Battach, Barak et al (PGPUB Document No. 20190266236), hereafter referred to as “Battach”, in further view of Vo, Nhat et al (PGPUB Document No. 20200302168), hereafter referred to as “Vo”, and in further view of Connell, Simon et al (PGPUB Document No. 20210293729), hereafter referred to as “Connell”.

Regarding claim 4, Trott, Battach and Vo teach all the limitations of claim 3 but do not explicitly teach training a two-layer multi-layer perceptron (MLP) to generate spatial embedding given absolute spatial information of the RoI location and size normalized to the entire image.
However, in the same field of endeavor of identifying objects, Connell teaches training a two-layer multi-layer perceptron (MLP) to generate spatial embedding given absolute spatial information of the RoI location and size normalized to the entire image (Connell, para. 0195 discloses training a multi-layer perceptron (MLP) classifier: “the MLP classifier with one hidden layer and approximately five perceptrons may be trained with classification data in the form of image feature”; Vo, para. 0110 further teaches using ROI for object detection in an image: “Object Detection Networks 930 and 950 accept as input the identified region of interests and present as outputs probabilities of presence of a class of objects in the region of interest”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the identification of objects in an image of Trott, Battach and Vo with the use of a multi-layer perceptron of Connell to produce the expected result of embedding query tokens. The modification would be obvious because one of ordinary skill in the art would be motivated to use an MLP to distinguish data that is not linearly separable (a feature of MLPs).
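For illustration only, the spatial-embedding limitation of claim 4 can be sketched as the forward pass of a two-layer MLP over box geometry normalized to the image size. This is a minimal sketch under stated assumptions, not the method of Connell, Vo, or the application: the six-value feature layout, the layer widths (16 and 8), the ReLU activation, and the randomly initialized (untrained) weights are all hypothetical, and the training step recited in the claim is omitted.

```python
import random

def normalized_box_features(box, image_w, image_h):
    # Box corners and size, each divided by the image width or height.
    x1, y1, x2, y2 = box
    return [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h,
            (x2 - x1) / image_w, (y2 - y1) / image_h]

def init_layer(n_in, n_out, rng):
    # Random weights and zero biases; a trained model would learn these.
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(x, weights, bias):
    # Fully connected layer: one output per weight row.
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def spatial_embedding(box, image_w, image_h, layer1, layer2):
    # Two-layer MLP: linear -> ReLU -> linear.
    feats = normalized_box_features(box, image_w, image_h)
    hidden = relu(linear(feats, *layer1))
    return linear(hidden, *layer2)

rng = random.Random(0)          # fixed seed so the sketch is repeatable
layer1 = init_layer(6, 16, rng)  # 6 geometry features -> 16 hidden units
layer2 = init_layer(16, 8, rng)  # 16 hidden units -> 8-dimensional embedding
embedding = spatial_embedding((10, 20, 110, 220), 640, 480, layer1, layer2)
```

Normalizing by the image dimensions keeps every input in the range 0 to 1, so the same network can be applied to images of different sizes.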

Regarding claim 5, Trott, Battach, Vo and Connell teach all the limitations of claim 4, and Vo further teaches adding, by both branches, positional and spatial embedding to tokens and RoIs respectively as input to a first interaction layer of the MLP (Vo, para. 0110 discloses inputting regions of interest (ROI) for object detection: “Object Detection Networks 930 and 950 accept as input the identified region of interests and present as outputs probabilities of presence of a class of objects in the region of interest”; and para. 0089 further discloses a multi-layer perceptron (MLP) as a processing layer: “The final layer one or more layers of a CNN may be a traditional multi-layer perceptron neural network that uses the high-level features extracted by the convolutional and pooling layers to produce outputs”).
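For illustration only, the adding step recited in claim 5 can be sketched as an element-wise sum applied separately in each branch. This is a minimal sketch under stated assumptions: the helper name add_embeddings and all vector values are hypothetical, and nothing here is taken from the cited references or the application.

```python
def add_embeddings(features, extras):
    # Element-wise sum of each feature vector with its matching extra embedding.
    if len(features) != len(extras):
        raise ValueError("one extra embedding is required per feature vector")
    summed = []
    for feat, extra in zip(features, extras):
        if len(feat) != len(extra):
            raise ValueError("embedding dimensions must match")
        summed.append([f + e for f, e in zip(feat, extra)])
    return summed

# Text branch: token embeddings plus positional embeddings (toy values).
token_inputs = add_embeddings([[0.5, 0.25], [1.0, 2.0]],
                              [[0.0, 1.0], [1.0, 0.0]])

# Image branch: RoI features plus spatial embeddings (toy values).
roi_inputs = add_embeddings([[0.5, 0.5]], [[0.25, 0.75]])
```

Both results would then be passed to the first interaction layer; the sum requires the added embedding to have the same dimension as the vector it is added to.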

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. 
US Publication No. 20120303562 discloses identifying objects using regions of interest (ROI) and multi-layer perceptrons.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDULLAH A DAUD, whose telephone number is (469)295-9283. The examiner can normally be reached Monday through Friday, 9:30 am to 6:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.  
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ashish Thomas, can be reached on 571-272-0631. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/ABDULLAH A DAUD/Examiner, Art Unit 2164   

/ASHISH THOMAS/Supervisory Patent Examiner, Art Unit 2164