DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 05/03/2019, 01/31/2020, and 09/21/2021 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
The information disclosure statement filed 05/08/2020 has not been considered because it appears to have been filed for another application.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Stoop et al. (US 20180285700) in view of Lee et al. (US 20180336183).
As per claims 1, 11, 16, Stoop et al. teaches
a method performed by one or more data processing apparatus, the method comprising: generating a candidate set of training examples, wherein each training example comprises: (i) a search query comprising a sequence of one or more words, (ii) an image, and (iii) selection data characterizing how often users selected the image in response to the image being identified by a search result for the search query (para. 6: train a visual-concept recognition system and describe concepts appearing in visual-media items, rather than resorting to more processor- and labor-intensive efforts that may be required in training a system to classify concepts in visual-media items; para. 42: as a user is entering text to make a declaration, the typeahead feature may attempt to match the string of textual characters being entered in the declaration to strings of characters (e.g., names, descriptions); para. 58, 61: in these examples, the social-networking system may determine associated social-graph concepts by using a topic index to match the extracted text with keywords indexed with respective social-graph concepts, and may determine the vector representation based on these concepts; para. 78: the supervised training process may also use other metrics such as click-through rate to determine if an n-gram has been properly trained with respect to a visual concept. If querying users who submitted a search query including the n-gram "yeezy" frequently click on music videos or photos of the artist Kanye West, the social-networking system may determine that a current association of "yeezy" to a visual concept associated with Kanye West is correct).
selecting a plurality of training examples from the candidate set of training examples, based at least in part on the selection data of the training examples, for use in jointly training: (i) an image embedding model having a plurality of image embedding model parameters, and (ii) a text embedding model having a plurality of text embedding model parameters (para. 6: searching for visual-media items by using an image-recognition process to segment images of visual-media items and identify visual concepts therein and by then tying those visual concepts to text supplied by user communications, where the text is determined to be likely to describe those visual concepts. The described joint embedding model may be advantageous in that it allows the social-networking system to leverage what is effectively crowdsourced information from text associated with visual-media items (e.g., from communications, metadata, etc.) to determine associations between n-grams and visual-media items, and ultimately between n-grams and visual concepts; para. 8: the social-networking system may train these popular n-grams to their respective visual concepts using any suitable method such as the ones described here (e.g., by mapping these n-grams onto n-embeddings in the joint embedding model); para. 52, 62-63: engage in a training phase that makes use of one or more training techniques to determine the locations of n-embeddings and v-embeddings in the d-dimensional space, and the n-grams that are associated with visual concepts based on these locations; train the joint embedding model using a triplet loss algorithm, which may analyze a large number (e.g., thousands, millions) of information triplets; para. 67); and
using the training data to jointly train the image embedding model and the text embedding model, wherein the training comprises, for each selected training example: processing the image of the training example using the image embedding model to generate an embedding of the image; processing a representation of the search query of the training example using the text embedding model to generate an embedding of the search query (para. 62-63, 67: the social-networking system may make use of the joint embedding model to identify visual-media items to return as search results in response to a search query for visual-media items. In the joint embedding training model, the locations of n-embeddings may be used to identify visual-media items responsive to a search query, based on the locations of the v-embeddings corresponding to the visual-media items; para. 70-72: as an example, the n-gram "smartphone" and its associated visual concept may not have existed before the first smartphone was released, such that the requisite associations may not have yet been trained for…to strategically select the visual concepts and n-grams to train for; para. 78.)
determining a measure of similarity between the embedding of the image and the embedding of the search query (para. 6: searching for visual-media items by using an image-recognition process to segment images of visual-media items and identify visual concepts therein and by then tying those visual concepts to text supplied by user communications, where the text is determined to be likely to describe those visual concepts. The described joint embedding model may be advantageous in that it allows the social-networking system to leverage what is effectively crowdsourced information from text associated with visual-media items (e.g., from communications, metadata, etc.) to determine associations between n-grams and visual-media items, and ultimately between n-grams and visual concepts; para. 39-40, 57: the social-networking system may identify a shared visual concept in two or more visual-media items, identifying visual-media items with segments having greater than a threshold degree of similarity); and
	Stoop does not explicitly teach adjusting the image embedding model parameters.
	Lee teaches 
adjusting the image embedding model parameters and the text embedding model parameters based at least in part on the measure of similarity between the embedding of the image and the embedding of the search query (para. 18-19: represent language data objects, e.g., words, sentences, and images, as one or more values that machines can easily perform operations on, so that the machine can measure semantic similarity and utilize that semantic similarity to perform cognitive operations within the computing environment to assist human beings. This technique, known as "embedding"; para. 90: generating answers based on the corpus of data and the input question, verifying answers to a collection of questions for the corpus of data, correcting errors in digital text using a corpus of data, and selecting answers to questions from a pool of potential answers, i.e. candidate answers; para. 158: for images, the illustrative embodiments may adopt existing image embedding approaches, such as convolutional neural networks, and provide RGB pixels. For tables and knowledge base facts, the illustrative embodiments may adopt existing knowledge base embedding techniques that operate on a triple such as (row, column, value) or (entity 1, relation, entity 2); para. 168: based on the computed loss, parameters for the operation of the neural network may be modified so as to reduce the loss. Once the loss is below a threshold value, the parameter change is negligible, or the number of training iterations is above a threshold, the neural network is considered to have been trained. It should be appreciated that the particular loss computation and the parameters modified based on the loss may be implementation specific). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Stoop et al. and Lee in order to effectively improve the dimensionality correction or contextual similarity.


As per claims 2, 12, 17, Stoop teaches 
wherein generating the candidate set of training examples comprises processing data from a historical query log of a web search system (para. 31, 40: the search results may be personalized for the querying user based on, for example, social-graph information, user information, search or browsing history of the user, or other suitable information related to the user; para. 74, 82).  

As per claims 3, 13, 18, Stoop teaches
wherein the selection data for each training example indicates a fraction of times users selected the image of the training example in response to the image of the training example being identified by a search result for the search query of the training example (para. 58, 61: in these examples, the social-networking system may determine associated social-graph concepts by using a topic index to match the extracted text with keywords indexed with respective social-graph concepts, and may determine the vector representation based on these concepts; para. 78: the supervised training process may also use other metrics such as click-through rate to determine if an n-gram has been properly trained with respect to a visual concept. If querying users who submitted a search query including the n-gram "yeezy" frequently click on music videos or photos of the artist Kanye West, the social-networking system may determine that a current association of "yeezy" to a visual concept associated with Kanye West is correct).  
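
For illustration, the "fraction of times users selected the image" recited in claims 3, 13, and 18 can be read as a click-through rate of the kind mentioned in Stoop at para. 78. The following is a minimal Python sketch under that reading; the record type, field names, and counts are hypothetical and appear in neither the claims nor the cited references.

```python
from dataclasses import dataclass


@dataclass
class TrainingExample:
    """Hypothetical record for one (search query, image) pair."""
    query: str
    image_id: str
    impressions: int  # times the image was identified by a search result for the query
    selections: int   # times users selected (clicked) the image for the query


def selection_fraction(example: TrainingExample) -> float:
    """Fraction of impressions in which users selected the image."""
    if example.impressions == 0:
        return 0.0  # assumed convention: no impressions means no selection signal
    return example.selections / example.impressions
```

Under these assumptions, an image shown 200 times and selected 50 times for the query "yeezy" would have a selection fraction of 0.25.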

As per claims 4, 14, 19, Stoop teaches
wherein selecting a plurality of training examples from the candidate set of training examples comprises: selecting a plurality of training examples for which the image of the training example is most frequently selected by users in response to the image being identified by a search result for the search query of the training example (para. 51, 78: the supervised training process may also use other metrics such as click-through rate to determine if an n-gram has been properly trained with respect to a visual concept. If querying users who submitted a search query including the n-gram "yeezy" frequently click on music videos or photos of the artist Kanye West, the social-networking system may determine that a current association of "yeezy" to a visual concept associated with Kanye West is correct.)
Lee also teaches claim 4 at para. 88-90: generating answers based on the corpus of data and the input question, verifying answers to a collection of questions for the corpus of data, correcting errors in digital text using a corpus of data, and selecting answers to questions from a pool of potential answers, i.e. candidate answers; para. 141: to look for the exact term from an input question or synonyms to that term in the input question, e.g., the exact term or synonyms for the term "movie," and generate a score based on a frequency of use of these exact terms or synonyms.
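
The selection step of claims 4, 14, and 19 (keeping the examples whose images are most frequently selected) can be sketched as a simple ranking. This is an illustrative Python sketch only; the tuple layout, the function name, and the cutoff `k` are assumptions made for the example and are not taken from the claims or the references.

```python
def select_most_selected(candidates, k):
    """Return the k candidates with the highest selection fraction.

    `candidates` is a list of (query, image_id, selection_fraction) tuples;
    this layout is hypothetical.
    """
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    return ranked[:k]
```

A fixed cutoff is one possible design; a minimum selection-fraction threshold would be an equally plausible reading of "most frequently selected."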

As per claims 5, 15, 20, Stoop teaches
wherein the image embedding model and the text embedding model comprise one or more neural networks (fig. 5: training operation, neural network 520 and embedding operation; para. 56: the system may include three convolutional neural networks; para. 59: mapping a visual-media item to an embedding space, referencing FIG. 4, when the post 410 is first posted, the social-networking system 160 may map the visual-media item onto a vector using a deep-learning model (e.g., a convolutional neural network) based on information associated with the visual-media item).  

As per claim 6, Stoop teaches
wherein adjusting the image embedding model and the text embedding model comprises (para. 52: using an image-recognition process to segment images of visual-media items and identify visual concepts therein and by then tying those visual concepts to text supplied by user communications, where the text is determined to be likely to describe those visual concepts. The described joint embedding model may be advantageous in that it allows the social-networking system  to leverage what is effectively crowdsourced information from text associated with visual-media items (e.g., from communications, metadata, etc.) to determine associations between n-grams and visual-media items, and ultimately between n-grams and visual concepts).  
determining a gradient of a loss function that depends on the measure of similarity between the embedding of the image and the embedding of the search query (para. 55: a feature-detection algorithm may identify shapes by evaluating the pixels of an image for the presence of image-edges (e.g., sets of points in an image that have a strong gradient magnitude), corners (e.g., sets of points with low levels of curvature), blobs (relatively smooth areas), and/or ridges; para. 57: segments having greater than a threshold degree of similarity in their visual features may be determined to correspond to a depiction of a shared visual concept; para. 63: engage in a training phase that makes use of one or more training techniques to determine the locations of n-embeddings and v-embeddings in the d-dimensional space, and the n-grams that are associated with visual concepts based on these locations; train the joint embedding model using a triplet loss algorithm, which may analyze a large number (e.g., thousands, millions) of information triplets; the distance between the embedding for each positive n-gram and the embedding for the particular visual-media item may be less than the distance between the embedding for each negative n-gram and the embedding for the particular visual-media item; para. 67, 78).
	Stoop does not explicitly teach using the gradient to adjust the image embedding model parameters and the text embedding model parameters.
	Lee teaches 
adjusting the image embedding model parameters and the text embedding model parameters based at least in part on the measure of similarity between the embedding of the image and the embedding of the search query (para. 18-19: represent language data objects, e.g., words, sentences, and images, as one or more values that machines can easily perform operations on, so that the machine can measure semantic similarity and utilize that semantic similarity to perform cognitive operations within the computing environment to assist human beings. This technique, known as "embedding"; para. 90: generating answers based on the corpus of data and the input question, verifying answers to a collection of questions for the corpus of data, correcting errors in digital text using a corpus of data, and selecting answers to questions from a pool of potential answers, i.e. candidate answers; para. 158: for images, the illustrative embodiments may adopt existing image embedding approaches, such as convolutional neural networks, and provide RGB pixels. For tables and knowledge base facts, the illustrative embodiments may adopt existing knowledge base embedding techniques that operate on a triple such as (row, column, value) or (entity 1, relation, entity 2); para. 168: based on the computed loss, parameters for the operation of the neural network may be modified so as to reduce the loss. Once the loss is below a threshold value, the parameter change is negligible, or the number of training iterations is above a threshold, the neural network is considered to have been trained. It should be appreciated that the particular loss computation and the parameters modified based on the loss may be implementation specific). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Stoop et al. and Lee in order to effectively improve the dimensionality correction or contextual similarity.
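
To make the gradient-based adjustment of claim 6 concrete, the following Python sketch performs one gradient-descent step on two embedding models at once. It is a simplified illustration, not the method of the application or of either reference: both models are assumed to be plain linear maps (weight matrices `W_img` and `W_txt`), the loss is assumed to be the squared Euclidean distance between the two embeddings of a single matching pair, and the learning rate `lr` is hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def pair_loss(W_img, W_txt, image_vec, query_vec):
    """Squared Euclidean distance between the image and query embeddings."""
    e_img = matvec(W_img, image_vec)
    e_txt = matvec(W_txt, query_vec)
    return sum((a - b) ** 2 for a, b in zip(e_img, e_txt))


def gradient_step(W_img, W_txt, image_vec, query_vec, lr=0.05):
    """Adjust both parameter sets along the negative gradient of pair_loss."""
    e_img = matvec(W_img, image_vec)
    e_txt = matvec(W_txt, query_vec)
    diff = [a - b for a, b in zip(e_img, e_txt)]
    # d(loss)/dW_img[i][j] =  2 * diff[i] * image_vec[j]
    # d(loss)/dW_txt[i][j] = -2 * diff[i] * query_vec[j]
    new_img = [[w - lr * 2 * d * xj for w, xj in zip(row, image_vec)]
               for row, d in zip(W_img, diff)]
    new_txt = [[w + lr * 2 * d * tj for w, tj in zip(row, query_vec)]
               for row, d in zip(W_txt, diff)]
    return new_img, new_txt
```

In practice a neural network with a triplet or classification loss would replace the linear maps and the pairwise loss, and the gradient would normally be obtained by automatic differentiation rather than by hand.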

As per claim 7, Stoop teaches
wherein the loss function depends on the selection data of the training example (para. 58, 61-63: in these examples, the social-networking system may determine associated social-graph concepts by using a topic index to match the extracted text with keywords indexed with respective social-graph concepts, and may determine the vector representation based on these concepts,… train the joint embedding model using a triplet loss algorithm, which may analyze a large number (e.g., thousands, millions) of information triplets; para. 78: the supervised training process may also use other metrics such as click-through rate to determine if an n-gram has been properly trained with respect to a visual concept. If querying users who submitted a search query including the n-gram "yeezy" frequently click on music videos or photos of the artist Kanye West, the social-networking system may determine that a current association of "yeezy" to a visual concept associated with Kanye West is correct).  

As per claim 8, Stoop teaches
wherein the loss function is a classification loss function or a triplet loss function (para. 63: engage in a training phase that makes use of one or more training techniques to determine the locations of n-embeddings and v-embeddings in the d-dimensional space, and the n-grams that are associated with visual concepts based on these locations; train the joint embedding model using a triplet loss algorithm, which may analyze a large number (e.g., thousands, millions) of information triplets.)  
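
For reference, a triplet loss of the general kind named in claim 8 and in Stoop at para. 63 is commonly written as a hinge on the difference between two distances. The Python sketch below uses that common form; the `margin` parameter and its default value are assumptions, since the quoted passage states only that the positive distance should be less than the negative distance.

```python
def triplet_loss(dist_positive, dist_negative, margin=1.0):
    """Hinge-style triplet loss on two precomputed distances.

    dist_positive: distance from the visual-media embedding to a positive n-gram
    dist_negative: distance from the visual-media embedding to a negative n-gram
    The loss is zero once the positive is closer than the negative by `margin`.
    """
    return max(0.0, dist_positive - dist_negative + margin)
```

With these assumptions, a triplet whose positive distance is 1.0 and negative distance is 5.0 incurs no loss, while the reverse ordering is penalized.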

As per claim 9, Stoop teaches
wherein the embedding of the image has a same dimensionality as the embedding of the search query (para. 52, 62-63: by embedding both n-grams and visual media items in the same d-dimensional space, the social networking system creates what may be termed a "joint embedding model". The distance between the embedding for each positive n-gram and the embedding for the particular visual-media item may be less than the distance between the embedding for each negative n-gram and the embedding for the particular visual-media item; para. 68-70: the social-networking system may identify visual-media items responsive to the search query based on the location of the reconstructed embedding of the search query in the d-dimensional space with respect to the locations of the visual-media items in the d-dimensional space (e.g., based on proximity as determined by Euclidean distance calculations, based on cosine similarities of the respective vectors)).  

As per claim 10, Stoop teaches
wherein determining a measure of similarity between the embedding of the image and the embedding of the search query comprises: determining a Euclidean distance between the embedding of the image and the embedding of the search query (para. 57: segments having greater than a threshold degree of similarity in their visual features may be determined to correspond to a depiction of a shared visual concept; para. 63: the distance between the embedding for each positive n-gram and the embedding for the particular visual-media item may be less than the distance between the embedding for each negative n-gram and the embedding for the particular visual-media item; para. 65: the distance between the embedding for each positive n-gram and the embedding for the particular visual-media item may be less than the distance between the embedding for each negative n-gram and the embedding for the particular visual-media item; para. 68.)
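
The Euclidean distance of claim 10 presupposes that the two embeddings have the same dimensionality, as recited in claim 9. A minimal Python sketch follows; the function name and the choice to raise an error on mismatched lengths are illustrative assumptions.

```python
import math


def euclidean_distance(image_embedding, query_embedding):
    """Euclidean distance between two embeddings of equal dimensionality."""
    if len(image_embedding) != len(query_embedding):
        raise ValueError("embeddings must have the same dimensionality")
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(image_embedding, query_embedding)))
```

A smaller distance corresponds to a greater similarity between the image and the search query in the shared embedding space.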

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Rhoads et al. (US 20130273968 A1) teaches at para. 485-486: search history, browsing history; para. 569-571: in this initial training phase, the user may capture several images of the same visual sign, perhaps from different distances and perspectives. The feature extraction algorithm processes the collection to extract a feature set that captures shared similarities of all of the training images. Gottemukkula (US 20200143137) teaches at para. 6: the current values of the encoder neural network parameters are adjusted using the gradient of the loss function.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LINH BLACK whose telephone number is (571)272-4106. The examiner can normally be reached 9AM-5PM EST M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tony Mahmoudi, can be reached at 571-272-4078. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/LINH BLACK/Examiner, Art Unit 2163                                                                                                                                                                                                        



1/17/2022
/TONY MAHMOUDI/Supervisory Patent Examiner, Art Unit 2163