DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments, filed with amendment on 08/29/2022, have been fully considered but they are not persuasive. Applicant argues that “[Image: media_image1.png]”;
The examiner disagrees. As shown below, Ranzinger clearly teaches training a machine learning algorithm (MLA) by using a set of training objects to categorize a new image by providing a set of training images for each object class to a convolutional neural network (CNN) and, for each object class, the CNN is trained to recognize an object in the region of an image (i.e. classification, which is equivalent to categorization) (see Ranzinger, steps 305 and 308 below).
[Image: media_image2.png]
	Li clearly teaches identifying the most representative image associated with each query cluster from the image search results by forming a visual media file set 345 that includes both query vector clusters 340-1 and 340-2; when a user submits a query 401, the system returns a set of search results 431 that are then selected by the user from the cluster 340-1; by clicking on or downloading at least one of the search results, the system forms the visual media file set 345 (see Li, FIG. 3, visual media file set 345 below).
[Image: media_image3.png]
	Therefore, the rejection of Ranzinger, in view of Li, is maintained in light of the amended limitations submitted by applicant. See Rejections below for additional details.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

Claims 1 and 17 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention. The claim limitation “the second set of image results containing images most representative of the query vector cluster” is indefinite because it is unclear what the metes and bounds are, that is, when an image is “most representative” of the query vector cluster and when it is not. Applicant cited para. [00135] of the specification as support for this amended limitation; the only support the examiner finds in this paragraph is that “for each query part of the query clusters, the most representative image search results with the query clusters, as selected by the users of the search engine server 210”. According to the specification, it appears that the only criterion for determining whether an image is “most representative” is whether the user selected the image. For purposes of examination, the Examiner is interpreting the claim limitation “most representative” as selected by a user.
Dependent claims 2-10 and 18-21 do not remedy the deficiencies of independent claims 1 and 17 respectively, and therefore are also rejected under 35 U.S.C. 112(b).  Corrections are requested.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 6, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent No.: 10,503,775 (Ranzinger et al.) (hereinafter Ranzinger) in view of U.S. Patent No.: 10,353,951 (Li).
Regarding claim 1, Ranzinger teaches a method for generating a set of training objects for a Machine Learning Algorithm (MLA), the MLA for categorization of images, the method executable at a server that executes the MLA, the method comprising: (Ranzinger, col. 6, lines 7-16: "the system may use an algorithm that detects the set of objects from a set of example images, referred to as training data; the disclosed system includes training of a series of computer-operated neural networks, such as a convolutional neural network, to teach the neural network to identify features of images mapping to certain object classes for identifying those images that are responsive to an input search query with a probability that a region of an image is deemed salient"; Ranzinger, col. 6, lines 25-29: “the disclosed system produces a set of vectors representing the object classifications for the corpus of images in the training data; the objects are learned from running the convolutional neural network over a corpus of the training data")
obtaining, from a search log, an indication of search queries having been executed in an image vertical search, each search query being associated with a first set of image search results (Ranzinger, col. 12, lines 64-67; col. 13, lines 1-8: "the process 300 begins by proceeding from start step to step 301 where the processor 236, using the behavioral data engine 245, obtains session logs associated with one or more users; subsequently, in step 302, the processor 236, using the behavioral data engine 245, extracts the most popular search queries from the obtained session logs; next, in step 303, for each extracted search query, the processor 236, using the image search engine 242, determines a set of images from an image collection (e.g., 252); subsequently, in step 304, each image in each set of images is decomposed into a set of saliency regions for the image");
generating a query vector for each of the search queries (Ranzinger, col. 15, lines 36-41: "next, in step 402, the processor 236 provides each specific query to a trained language model (e.g., the convolutional neural network 240); subsequently, in step 403, the processor 236, using the language model engine 244, obtains a query vector for each specific query of the user input from the trained language model"); and
training the MLA by using the set of training objects to categorize a new image (Ranzinger, col. 12, lines 59-67; col. 13, lines 1-12, 30-33, 50-56: “[Image: media_image4.png]”; “[Image: media_image5.png]”; “[Image: media_image6.png]”; “[Image: media_image7.png]”; the term “classify” is equivalent to the term “categorize”).
Ranzinger fails to teach
clustering the query vectors into a plurality of query vector clusters; 
for each of the query vector clusters, associating a second set of image search results, the second set of image search results including at least a portion of each first set of image search results associated with the query vectors that are part of each of the respective query vector clusters, the second set of image search results containing images most representative of the query vector cluster; and
generating a set of training objects by storing, for each of the query vector clusters, each image search result of the second set of image search results as a training object in the set of training objects, each image search result being associated with a cluster label, the cluster label being indicative of the query vector cluster the image search result is associated with.
Li teaches
clustering the query vectors into a plurality of query vector clusters (Li, col. 6, lines 5-7: "training vectors for each of the visual media files may be clustered into a number of clusters according to a clustering algorithm, for example, using k-means clustering");
for each of the query vector clusters, associating a second set of image search results, the second set of image search results including at least a portion of each first set of image search results associated with the query vectors that are part of each of the respective query vector clusters (Li, col. 8, lines 49-52: "multi-dimensional space 330 may include a visual media file vector space having visual media file vectors 331, 335a, 335b (hereinafter collectively referred to as “vectors 335”), and 332"; see FIG. 3, some vectors including the associated images from the entire collection are excluded from the clusters 340-1 and 340-2 (dots outside the clusters); [Image: media_image3.png]),
the second set of image search results containing images most representative of the query vector cluster (interpreted as noted in the 112(b) rejection above) (Li, col. 9, lines 37-61; col. 10, lines 26-48: “[Image: media_image8.png]”; see FIG. 3; “[Image: media_image9.png]”; see visual media file set 345 (FIG. 3), which is made up of query vectors 335A and 335B that have a subset of images from each cluster 340-1 and 340-2 that are selected by the user and are representative of the query vector clusters based on user search input on the vertical image search in the search engine);
generating a set of training objects by storing, for each of the query vector clusters, each image search result of the second set of image search results as a training object in the set of training objects, each image search result being associated with a cluster label, the cluster label being indicative of the query vector cluster the image search result is associated with (Li, col. 5, lines 51-67: "training database 248 may include multiple instances (or sets) of training data, where each instance (or set) of training data is associated with a particular style class; in some embodiments, training database 248 includes a label indicating the style class strength (e.g., very candid, somewhat candid, not candid, very “cute,” very “hideous,” and the like) as well as the visual media files; training database 248 also may include visual media vector information and image cluster information; the visual media vector information identifies training vectors representing a large sample of training visual media files, and annotated training database 250 includes respective semantic concepts for each visual media file in training database 248 (e.g., image or video caption and search queries); in this respect, the vectors corresponding to a semantic concept (e.g., ‘beach’) may be clustered into one cluster representing that semantic concept; moreover, the cluster may include at least one visual media file stored in visual media database 252").
It would have been obvious to a person having ordinary skill in the art before the time of the effective filing date of the claimed invention of the instant application to modify the method of generating a set of training objects for a Machine Learning Algorithm (MLA), as taught by Ranzinger, to include the steps of clustering the query vectors into a plurality of query vector clusters and associating a second set of image search results for each of the query vector clusters, the second set of image search results including at least a portion of each first set of image search results associated with the query vectors that are part of each of the respective query vector clusters, the second set of image search results containing images most representative of the query vector cluster, as taught by Li.
The suggestion/motivation for doing so would have been selecting and grouping certain queries and associated images from a user search via clustering algorithms, to more accurately label groups of images, which in turn creates stronger image training data to be input into a machine learning algorithm.
Therefore, it would have been obvious to combine Ranzinger with Li to obtain the invention as specified in claim 1.
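For illustration only, the claimed step of storing each image search result as a training object with a cluster label can be sketched as follows. The function name, the dictionary layout, and the example queries and file names are hypothetical and are not taken from the claims, Ranzinger, or Li.

```python
from typing import Dict, List, Tuple


def build_training_objects(
    cluster_of_query: Dict[str, int],
    results_of_query: Dict[str, List[str]],
) -> List[Tuple[str, int]]:
    """Pair every image result with the label of the cluster its query belongs to.

    Hypothetical data layout: cluster_of_query maps a query string to a
    cluster label, results_of_query maps a query string to image identifiers.
    """
    training_objects = []
    for query, cluster_label in cluster_of_query.items():
        for image_id in results_of_query.get(query, []):
            # Each stored training object carries the cluster label.
            training_objects.append((image_id, cluster_label))
    return training_objects


clusters = {"cat": 0, "kitten": 0, "car": 1}
results = {"cat": ["a.png"], "kitten": ["b.png"], "car": ["c.png"]}
objects = build_training_objects(clusters, results)
```

In this sketch the images returned for "cat" and "kitten" share label 0 because both queries fall in the same hypothetical cluster.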
Regarding claim 6, Ranzinger, in view of Li, teaches the method of claim 1, wherein the clustering is performed by using one of: a k-means clustering algorithm, an expectation maximization clustering algorithm, a farthest first clustering algorithm, a hierarchical clustering algorithm, a cobweb clustering algorithm and a density clustering algorithm (Li, col. 6, lines 5-26: "training vectors for each of the visual media files may be clustered into a number of clusters according to a clustering algorithm, for example, using k-means clustering; for example, the training vectors for video clips in database visual media database 252 can be assigned to clusters by the clustering algorithm based on a similarity threshold; the number of clusters can be manually selected, for example, so that visual media database 252 be divided into one-thousand (1000) clusters; training vectors for image files in visual media database 252 are associated with one of the clusters based on a similarity threshold using the clustering algorithm; the similarity threshold can indicate visual similarity, conceptual similarity, keyword similarity, or another measurement of similarity between the visual media files; other clustering algorithms may be used, including methods of vector quantization, or other clustering approaches such as affinity propagation clustering, agglomerative clustering, Birch clustering, density-based spatial clustering of applications with noise (DBSCAN), feature agglomeration, mini-batch k-means clustering, mean shift clustering using a flat kernel, or spectral clustering, among others").
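For illustration only, a minimal k-means sketch (Lloyd's algorithm) over small vectors is given below. It is a generic textbook formulation using only the Python standard library, not the implementation of Li or of the instant application; the deterministic initialization from the first k points and the sample points are assumptions made to keep the example reproducible.

```python
import math
from typing import List, Sequence


def kmeans(points: Sequence[Sequence[float]], k: int, iterations: int = 20) -> List[int]:
    """Return a cluster label for each point using Lloyd's algorithm."""
    # Assumption for this sketch: initialize centroids from the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


sample = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
labels = kmeans(sample, k=2)
```

The two points near the origin end up in one cluster and the two points near (5, 5) in the other.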
Regarding claim 17, Ranzinger teaches the same method as claim 1 as a system for generating a set of training objects for a Machine Learning Algorithm (MLA), the MLA for categorization of images, the system comprising: a processor and a non-transitory computer-readable medium comprising instructions (Ranzinger, col. 1, lines 56-60: “according to one embodiment of the present disclosure, a system is provided including one or more processors and a computer-readable storage medium coupled to the one or more processors, the computer-readable storage medium including instructions”). Regarding the remaining limitations of claim 17, the analysis in rejecting claim 1 is equally applicable to claim 17.
Claims 2-5 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Ranzinger, in view of Li, and further in view of “Query intent detection using convolutional neural networks," International Conference on Web Search and Data Mining, Workshop on Query Understanding. 2016 (Hashemi et al.) (hereinafter Hashemi).
Regarding claim 2, Ranzinger, in view of Li, teaches the method of claim 1.
Ranzinger, in view of Li, fails to teach
wherein generating the query vector comprises applying a word embedding algorithm to each search query.
Hashemi teaches 
wherein generating the query vector comprises applying a word embedding algorithm to each search query (Hashemi, page 2, Section 3.2.3 (Aggregated Word Vector Features): "aggregation of query word embeddings is another simple set of features; the goal is to find an embedding for a query and use it as a feature to train the intent classifier; instead of passing query word vectors through the convolutional neural network, we can simply get the word vectors of each query word and sum together (Sum w2V) or get their average (Average w2v); the resulting query embedding will have the same dimension of word vectors; in our experiment we use word2vec word embeddings").
It would have been obvious to a person having ordinary skill in the art before the time of the effective filing date of the claimed invention of the instant application to modify the step of generating a query vector for each of the search queries, as taught by Ranzinger, in view of Li, to include applying a word embedding algorithm to each search query, as taught by Hashemi.
The suggestion/motivation for doing so would have been converting search queries from words to mathematical vectors to encode meaning of words, so vector mathematics may be used to more clearly see what words in the search query are more similar in meaning.
Therefore, it would have been obvious to combine Ranzinger and Li with Hashemi to obtain the invention as specified in claim 2.
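For illustration only, the averaging of query word vectors described in the cited Hashemi passage ("Average w2v") can be sketched as follows. The two-dimensional embedding table is a hypothetical toy stand-in for a trained word2vec model, and the function name is an assumption.

```python
from typing import Dict, List


def average_query_vector(query: str, embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the word vectors of the query words that have an embedding."""
    words = [w for w in query.lower().split() if w in embeddings]
    if not words:
        raise ValueError("no query word has an embedding")
    dim = len(embeddings[words[0]])
    # The result has the same dimension as the individual word vectors.
    return [sum(embeddings[w][i] for w in words) / len(words) for i in range(dim)]


# Hypothetical toy embeddings; real word2vec vectors have far more dimensions.
toy_embeddings = {"red": [1.0, 0.0], "car": [0.0, 1.0]}
query_vector = average_query_vector("red car", toy_embeddings)
```

Replacing the division by `len(words)` with a plain sum would give the "Sum w2v" variant mentioned in the same passage.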
Regarding claim 3, Ranzinger, in view of Li, and in view of Hashemi, teaches the method of claim 2, wherein the method further comprises, prior to the associating the second set of image search results for each of the query vector clusters: for each of the first set of image search results, acquiring a respective set of metrics, each respective metric of the respective set of metrics being indicative of user interactions with a respective image search result in the first set of image search results (Ranzinger, col. 10, lines 13-44: "the behavioral data engine 245 may be a module executed by the processor 236 that is configured to monitor (and/or track) user interactions with the search results from the image search engine 242; at runtime, the behavioral data engine 245 may facilitate incorporation of the gathered feedback by logging each occurrence of the query, image, salient object (or region) shown, and salient object (or region) selected; the behavioral data engine 245 may keep track of the frequency that a certain salient object or region is selected or which salient objects or regions are commonly selected; the memory 232 also includes user interaction data 254; in certain aspects, the processor 236 is configured to determine the user interaction data 254 by obtaining user interaction data identifying interactions with images from image search results that are responsive to search queries; in this respect, the search results may be personalized based on the salient objects or regions of the most-recent images downloaded or clicked by the user; for example, the processor 236 may determine that a user interacted with an image from a search result, such as, by clicking on a segment (or region) of the image identified as salient, saving the image for subsequent access, or downloaded the image to a client (e.g., client 110), or the like; the processor 236 may keep track of the user interactions with a number of images over a given time period; in one or more implementations, the processor 236 may track the learned salient objects or regions of the last N images that the user interacted with as well as a current user query, where N is a positive integer value; the interaction history 254 may also include data indicating search behavior (and/or patterns) relating to prior image search queries"); and
	wherein the associating the second set of image search results for each of the query vector clusters comprises: selecting the at least the portion of each first set of image search results included in the second set of image search results based on the respective metrics of the image search results in the first set of image search results being over a predetermined threshold (Li, col. 12, lines 44-67; col. 13, lines 1-5: "step 608 includes selecting a plurality of similar visual media files having a visual similarity with the responsive visual media file; in some embodiments, step 608 may include obtaining a measure of visual similarity between the user-selected first search result and the search result from each of the proposed queries; further, in some embodiments step 608 may include using a weighted probability for each of multiple responsive visual media files to select, with the search engine, a similar visual media file from the database; in some embodiments, step 608 may include adjusting the weighting factors according to the type of interaction between the user and the responsive media file; that is, step 608 may include adding a heavier weight to select visual media files that are similar to a responsive visual media file that the user has downloaded, compared to the weight for selecting a visual media file that is similar to a responsive media file that the user has only highlighted or light-boxed; further, step 608 may include adding a lower weight to select a visual media file that the user has only clicked on”; the images are compared to a visual media file to determine similarity which includes meeting a predetermined threshold based on the user metrics, such as clicking or highlighting (adjusted weights) and then selected).
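For illustration only, selecting the image search results whose interaction metric is over a predetermined threshold can be sketched as follows. The metric values, the threshold of 0.10, and the function name are hypothetical; neither reference is asserted to use these values.

```python
from typing import Dict, List


def select_over_threshold(metrics: Dict[str, float], threshold: float) -> List[str]:
    """Keep the image identifiers whose metric is strictly over the threshold."""
    return [image_id for image_id, metric in metrics.items() if metric > threshold]


# Hypothetical per-image metrics, e.g. a click-based score per image result.
first_set_metrics = {"a.png": 0.30, "b.png": 0.05, "c.png": 0.12}
second_set = select_over_threshold(first_set_metrics, threshold=0.10)
```

Here "b.png" is left out of the second set because its hypothetical metric does not exceed the threshold.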
Regarding claim 4, Ranzinger, in view of Li, in view of Hashemi, teaches the method of claim 3, wherein the query vector clusters are generated based on a proximity of the query vectors in an N-dimensional space (Li, col. 8, lines 40-63: "FIG. 3 illustrates a chart of a multi-dimensional space 330 accessible by search engine 242 to refine a query for visual media search based on user selection of the visual media files; in some embodiments, multi-dimensional space 330 is formed by NN 240 using relevance feedback database 246, training database 248, annotated training database 250, visual media database 252, and interaction history database 254; further, multi-dimensional space 330 may be stored in memory 232, or may be external to memory 232 and directly or remotely accessible to search engine 242; multi-dimensional space 330 may include a visual media file vector space having visual media file vectors 331, 335a, 335b (hereinafter collectively referred to as “vectors 335”), and 332; vectors 331, 332 and 335 have an abscissa X1 and an ordinate X2, selected according to NN 240; moreover, each of vectors 331, 332 and 335 may be associated to a caption, a keyword, or some other text descriptor (e.g., through annotated training database 250); in some embodiments, NN 240 is configured so that vectors 331 and 332, associated with visual media files having similar or common text descriptors are located, or “clustered,” in close proximity to each other in multi-dimensional space 330, wherein a distance, D 350, between any two of vectors 331, 332 or 335 (‘A,’ and ‘B’) may be defined as a “cosine” distance, D"; see FIG. 3, proximity between vectors A and B in cluster 340-1 separated by distance 350).
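For illustration only, proximity between two vectors in an N-dimensional space can be measured with a cosine distance. The sketch below uses one common definition, one minus the cosine similarity; the exact formula used by Li is not reproduced in this action, so this definition is an assumption.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(angle between a and b); an assumed, commonly used definition."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
parallel = cosine_distance([1.0, 2.0], [2.0, 4.0])
```

Under this definition, vectors pointing in the same direction have a distance near zero regardless of their lengths, so vectors that are "clustered" in close proximity are those with a small distance.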
Regarding claim 5, Ranzinger, in view of Li, and in view of Hashemi, teaches the method of claim 2, wherein the word embedding algorithm is one of: word2vec, global vectors for word representation (GloVe), LDA2Vec, sense2vec and wang2vec (Hashemi, page 2, Section 3.2.3 (Aggregated Word Vector Features): "aggregation of query word embeddings is another simple set of features; the goal is to find an embedding for a query and use it as a feature to train the intent classifier; instead of passing query word vectors through the convolutional neural network, we can simply get the word vectors of each query word and sum together (Sum w2V) or get their average (Average w2v); the resulting query embedding will have the same dimension of word vectors; in our experiment we use word2vec word embeddings").
Regarding claim 21, Ranzinger, in view of Li, teaches the system of claim 17. Claim 21 recites the same functions as claim 2, but as a system. Thus, the analysis in rejecting claim 2 is equally applicable to claim 21.
Claims 7-10, 12-13, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Ranzinger, in view of Li, and further in view of U.S. Patent No.: 8,909,563 (Jing et al.) (hereinafter Jing).
Regarding claim 7, Ranzinger, in view of Li, teaches the method of claim 1, wherein each image search result of the first set of image search results is associated with a respective metric, the respective metric being indicative of user interactions with the image search result (Ranzinger, col. 10, lines 13-44: "the behavioral data engine 245 may be a module executed by the processor 236 that is configured to monitor (and/or track) user interactions with the search results from the image search engine 242; at runtime, the behavioral data engine 245 may facilitate incorporation of the gathered feedback by logging each occurrence of the query, image, salient object (or region) shown, and salient object (or region) selected; the behavioral data engine 245 may keep track of the frequency that a certain salient object or region is selected or which salient objects or regions are commonly selected; the memory 232 also includes user interaction data 254; in certain aspects, the processor 236 is configured to determine the user interaction data 254 by obtaining user interaction data identifying interactions with images from image search results that are responsive to search queries; in this respect, the search results may be personalized based on the salient objects or regions of the most-recent images downloaded or clicked by the user; for example, the processor 236 may determine that a user interacted with an image from a search result, such as, by clicking on a segment (or region) of the image identified as salient, saving the image for subsequent access, or downloaded the image to a client (e.g., client 110), or the like; the processor 236 may keep track of the user interactions with a number of images over a given time period; in one or more implementations, the processor 236 may track the learned salient objects or regions of the last N images that the user interacted with as well as a current user query, where N is a positive integer value; the interaction history 254 may also include data indicating search behavior (and/or patterns) relating to prior image search queries"); and
wherein the generating the query vector comprises: generating a feature vector for each image search result associated with the search query; weighting each feature vector (Ranzinger, col. 11, lines -34: "the neural language model is trained to learn to match the direction of the feature vector produced by the vision model (e.g., after the average-over-width-height layer of the convolution neural network 240) for an image that is highly correlated to a given class; for example, for a given class “tree”, the trained neural language model may return an arbitrary number of primary features that identify the class “tree”, which corresponds to the features recognized by the vision model; a given class (or concept) may be represented over a number of dimensions, and the convolutional neural network 240 may be allowed to use up to D features to identify the given class, where D is a positive integer; in one or more implementations, the processor 236, using the language model engine 244, obtains raw outputs of the class weights from the vision model (e.g., the spatial outputs from a spatial operator layer of the convolutional neural network 240) via the vision model engine 243; the processor 236, using the language model engine 244, feeds the raw class weights through the neural language model (e.g., the convolutional neural network 240) to generate a new set of class weights for that query (including queries not seen or trained against); in this respect, the neural language model with the new class weights attempts to learn how to map a query to the same manifold that the vision model learned").
Ranzinger, in view of Li, fails to teach
	generating for each image search result of a selected subset of image search results associated with the search query; weighting by the associated respective metric; and aggregating by the associated respective metrics.
	Jing teaches
	generating for each image search result of a selected subset of image search results associated with the search query (Jing, col. 3, lines 38-52: “the images (e.g., of image collection database 116) may be arranged into a plurality of image groups 114; image groups 114 may include groupings or categorizations of images of system 100; the images of an image group 114 may be grouped based on visual similarity; according to an embodiment, images are first grouped according to a semantic concept, for example, based on queries to which respective images correspond; for example, using an image search service all images returned from searching the web for a query “engine” may be considered as corresponding to the semantic concept “engine”; then, for each semantic concept, the group of images is further divided into sub-groups based upon visual similarity; each of the images may include, be associated with, or otherwise correspond to one or more labels 111 or weighted labels 115”);
weighting by the associated respective metric (Jing, col. 3, lines 66-67; col. 4, lines 1-5: “associations between images and labels can also be determined based upon image search queries and/or the resulting image sets generated; when considering queries and result sets, user click data such as the one or more images that were selected (e.g. clicked on) by a user in response to the query result may be used to refine any determined associations”); and
aggregating by the associated respective metrics (Jing, col. 8, lines 11-22: “a label aggregator 138 may analyze labels 111 or weighted labels 115 for the images in image groups 114 corresponding to trained classifiers 113 that returned scores greater than or equal to threshold 119; for each of image groups 114 with scores exceeding threshold 119 for a new image, label aggregator 138 may compare and aggregate those labels that occur most often and/or that have the greatest weight amongst the images of the image groups 114; according to an embodiment, label aggregator 138 may aggregate all the labels and/or their associated weights to determine which labels 111 or weighted labels 115 should be associated with or annotated to an incoming image”).
It would have been obvious to a person having ordinary skill in the art before the time of the effective filing date of the claimed invention of the instant application to modify the step of generating the query vector, as taught by Ranzinger, in view of Li, to include the steps of generating for each image search result of a selected subset of image search results associated with the search query, weighting by the associated respective metric, and aggregating by the associated respective metrics, as taught by Jing.
The suggestion/motivation for doing so would have been to incorporate and reflect user preferences in search queries by weighting different images in search queries based on user interactions metrics.
Therefore, it would have been obvious to combine Ranzinger and Li with Jing to obtain the invention as specified in claim 7.
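For illustration only, weighting each image feature vector by its interaction metric and aggregating the weighted vectors into a single query vector can be sketched as a metric-weighted average. The use of a weighted average, the click counts, and the function name are assumptions; the claims and the references do not specify this particular aggregation.

```python
from typing import List


def aggregate_weighted(features: List[List[float]], metrics: List[float]) -> List[float]:
    """Combine feature vectors into one vector, weighting each by its metric."""
    total = sum(metrics)
    if total == 0:
        raise ValueError("at least one metric must be non-zero")
    dim = len(features[0])
    # Weighted average: images with more interactions contribute more.
    return [
        sum(metric * feature[i] for feature, metric in zip(features, metrics)) / total
        for i in range(dim)
    ]


# Hypothetical feature vectors for two image results and their click counts.
image_features = [[1.0, 0.0], [0.0, 1.0]]
click_counts = [3.0, 1.0]
query_vector = aggregate_weighted(image_features, click_counts)
```

With three clicks on the first image and one on the second, the aggregated vector lies three quarters of the way toward the first image's features.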
Regarding claim 8, Ranzinger, in view of Li, and in view of Jing, teaches the method of claim 7, wherein the method further comprises, prior to generating the feature vector for each image search result of the selected subset of image search results: selecting at least a portion of each first set of image search results included in the selected subset of image search results based on the respective metrics of the image search results in the first set of image search results being over a predetermined threshold (Jing, col. 3, lines 66-67; col. 4, lines 1-5: “associations between images and labels can also be determined based upon image search queries and/or the resulting image sets generated; when considering queries and result sets, user click data such as the one or more images that were selected (e.g. clicked on) by a user in response to the query result may be used to refine any determined associations”; Jing, col. 8, lines 11-29: “a label aggregator 138 may analyze labels 111 or weighted labels 115 for the images in image groups 114 corresponding to trained classifiers 113 that returned scores greater than or equal to threshold 119; for each of image groups 114 with scores exceeding threshold 119 for a new image, label aggregator 138 may compare and aggregate those labels that occur most often and/or that have the greatest weight amongst the images of the image groups 114; according to an embodiment, label aggregator 138 may aggregate all the labels and/or their associated weights to determine which labels 111 or weighted labels 115 should be associated with or annotated to an incoming image; in an embodiment, the label aggregator 138 may perform additional functions and/or filtering to determine how to annotate an incoming image; for example, label aggregator 138 may determine how many times a label appears in the selection of image groups 114 exceeding threshold 119, and those labels that appear fewer than a certain number of times may be discarded and/or only the top five, ten or other number of labels may be used”).
Regarding claim 9, Ranzinger, in view of Li, and in view of Jing, teaches the method of claim 8, wherein the second set of image search results includes all of the image search results of the first set of image search results associated with the query vectors that are part of each of the respective clusters (Li, See FIG. 3; the second set of image search results associated with each query vector (dots in FIG. 3) that are part of the respective clusters 340-1 and 340-2 are all included; all query vectors outside the clusters are ignored; see FIG. 3 below
 
[Figure: Li, FIG. 3 (media_image3.png)]
).
Regarding claim 10, Ranzinger, in view of Li, and in view of Jing, teaches the method of claim 7, wherein the respective metric is one of: a click-through ratio (CTR), and a number of clicks (Ranzinger, col. 10, lines 13-44: "search results may be personalized based on the salient objects or regions of the most-recent images downloaded or clicked by the user; for example, the processor 236 may determine that a user interacted with an image from a search result, such as, by clicking on a segment (or region) of the image identified as salient, saving the image for subsequent access, or downloaded the image to a client (e.g., client 110), or the like; the processor 236 may keep track of the user interactions with a number of images over a given time period; in one or more implementations, the processor 236 may track the learned salient objects or regions of the last N images that the user interacted with as well as a current user query, where N is a positive integer value; the interaction history 254 may also include data indicating search behavior (and/or patterns) relating to prior image search queries").
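For reference only, a click-through ratio is conventionally the number of clicks on a search result divided by the number of times that result was shown. The following is a minimal sketch of that arithmetic; the function and parameter names are hypothetical and are not taken from the claims or from any cited reference.

```python
def click_through_ratio(clicks: int, impressions: int) -> float:
    """Return clicks divided by impressions (hypothetical helper).

    Returning 0.0 when a result was never shown is an assumption made here
    to avoid division by zero; it is not stated in the claims or references.
    """
    if impressions <= 0:
        return 0.0
    return clicks / impressions
```

Either this ratio or the raw click count could serve as the "respective metric" of claim 10 under this reading.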
Regarding claim 12, Ranzinger teaches a method for training a Machine Learning Algorithm (MLA), the MLA for categorization of images, the method executable at a server that executes the MLA, the method comprising: (Ranzinger, col. 6, lines 7-16: "the system may use an algorithm that detects the set of objects from a set of example images, referred to as training data; the disclosed system includes training of a series of computer-operated neural networks, such as a convolutional neural network, to teach the neural network to identify features of images mapping to certain object classes for identifying those images that are responsive to an input search query with a probability that a region of an image is deemed salient"; Ranzinger, col. 6, lines 25-29: "the disclosed system produces a set of vectors representing the object classifications for the corpus of images in the training data; the objects are learned from running the convolutional neural network over a corpus of the training data")
obtaining, from a search log, an indication of search queries having been executed in an image vertical search, each search query being associated with a first set of image search results (Ranzinger, col. 12, lines 64-67; col. 13, lines 1-8: "the process 300 begins by proceeding from start step to step 301 where the processor 236, using the behavioral data engine 245, obtains session logs associated with one or more users; subsequently, in step 302, the processor 236, using the behavioral data engine 245, extracts the most popular search queries from the obtained session logs; next, in step 303, for each extracted search query, the processor 236, using the image search engine 242, determines a set of images from an image collection (e.g., 252); subsequently, in step 304, each image in each set of images is decomposed into a set of saliency regions for the image"), each of the image search results being associated with a respective metric, the respective metric being indicative of user interactions with the image search result (Ranzinger, col. 10, lines 13-44: "the behavioral data engine 245 may be a module executed by the processor 236 that is configured to monitor (and/or track) user interactions with the search results from the image search engine 242; at runtime, the behavioral data engine 245 may facilitate incorporation of the gathered feedback by logging each occurrence of the query, image, salient object (or region) shown, and salient object (or region) selected; the behavioral data engine 245 may keep track of the frequency that a certain salient object or region is selected or which salient objects or regions are commonly selected; the memory 232 also includes user interaction data 254; in certain aspects, the processor 236 is configured to determine the user interaction data 254 by obtaining user interaction data identifying interactions with images from image search results that are responsive to search queries; in this respect, the search results may be personalized based on the salient objects or regions of the most-recent images downloaded or clicked by the user; for example, the processor 236 may determine that a user interacted with an image from a search result, such as, by clicking on a segment (or region) of the image identified as salient, saving the image for subsequent access, or downloaded the image to a client (e.g., client 110), or the like; the processor 236 may keep track of the user interactions with a number of images over a given time period; in one or more implementations, the processor 236 may track the learned salient objects or regions of the last N images that the user interacted with as well as a current user query, where N is a positive integer value; the interaction history 254 may also include data indicating search behavior (and/or patterns) relating to prior image search queries");
generating a feature vector for each image search result of the respective selected subset of image search results associated with each search query (Ranzinger, col. 11, lines 11-34: "the neural language model is trained to learn to match the direction of the feature vector produced by the vision model (e.g., after the average-over-width-height layer of the convolution neural network 240) for an image that is highly correlated to a given class; for example, for a given class “tree”, the trained neural language model may return an arbitrary number of primary features that identify the class “tree”, which corresponds to the features recognized by the vision model; a given class (or concept) may be represented over a number of dimensions, and the convolutional neural network 240 may be allowed to use up to D features to identify the given class, where D is a positive integer; in one or more implementations, the processor 236, using the language model engine 244, obtains raw outputs of the class weights from the vision model (e.g., the spatial outputs from a spatial operator layer of the convolutional neural network 240) via the vision model engine 243; the processor 236, using the language model engine 244, feeds the raw class weights through the neural language model (e.g., the convolutional neural network 240) to generate a new set of class weights for that query (including queries not seen or trained against); in this respect, the neural language model with the new class weights attempts to learn how to map a query to the same manifold that the vision model learned");
generating a query vector for each of the search queries based on the feature vectors and the respective metrics of the image search results of the respective selected subset of image search results (Ranzinger, col. 15, lines 36-41: "next, in step 402, the processor 236 provides each specific query to a trained language model (e.g., the convolutional neural network 240); subsequently, in step 403, the processor 236, using the language model engine 244, obtains a query vector for each specific query of the user input from the trained language model"; Ranzinger, col. 12, lines 19-35: "at runtime, given an arbitrary text query, the trained language model can construct a vector that matches the image that also is associated with that query;  for example, the neural language model learns to construct a vector that points in approximately the same direction as the feature vectors produced by the convolutional neural network 240 in the vision model for images highly related to the given text query ... the processor 236, using the image search engine 242, then takes a dot product of the vector that the neural language model generated, for every cell within the grid, across every image in the image collection (e.g., the index data 256)"); 
generating a set of training objects by storing, for each of the query vector clusters, each image search result of the second set of image search results as a training object in the set of training objects, each image search result being associated with a cluster label, the cluster label being indicative of the query vector cluster the image search result is associated with (Ranzinger, col. 15, lines 36-41: "next, in step 402, the processor 236 provides each specific query to a trained language model (e.g., the convolutional neural network 240); subsequently, in step 403, the processor 236, using the language model engine 244, obtains a query vector for each specific query of the user input from the trained language model"; Ranzinger, col. 12, lines 19-35: "at runtime, given an arbitrary text query, the trained language model can construct a vector that matches the image that also is associated with that query;  for example, the neural language model learns to construct a vector that points in approximately the same direction as the feature vectors produced by the convolutional neural network 240 in the vision model for images highly related to the given text query ... the processor 236, using the image search engine 242, then takes a dot product of the vector that the neural language model generated, for every cell within the grid, across every image in the image collection (e.g., the index data 256)"); and
training the MLA to categorize images using the stored set of training objects (Ranzinger, col. 12, lines 4-18: “for example, the processor 236 of the server 130 executes instructions to submit a plurality of training images containing content identifying different semantic concepts to the convolutional neural network 240 that is configured to analyze image pixel data for each of the plurality of training images to identify features, in each of the plurality of training images, corresponding to a particular semantic concept and receive, from the convolutional neural network 240 and for each of the plurality of training images, an identification of one or more object classes corresponding to the image processed by the convolutional neural network 240”).
Ranzinger fails to teach
for each search query, selecting image search results of the first set of image search results having a respective metric over a predetermined threshold to add to a respective selected subset of image search results.
Jing teaches
for each search query, selecting image search results of the first set of image search results having a respective metric over a predetermined threshold to add to a respective selected subset of image search results (Jing, col. 3, lines 66-67; col. 4, lines 1-5: “associations between images and labels can also be determined based upon image search queries and/or the resulting image sets generated; when considering queries and result sets, user click data such as the one or more images that were selected (e.g. clicked on) by a user in response to the query result may be used to refine any determined associations”; Jing, col. 8, lines 11-29: “a label aggregator 138 may analyze labels 111 or weighted labels 115 for the images in image groups 114 corresponding to trained classifiers 113 that returned scores greater than or equal to threshold 119; for each of image groups 114 with scores exceeding threshold 119 for a new image, label aggregator 138 may compare and aggregate those labels that occur most often and/or that have the greatest weight amongst the images of the image groups 114; according to an embodiment, label aggregator 138 may aggregate all the labels and/or their associated weights to determine which labels 111 or weighted labels 115 should be associated with or annotated to an incoming image; in an embodiment, the label aggregator 138 may perform additional functions and/or filtering to determine how to annotate an incoming image; for example, label aggregator 138 may determine how many times a label appears in the selection of image groups 114 exceeding threshold 119, and those labels that appear fewer than a certain number of times may be discarded and/or only the top five, ten or other number of labels may be used”).
Ranzinger, in view of Jing, fails to teach
clustering the query vectors into a plurality of query vector clusters, and for each of the query vector clusters, associating a second set of image search results, the second set of image search results including the respective selected subsets of image search results associated with the query vectors that are part of each of the respective query vector clusters.
Li teaches
clustering the query vectors into a plurality of query vector clusters (Li, col. 6, lines 5-7: "training vectors for each of the visual media files may be clustered into a number of clusters according to a clustering algorithm, for example, using k-means clustering"; Li, col. 9, lines 10-23: "multi-dimensional space 330 is dense, including clusters 340-1 and 340-2 (hereinafter, collectively referred to as “clusters 340”), of closely related vectors 331 and 332, respectively; each cluster 340 may be associated with visual media files belonging in a class of visual media files for a common, or similar text descriptor (e.g., caption, or keyword); further, each cluster 340 may be associated with a conceptual representation of the visual media files in the cluster (e.g., based on a caption or keyword associated with the visual media file in annotated training database 250); accordingly, multi-dimensional space 330 may be separated in two or many more clusters 340, each cluster 340 grouping together visual media files expressing a coherent idea, as expressed in a keyword, caption, or text descriptor"; see FIG. 3; visual media files includes images; training vectors are similar to query vectors); and
for each of the query vector clusters, associating a second set of image search results, the second set of image search results including the respective selected subsets of image search results associated with the query vectors that are part of each of the respective query vector clusters (Li, col. 8, lines 49-52: "multi-dimensional space 330 may include a visual media file vector space having visual media file vectors 331, 335a, 335b (hereinafter collectively referred to as “vectors 335”), and 332"; see FIG. 3, some vectors including the associated images from the entire collection are excluded from the clusters 340-1 and 340-2 (dots outside the clusters)

[Figure: Li, FIG. 3 (media_image3.png)]
).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention of the instant application to modify the method for training a Machine Learning Algorithm (MLA) for categorization of images, as taught by Ranzinger, to include the step of selecting, for each search query, image search results of the first set of image search results having a respective metric over a predetermined threshold to add to a respective selected subset of image search results, as taught by Jing, and to include the step of clustering the query vectors into a plurality of query vector clusters, and for each of the query vector clusters, associating a second set of image search results, the second set of image search results including the respective selected subsets of image search results associated with the query vectors that are part of each of the respective query vector clusters, as taught by Li.
The suggestion/motivation for doing so would have been to incorporate user interaction data into the selection process of relevant image results in a search query, and to easily choose the most relevant images for the clusters, which then will more accurately reflect their cluster label.
Therefore, it would have been obvious to combine Ranzinger with Jing and Li to obtain the invention as specified in claim 12.
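For illustration only, the sequence of steps recited in claim 12 may be pictured with the following minimal sketch. It is not drawn from the specification or from Ranzinger, Jing, or Li. All function names, the dictionary layout of the search log, the metric-weighted averaging used to form a query vector, and the small deterministic k-means routine are assumptions chosen for readability.

```python
def select_by_metric(results, threshold):
    """Keep only results whose interaction metric exceeds the threshold."""
    return [r for r in results if r["metric"] > threshold]

def query_vector(selected):
    """Assumed aggregation: metric-weighted mean of the results' feature vectors."""
    total = sum(r["metric"] for r in selected)
    dim = len(selected[0]["features"])
    return [sum(r["metric"] * r["features"][i] for r in selected) / total
            for i in range(dim)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=10):
    """Tiny k-means; the first k vectors seed the centroids (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def build_training_objects(search_log, threshold, k):
    """Return (image_id, cluster_label) pairs for results above the threshold."""
    queries, subsets, qvecs = [], {}, []
    for query, results in search_log.items():
        selected = select_by_metric(results, threshold)
        if not selected:
            continue  # a query with no qualifying results contributes nothing
        queries.append(query)
        subsets[query] = selected
        qvecs.append(query_vector(selected))
    labels = kmeans(qvecs, k)
    return [(r["image_id"], label)
            for query, label in zip(queries, labels)
            for r in subsets[query]]
```

Under these assumptions, the returned pairs correspond to the "set of training objects" with cluster labels, which would then be supplied to whatever classifier plays the role of the MLA.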
Regarding claim 13, Ranzinger, in view of Jing, and in view of Li, teaches the method of claim 12, wherein the training is a first phase training for coarse training of the MLA to categorize images (Ranzinger, col. 9, lines 46-61: "also included in the memory 232 of the server 130 is a set of training data 248; the set of training data 248 can be, for example, a dataset of content items (e.g., images) corresponding to an arbitrary number of object classes with a predetermined number of content items (e.g., about 10,000 images) per object class; the set of training data 248 may include multiple instances (or sets) of training data, where at least one instance (or set) of training data is associated with an intended object class; for example, the set of training data 248 may include images that include features that represent positive instances of a desired class so that the convolutional neural network 248 can be trained to distinguish between images with a feature of the desired class and images without a feature of the desired class; the set of training data 248 also may include image vector information and image cluster information").
Regarding claim 18, Ranzinger, in view of Jing, and in view of Li, teaches the system of claim 17. Claim 18 recites the same functions as claim 7, but as a system. Thus, the analysis in the rejection of claim 7 is equally applicable to claim 18.
Regarding claim 19, Ranzinger, in view of Jing, and in view of Li, teaches the system of claim 18. Claim 19 recites the same functions as claim 8, but as a system. Thus, the analysis in the rejection of claim 8 is equally applicable to claim 19.
Regarding claim 20, Ranzinger, in view of Jing, and in view of Li, teaches the system of claim 19. Claim 20 recites the same functions as claim 9, but as a system. Thus, the analysis in the rejection of claim 9 is equally applicable to claim 20.
Claims 14-16 are rejected under 35 U.S.C. 103 as being unpatentable over Ranzinger, in view of Jing, in view of Li, and further in view of U.S. Patent Application Publication No.: 2017/0097948 (Kerr et al.) (hereinafter Kerr).
Regarding claim 14, Ranzinger, in view of Jing, and in view of Li, teaches the method of claim 13.
Ranzinger, in view of Jing, and in view of Li, fails to teach
wherein the method further comprises fine training the MLA using an additional set of fine-tuned training objects.
Kerr teaches
wherein the method further comprises fine training the MLA using an additional set of fine-tuned training objects (Kerr, para. [0018], lines 6-16: "training images may be utilized to implement a generic system initially that identifies visual similarity generally, but without any understanding of specific attributes; the generic system may then be trained with a new set of training data for a specific attribute; in this way, the system may be fine-tuned at different output layers to detect different attributes with each layer being independently evolved from the generic system; in other words, the transformations necessary to extract a particular feature vector at a particular layer of the system is learned based on set of training data for each specific attribute").
It would have been obvious to a person having ordinary skill in the art before the time of the effective filing date of the claimed invention of the instant application to modify the method as taught by Ranzinger, in view of Jing, and in view of Li, to include the step of fine training the MLA using an additional set of fine-tuned training objects, as taught by Kerr.
The suggestion/motivation for doing so would have been to strengthen and fine-tune a machine learning algorithm by including additional training data sets, which helps the algorithm categorize images for user web search more robustly.
Therefore, it would have been obvious to combine Ranzinger, Jing, and Li with Kerr to obtain the invention as specified in claim 14.
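For illustration only, the coarse-then-fine training arrangement of claims 13 and 14 may be pictured with the following minimal sketch. A simple perceptron stands in for the MLA; it is not the network of Ranzinger or Kerr. The function names, the learning rates, and the sample data are all assumptions.

```python
def train(weights, samples, epochs, lr):
    """Perceptron-style training pass; returns an updated copy of the weights.

    Each sample is (features, label) with label +1 or -1.
    """
    w = list(weights)
    for _ in range(epochs):
        for features, label in samples:
            score = sum(wi * xi for wi, xi in zip(w, features))
            if label * score <= 0:  # misclassified: move weights toward the label
                w = [wi + lr * label * xi for wi, xi in zip(w, features)]
    return w

def predict(w, features):
    """Return +1 if the weighted sum is positive, otherwise -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, features)) > 0 else -1

# Hypothetical data: the last feature is a constant bias term.
coarse_set = [([1.0, 0.0, 0.0, 1.0], 1), ([0.0, 1.0, 0.0, 1.0], -1)]
fine_set = [([0.5, 0.5, 1.0, 1.0], 1), ([0.5, 0.5, -1.0, 1.0], -1)]

# First phase: coarse training on the larger, automatically labeled set.
coarse_weights = train([0.0, 0.0, 0.0, 0.0], coarse_set, epochs=5, lr=1.0)
# Second phase: fine training continues from the coarse weights,
# using the additional set and a smaller learning rate (assumption).
fine_weights = train(coarse_weights, fine_set, epochs=5, lr=0.1)
```

In this sketch the second phase starts from the weights produced by the first phase rather than from scratch, which is the sense in which the additional training objects refine, rather than replace, the coarse model.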
Regarding claim 15, Ranzinger, in view of Jing, in view of Li, and in view of Kerr, teaches the method of claim 14, wherein the MLA is an artificial neural network (ANN) learning algorithm (Ranzinger, col. 8, lines 37-46: "in one or more implementations, the convolutional neural network 240 may be a series of neural networks, one neural network for each object classification; as discussed herein, a convolutional neural network 240 is a type of feed-forward artificial neural network where individual neurons are tiled in such a way that the individual neurons respond to overlapping regions in a visual field; the architecture of the convolutional neural network 240 may be in the object of existing well-known image classification architectures such as AlexNet, GoogLeNet, or Visual Geometry Group models").
Regarding claim 16, Ranzinger, in view of Jing, in view of Li, and in view of Kerr, teaches the method of claim 15, wherein the MLA is a deep learning algorithm (Ranzinger, col. 17, lines 14-19: "in one or more implementations, the processor 236, using the vision model engine 243, trains a deep learning model (e.g., the convolutional neural network 240) using the training data 248, where the deep learning model is trained to predict which query an image is more likely to belong to given the image").
Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL A SHARIFF whose telephone number is (571)272-9741.  The examiner can normally be reached on M-TH 7:30 AM EST – 5:30 PM EST.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, SUMATI LEFKOWITZ, can be reached at 571-272-3638 or through e-mail at sumati.lefkowitz@uspto.gov.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
/MICHAEL ADAM SHARIFF/
Examiner, Art Unit 2662

/GANDHI THIRUGNANAM/
Primary Examiner, Art Unit 2662