DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.



Claims 10-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.  The claims do not fall within at least one of the four categories of patent eligible subject matter because the terms “device-storage media” and “computer storage medium” can be directed to a transitory signal, carrier wave, or similar embodiment capable of storing information.
Claims 10-20 would be directed to an appropriate article of manufacture within the meaning of 35 U.S.C. 101 if the media would only reasonably be interpreted by one of ordinary skill in the art as covering embodiments which are articles produced from raw or prepared materials and which are structurally and functionally interconnected to the program in such a manner as to enable the program to act as a computer component and realize its functionality.
Regarding Claims 10 and 19, respectively, the claimed “device-storage media” and “computer storage medium”: under a recent precedential opinion, the scope of the recited “device-storage media” and “computer storage medium” encompasses transitory media such as signals or carrier waves where, as here, the Specification does not limit the “device-storage media” and “computer storage medium” to non-transitory forms. See ¶ 128, "device-storage media specifically and unequivocally excludes carrier waves, modulated data signals, and other such transitory media, at least some of which are covered under the term "signal medium" discussed below"; also see ¶ 129, “The term "signal medium" shall be taken to include any form of modulated data signal, carrier wave, and so forth”; also see ¶ 130, "device-readable medium … defined to include both machine-storage media and signal media. Thus, the terms include both storage devices/media and carrier waves/modulated data signals.” See Ex parte Mewherter, 107 USPQ2d 1857, 1862.
Dependent Claims 11-18 and 20 fail to cure the deficiency of independent Claims 10 and 19, and are therefore also rejected under 35 U.S.C. 101 as being directed to non-statutory subject matter for the same reason addressed above.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 2 and 11 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claims 2 and 11 recite “wherein the taxonomy includes categories comprising… and other”. It is not clear what the applicant refers to as “and other”. Because the list is introduced by “comprising” and more categories can therefore be added to it, it is not clear whether “and other” is meant to claim additional categories not listed, or whether it is a category named “other” into which everything else is categorized. For the purpose of examination, “and other” is interpreted as a category named “other”. Appropriate clarification/correction is required.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 5-8, 10, 14-16 and 19-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by El-Saban et al. (US 2015/0331929 A1).


Regarding claim 1. 
El-Saban teaches a computer implemented method, comprising: receiving an image having at least one subject (see fig 4 element 402, untagged images, also see ¶ 42, “image tagging server 102 receives an untagged image 402 and generates a tagged image 404. A tagged image 404 is one that has one or more tags associated with it where a tag describes a feature of the image.”); 
submitting the image to a trained visual intent classifier, the trained visual intent classifier being trained as a multilabel classifier (see ¶ 57, “The objection recognition module 406 is configured to identify objects in the images, classify the identified objects and assign the objects one or more tags based on the classification. The objection recognition module 404 may be configured to classify elements of the image into one of a fixed number of object classes using a discriminative technique. For example, a trained random decision forest may be used to classify the pixels of the image using pixel difference features.”, also see ¶¶ 55-59); 
receiving from the trained visual intent classifier at least one classification label from a taxonomy used to train the multilabel classifier, the at least one classification label corresponding to the at least one subject of the image (see ¶ 57-58, “The objection recognition module 406 is configured to identify objects in the images, classify the identified objects and assign the objects one or more tags based on the classification.”, also see ¶ 44); 
based on the at least one classification label, initiating at least one of: triggering a query related to the image (see ¶ 89, “The image search and navigation module 104 retrieves the tags associated with the selected image or the selected object 802 and displays the image tags for the selected image or object in a graphical user interface 804”); 
causing presentation of information to help the user formulate a query related to the image (see figures 2-3 showing images with suggested tags such as person, car and street, also see ¶ 23, “automatically mapping the natural language query terms to one or more image tags”, also see ¶¶ 88-90 and figures 7 and 8 describing different ways to suggest a query based on images); 
initiating a visual search using a data store comprising images having classification labels that comprise the at least one classification label associated with the image (see ¶ 41, “the mapped tags are then used to identify and retrieve images that match the natural language query terms and/or phrase. The identified images (or part thereof or a version thereof) are then provided to the user (e.g. via an end-user device 116)”, also see ¶ 85, “the image search and navigation module 104 uses the mapped image tags to identify and retrieve one or more imaged from the tagged images database that match the natural language query terms and/or phrases.”); 
and initiating visual intent detection on the image (see ¶ 44, “Tags related to a particular region (or bounding box) in the image may be identified as “region” tags. In some cases the user may automatically update the query terms by clicking on or otherwise selecting one of the tags. For example, if the user clicked on or otherwise selected the tag “person”, the term “person” may be added to the query term entry box.”).
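For illustration only, the “multilabel” limitation mapped above (a classifier that may assign one or more tags to a single image, as opposed to exactly one) can be sketched as follows. The sketch is not taken from El-Saban or from the application; the function names, the per-label scores and the 0.5 cutoff are assumptions chosen for the example.

```python
def multilabel_predict(scores, cutoff=0.5):
    """Return every label whose score meets the cutoff (zero, one, or many).

    `scores` maps a label name to a per-label score in [0, 1]. Both the
    mapping and the cutoff are hypothetical inputs for this sketch.
    """
    return sorted(label for label, s in scores.items() if s >= cutoff)


def single_label_predict(scores):
    """For contrast: a single-label classifier returns exactly one label."""
    return max(scores, key=scores.get)


# Hypothetical scores for an image like the one discussed for FIG. 2.
scores = {"person": 0.91, "car": 0.78, "street": 0.64, "book": 0.05}
print(multilabel_predict(scores))    # several labels for one image
print(single_label_predict(scores))  # only the top-scoring label
```

Under these assumptions, one image can receive several labels at once, which is the behavior the mapping relies on when it points to the assignment of “one or more tags”.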

Regarding claim 5. 
El-Saban teaches the method of claim 1.
 El-Saban further teaches wherein triggering a query comprises: sending the at least one classification label associated with the image to a user device (see ¶ 89, “The image search and navigation module 104 retrieves the tags associated with the selected image or the selected object 802 and displays the image tags for the selected image or object in a graphical user interface 804”); 
and receiving from the user device a query to be executed by a search service (see ¶ 89, “Where the user has selected an image the image tags for the image may be displayed as list in the graphical user interface as shown in FIG. 2. Where, however, the user has selected an object within an image the image tag associated with the object may be displayed on top of the bounding box, for example, or within the query term entry box as shown in FIG. 2.”).

Regarding claim 6. 
El-Saban teaches the method of claim 1.
 El-Saban further teaches wherein causing presentation of information to help the user formulate a query related to the image (see ¶ 23, “automatically mapping the natural language query terms to one or more image tags”) comprises: 
selecting a plurality of potential activities based on the at least one classification label associated with the image (see fig. 8 and ¶ 88, “At block 800, the image search and navigation module 104 receives an indication from an end-user device 116 that the user has selected one of a displayed image or an object within a displayed image (indicated by, for example, a bounding box).”); 
sending the plurality of potential activities to a user device (see ¶ 89, “The image search and navigation module 104 retrieves the tags associated with the selected image or the selected object 802 and displays the image tags for the selected image or object in a graphical user interface 804.”, also see ¶ 78, “blocks 606 and 608 where an ontology distance and one or more semantic space distances are computed between the natural language query term or phrase and individual image tags”, where the potential activity is determined using ontology distance and semantic space distance, such as the tags person, car and street shown in window 200 of figure 2); 
receiving from the user device, selection of at least one activity of the plurality of potential activities (see ¶ 89, “Where the user has selected an image the image tags for the image may be displayed as list in the graphical user interface as shown in FIG. 2. Where, however, the user has selected an object within an image the image tag associated with the object may be displayed on top of the bounding box, for example, or within the query term entry box as shown in FIG. 2.”); 
formulating a query based on the selected at least one activity (see ¶ 90, “the user has selected an image, the retrieved images may be images that comprise the query terms in the query term entry box (which now includes the image tag associated with the selected object. Once the images have been retrieved from the tagged image database the method may proceed to block 808 or it may proceed directly to block 810.”); 
and sending the query to a query engine for execution (see ¶ 92, “At block 810 the image search and navigation module 104 outputs the ranked or not-ranked list of retrieved images to a graphical user interface displayed on the end-user device 116. Where the user selected an image the retrieved images (the images similar to the selected images) may be displayed in a secondary window of the GUI as shown in FIG. 2. Where, however, the user selected an object the retrieved images (the images matching the query terms) may be displayed in a main results window of the GUI as shown in FIG. 2”).

Regarding claim 7. 
El-Saban teaches the method of claim 1.
 El-Saban further teaches wherein initiating a visual search using a data store comprising images having classification labels that comprise the at least one classification label associated with the image (see ¶ 41, “the mapped tags are then used to identify and retrieve images that match the natural language query terms and/or phrase. The identified images (or part thereof or a version thereof) are then provided to the user (e.g. via an end-user device 116)”, also see ¶ 85, “the image search and navigation module 104 uses the mapped image tags to identify and retrieve one or more imaged from the tagged images database that match the natural language query terms and/or phrases.”) comprises: 
selecting a subset of images from the data store, each image in the subset having at least one associated classification label that matches the at least one classification label associated with the image (see ¶ 85, “the image search and navigation module 104 uses the mapped image tags to identify and retrieve one or more imaged from the tagged images database that match the natural language query terms and/or phrases.”); 
performing a visual search on the subset of images (see ¶ 85, “the image search and navigation module 104 may retrieve images that have been tagged with the mapped image tags. Where the search request comprised a proximity indicator may only retrieve images that have been tagged with the mapped image tags and have the objects identified by the mapped image tags in close proximity. Once the matching images have been retrieved from the tagged image database the method may proceed to block 208 or the method may proceed directly to block 210”); 
ranking images that are indicated as a match by the visual search (see ¶ 86, “the image search and navigation module 104 ranks the retrieved images based on how well they match the search criteria. For example, as described above, in some cases the image tagging server 102 may be configured to assign a confidence value to each image tag assigned to an image. The confidence value indicates the accuracy of the tag (e.g. the likelihood that the image contains the item identified by the tag).”); 
and returning a subset of the ranked images (see ¶ 87, “At block 210 the image search and navigation module 104 may output the ranked or not ranked retrieved images to a graphical user interface of the end-user device 116.”).
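For illustration only, the four steps mapped above for claim 7 (selecting a label-matched subset, searching that subset, ranking the matches, and returning a subset of the ranked images) can be sketched as follows. The sketch is not taken from El-Saban or from the application; the `TaggedImage` record, the function name `label_filtered_search`, and the use of tag confidence as both the match test and the ranking score are assumptions standing in for whatever visual search is actually performed.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedImage:
    """Hypothetical record: an image id plus tag -> confidence values."""
    image_id: str
    tags: dict = field(default_factory=dict)


def label_filtered_search(store, query_labels, top_k=3):
    # Step 1: keep only images sharing at least one label with the query image.
    subset = [img for img in store
              if any(label in img.tags for label in query_labels)]

    # Step 2 (stand-in for the visual search): score each candidate by its
    # highest confidence among the matched labels.
    def score(img):
        return max(img.tags[label] for label in query_labels
                   if label in img.tags)

    # Step 3: rank the matches, best first.
    ranked = sorted(subset, key=score, reverse=True)

    # Step 4: return only a subset of the ranked images.
    return ranked[:top_k]


store = [
    TaggedImage("a", {"car": 0.9, "street": 0.4}),
    TaggedImage("b", {"person": 0.8}),
    TaggedImage("c", {"car": 0.6}),
    TaggedImage("d", {"tree": 0.7}),
]
result = label_filtered_search(store, ["car", "person"], top_k=2)
print([img.image_id for img in result])
```

Filtering by label before searching keeps the more expensive comparison confined to images that already share a classification label with the query image.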

Regarding claim 8. 
El-Saban teaches the method of claim 1.
 El-Saban further teaches wherein initiating visual intent detection on the image (see ¶ 44, “Tags related to a particular region (or bounding box) in the image may be identified as “region” tags. In some cases the user may automatically update the query terms by clicking on or otherwise selecting one of the tags. For example, if the user clicked on or otherwise selected the tag “person”, the term “person” may be added to the query term entry box”) comprises: 
selecting a visual intent detection mode, the visual intent detection mode selected from a mode that identifies a plurality of subjects in the image and a mode that identifies a single subject in the image (see ¶ 42, “shown in FIG. 2 the user has searched the set of images using the natural language query term “car”. The images 130 (or a thumbnail or a version thereof) matching the query (e.g. images that were associated with the tag “car”) are displayed to the user via the graphical user interface 124”, i.e. where car is a single subject in the image as seen in figure 2, element 124, also see ¶ 44, “the window 200 may display a list of the tags 202 that have been associated with the image 130. For example, in FIG. 2, the window 200 shows that the selected image 130 is associated (or has been tagged) with the tags 202 “person”, “car” and “street”.”, i.e. where person, car and street correspond to the plurality-of-subjects mode as shown in figure 2, element 200); 
selecting a trained visual intent detection model corresponding to the visual intent detection mode (see ¶ 57, “a trained random decision forest may be used to classify the pixels of the image using pixel difference features. In some cases, each node of the trees of the random decision forest is associated with either appearance or shape. One or more tags are then assigned to the image, or to an element of the image such as a bounding box, pixel or group of pixels, based on the classification.”); 
presenting the image to the trained visual intent detection model (see figure 2, all elements show bounding boxes with different subjects, also see ¶ 46, “a rectangular box 208 (also referred to as a bounding box) may be shown around the identified object”); 
receiving from the trained visual intent detection model a number of bounding boxes that correspond to the visual intent detection mode, each of the bounding boxes substantially bounding a corresponding subject and each of the bounding boxes comprising at least one associated classification label which identifies the corresponding subject (see ¶ 46, “a rectangular box 208 (also referred to as a bounding box) may be shown around the identified object. The bounding box around the object can just pop up over the image, without actually being drawn. Box 208 when clicked can be used to navigate between images by searching for images with related region tags. For example, if the bounding box 208 contains a person then the region tag may be “person”. When user input is received selecting the bounding box 208 the region tag may be used as a query to retrieve images.”, also see ¶ 88-89); 
and returning to a user device the image comprising the bounding boxes and the at least one associated classification label (see ¶ 46, “the user may be able to see what objects were identified in the selected image 130 by moving the cursor, for example, over the display of the selected image 130 in the window 200. When the cursor is situated over an identified object, the identified object may be indicated or highlighted as such”, also see ¶ 47, “if the user moves the cursor over one of the people shown in the selected image 130, a rectangular box will be displayed over the person. If the user then clicks anywhere in the rectangular box the term “person” may be added to the query term entry box so that it comprises two query terms—“car” and “person”.”, also see ¶¶ 48 and 88-89).

Regarding claim 10. 
El-Saban teaches a system comprising: a processor (see ¶ 95, “Computing-based device 900 comprises one or more processors 902”) and device-storage media (see ¶ 96, “computer-readable media that is accessible by computing based device 900. Computer-readable media may include, for example, computer storage media such as memory 910 and communications media”) having executable instructions which, when executed by the processor, implement visual intent classification (see ¶ 41, “the mapped tags are then used to identify and retrieve images that match the natural language query terms and/or phrase. The identified images (or part thereof or a version thereof) are then provided to the user (e.g. via an end-user device 116)”, also see ¶ 85, “the image search and navigation module 104 uses the mapped image tags to identify and retrieve one or more imaged from the tagged images database that match the natural language query terms and/or phrases.”), visual intent detection (see ¶ 44, “Tags related to a particular region (or bounding box) in the image may be identified as “region” tags. In some cases the user may automatically update the query terms by clicking on or otherwise selecting one of the tags. For example, if the user clicked on or otherwise selected the tag “person”, the term “person” may be added to the query term entry box.”), or both, comprising: receiving a request comprising an image having at least one associated subject (see fig 4 element 402, untagged images, also see ¶ 42, “image tagging server 102 receives an untagged image 402 and generates a tagged image 404. A tagged image 404 is one that has one or more tags associated with it where a tag describes a feature of the image.”); 
when the request is for visual intent classification (see ¶ 41, “the mapped tags are then used to identify and retrieve images that match the natural language query terms and/or phrase. The identified images (or part thereof or a version thereof) are then provided to the user (e.g. via an end-user device 116)”, also see ¶ 85, “the image search and navigation module 104 uses the mapped image tags to identify and retrieve one or more imaged from the tagged images database that match the natural language query terms and/or phrases.”), performing operations comprising: 
submitting the image to a trained visual intent classifier, the trained visual intent classifier being trained as a multilabel classifier (see ¶ 57, “The objection recognition module 406 is configured to identify objects in the images, classify the identified objects and assign the objects one or more tags based on the classification. The objection recognition module 404 may be configured to classify elements of the image into one of a fixed number of object classes using a discriminative technique. For example, a trained random decision forest may be used to classify the pixels of the image using pixel difference features.”, also see ¶¶ 55-59); 
receiving from the trained visual intent classifier at least one classification label from a taxonomy used to train the multilabel classifier, the at least one classification label corresponding to the at least one subject of the image (see ¶ 57-58, “The objection recognition module 406 is configured to identify objects in the images, classify the identified objects and assign the objects one or more tags based on the classification.”, also see ¶ 44); 
based on the at least one classification label, initiating at least one of: triggering a query related to the image (see ¶ 89, “The image search and navigation module 104 retrieves the tags associated with the selected image or the selected object 802 and displays the image tags for the selected image or object in a graphical user interface 804”);
causing presentation of information to help the user formulate a query related to the image (see figures 2-3 showing images with suggested tags such as person, car and street, also see ¶ 23, “automatically mapping the natural language query terms to one or more image tags”, also see ¶¶ 88-90 and figures 7 and 8 describing different ways to suggest a query based on images); 
initiating a visual search using a data store comprising images having classification labels that comprise the at least one classification label associated with the image (see ¶ 41, “the mapped tags are then used to identify and retrieve images that match the natural language query terms and/or phrase. The identified images (or part thereof or a version thereof) are then provided to the user (e.g. via an end-user device 116)”, also see ¶ 85, “the image search and navigation module 104 uses the mapped image tags to identify and retrieve one or more imaged from the tagged images database that match the natural language query terms and/or phrases.”); 
and initiating visual intent detection on the image (see ¶ 44, “Tags related to a particular region (or bounding box) in the image may be identified as “region” tags. In some cases the user may automatically update the query terms by clicking on or otherwise selecting one of the tags. For example, if the user clicked on or otherwise selected the tag “person”, the term “person” may be added to the query term entry box.”); 
and when the request is for visual intent detection (see ¶ 44, “Tags related to a particular region (or bounding box) in the image may be identified as “region” tags. In some cases the user may automatically update the query terms by clicking on or otherwise selecting one of the tags. For example, if the user clicked on or otherwise selected the tag “person”, the term “person” may be added to the query term entry box.”), performing operations comprising: 
presenting the image to the trained visual intent detection model, the trained visual intent detection model being trained in one of two training modes, the first training mode identifying a plurality of subjects in the image and the second training mode a single subject in the image (see ¶ 42, “shown in FIG. 2 the user has searched the set of images using the natural language query term “car”. The images 130 (or a thumbnail or a version thereof) matching the query (e.g. images that were associated with the tag “car”) are displayed to the user via the graphical user interface 124”, i.e. where car is a single subject in the image as seen in figure 2, element 124, also see ¶ 44, “the window 200 may display a list of the tags 202 that have been associated with the image 130. For example, in FIG. 2, the window 200 shows that the selected image 130 is associated (or has been tagged) with the tags 202 “person”, “car” and “street”.”, i.e. where person, car and street correspond to the plurality-of-subjects mode as shown in figure 2, element 200); 
receiving from the trained visual intent detection model a number of bounding boxes that correspond to the training mode, each of the bounding boxes substantially bounding a corresponding subject and each of the bounding boxes comprising at least one associated classification label which identifies the corresponding subject (see ¶ 46, “a rectangular box 208 (also referred to as a bounding box) may be shown around the identified object. The bounding box around the object can just pop up over the image, without actually being drawn. Box 208 when clicked can be used to navigate between images by searching for images with related region tags. For example, if the bounding box 208 contains a person then the region tag may be “person”. When user input is received selecting the bounding box 208 the region tag may be used as a query to retrieve images.”, also see ¶ 88-89); 
and returning to a user device the image comprising the bounding boxes and the at least one associated classification label (see ¶ 46, “the user may be able to see what objects were identified in the selected image 130 by moving the cursor, for example, over the display of the selected image 130 in the window 200. When the cursor is situated over an identified object, the identified object may be indicated or highlighted as such”, also see ¶ 47, “if the user moves the cursor over one of the people shown in the selected image 130, a rectangular box will be displayed over the person. If the user then clicks anywhere in the rectangular box the term “person” may be added to the query term entry box so that it comprises two query terms—“car” and “person”.”, also see ¶¶ 48 and 88-89).

Claim 14 recites a system to perform the method recited in claim 5. Therefore the rejection of claim 5 above applies equally here.
Claim 15 recites a system to perform the method recited in claim 6. Therefore the rejection of claim 6 above applies equally here.
Claim 16 recites a system to perform the method recited in claim 7. Therefore the rejection of claim 7 above applies equally here.

Regarding claim 19. 
El-Saban teaches a computer storage medium comprising executable instructions that, when executed by a processor of a machine (see ¶ 95, “Computing-based device 900 comprises one or more processors 902”), cause the machine to perform acts comprising: receiving an image having at least one subject (see fig 4 element 402, untagged images, also see ¶ 42, “image tagging server 102 receives an untagged image 402 and generates a tagged image 404. A tagged image 404 is one that has one or more tags associated with it where a tag describes a feature of the image.”); 
submitting the image to a trained visual intent detector, the trained visual intent detector being trained to identify a plurality of subjects in the image (see ¶ 57, “The objection recognition module 406 is configured to identify objects in the images, classify the identified objects and assign the objects one or more tags based on the classification. The objection recognition module 404 may be configured to classify elements of the image into one of a fixed number of object classes using a discriminative technique. For example, a trained random decision forest may be used to classify the pixels of the image using pixel difference features.”, also see ¶¶ 55-59); 
receiving from the trained visual intent classifier at least one classification label from a taxonomy and an associated bounding box, the at least one classification label corresponding to the at least one subject of the image (see ¶ 57-58, “The objection recognition module 406 is configured to identify objects in the images, classify the identified objects and assign the objects one or more tags based on the classification.”, also see ¶ 44) and the associated bounding box delineating the bounds of the at least one subject (see ¶ 46, “a rectangular box 208 (also referred to as a bounding box) may be shown around the identified object. The bounding box around the object can just pop up over the image, without actually being drawn. Box 208 when clicked can be used to navigate between images by searching for images with related region tags. For example, if the bounding box 208 contains a person then the region tag may be “person”. When user input is received selecting the bounding box 208 the region tag may be used as a query to retrieve images.”, also see ¶ 88-89); 
based on the at least one classification label, the bounding box, or both initiating at least one of: triggering a query related to the image (see ¶ 89, “The image search and navigation module 104 retrieves the tags associated with the selected image or the selected object 802 and displays the image tags for the selected image or object in a graphical user interface 804”);
causing presentation of information to help the user formulate a query related to the image (see figures 2-3 showing images with suggested tags such as person, car and street, also see ¶ 23, “automatically mapping the natural language query terms to one or more image tags”, also see ¶¶ 88-90 and figures 7 and 8 describing different ways to suggest a query based on images); 
and initiating a visual search using a data store (see ¶ 41, “the mapped tags are then used to identify and retrieve images that match the natural language query terms and/or phrase. The identified images (or part thereof or a version thereof) are then provided to the user (e.g. via an end-user device 116)”, also see ¶ 85, “the image search and navigation module 104 uses the mapped image tags to identify and retrieve one or more imaged from the tagged images database that match the natural language query terms and/or phrases.”).

Regarding claim 20. 
El-Saban teaches the medium of claim 19.
El-Saban further teaches passing the at least one classification label and the associated bounding box to a suppression model, the suppression model suppressing at least one classification label along with its associated bounding box (see ¶ 75, “The ontology distances generated by the ontology distance module 526 are also provided to the threshold module 522 where any distances above a certain threshold are discarded or ignored and any distances that fall below the predetermined threshold are provided to the selection module 524 where they provide a vote for the corresponding tag.”).
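For illustration only, the threshold behavior quoted from ¶ 75 (distances above a threshold are discarded, distances below it are passed on) can be sketched as follows. The sketch is not taken from El-Saban or from the application; the `Detection` record, the function name `suppress`, and the treatment of a distance exactly equal to the threshold as suppressed are assumptions, and pairing each label with a bounding box follows the claim language rather than the quoted passage.

```python
from typing import NamedTuple, Tuple


class Detection(NamedTuple):
    """Hypothetical record pairing a label and its box with a distance."""
    label: str
    box: Tuple[int, int, int, int]  # (x, y, width, height)
    distance: float


def suppress(detections, threshold):
    """Split detections into (kept, suppressed) lists by distance.

    Distances strictly below the threshold are kept; all others are
    suppressed together with their bounding boxes.
    """
    kept, suppressed = [], []
    for det in detections:
        (kept if det.distance < threshold else suppressed).append(det)
    return kept, suppressed


detections = [
    Detection("person", (10, 10, 40, 80), 0.2),
    Detection("car", (60, 30, 90, 50), 0.7),
    Detection("street", (0, 90, 200, 30), 0.4),
]
kept, suppressed = suppress(detections, threshold=0.5)
print([d.label for d in kept])
print([d.label for d in suppressed])
```

Because each label travels with its box in a single record, suppressing a label in this sketch necessarily suppresses the associated bounding box as well.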

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over El-Saban et al. (US 2015/0331929 A1) in view of Li Fei-Fei et al. ("Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories," 2004).

Regarding claim 2. 
El-Saban teaches the method of claim 1,
El-Saban further teaches wherein the taxonomy includes categories (see Fig. 4, also ¶ 44 and ¶¶ 57-58: classification of objects, landmark, face, age, scene, text, gender and expressions), but does not teach the entire list claimed, comprising: animal; two-dimensional artwork; three-dimensional artwork; barcode; book; cosmetics; electronics; face; people; fashion; food or drink; gift; home_or_office_furnishing_or_decor; logo; man made structure; map; money; musical instrument; nature_object; newspaper; plant; productivity; school or office supply; sports or outdoor_accessories; tatoo; toy; training-workout_item; vehicle; packaged product; and other.
Li Fei-Fei teaches categories comprising: animal; two-dimensional artwork; three-dimensional artwork; barcode; book; cosmetics; electronics; face; people; fashion; food or drink; gift; home_or_office_furnishing_or_decor; logo; man made structure; map; money; musical instrument; nature_object; newspaper; plant; productivity; school or office supply; sports or outdoor_accessories; tatoo; toy; training-workout_item; vehicle; packaged product; and other (see Fig. 2e on page 5 and Fig. 7 on page 9, the 101 object categories and the background clutter category, which include a list of categories that can obviously be mapped to the list claimed above).
Both El-Saban and Li Fei-Fei pertain to the problem of object categorization, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine El-Saban and Li Fei-Fei to add a list of categories as listed above to the categories trained in El-Saban to classify and label objects. The motivation for doing so would be to increase the likelihood of labeling objects with reduced error, “As the number of training examples increases, we observe that the shape model is more defined and structured with reducing variance.” (See Li Fei-Fei, e.g., page 5, right column, second paragraph).
Claim 11 recites a system to perform the method recited in claim 2. Therefore the rejection of claim 2 above applies equally here.

Claims 3-4 and 12-13 are rejected under 35 USC 103 as being unpatentable over El-Saban et al. (US 2015/0331929 A1) in view of Pons et al. ("Multi-task, multi-label and multi-domain learning with residual convolutional networks for emotion recognition", 2018).

Regarding claim 3. 
El-Saban teaches the method of claim 1,
El-Saban does not teach wherein the trained visual intent classifier comprises a MobileNet backbone trained using an error function comprising two multilabel classification losses, a first multilabel classification loss being a multilabel elementwise sigmoid loss and a second multilabel classification loss being a multilabel softmax loss.
Pons teaches wherein the trained visual intent classifier comprises a MobileNet backbone trained using an error function comprising two multilabel classification losses, a first multilabel classification loss being a multilabel elementwise sigmoid loss and a second multilabel classification loss being a multilabel softmax loss (see page 3, “a selective soft-max cross entropy with the objective of not penalizing the training of a task when feeding the model with images from another task”, also in page 3 under proposed approach, right column first paragraph “novel dataset-wise selective sigmoid cross-entropy loss function to address multi-task, multi-label and multidomain problems”, i.e. uses two multilabel classification losses).
Both El-Saban and Pons pertain to the problem of multi-label object recognition, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine El-Saban and Pons to incorporate Pons's teaching of a first multilabel classification loss being a multilabel elementwise sigmoid loss and a second multilabel classification loss being a multilabel softmax loss into El-Saban's multilabel classification of objects in an image and query. The motivation for doing so would be to handle multiple classification tasks, “The proposed loss function addresses the problem of learning multiple tasks with heterogeneously labeled data, improving previous multi-task approaches” (See Pons Abstract).
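For illustration only, the two kinds of loss named in the limitation can be sketched in their generic textbook forms. This is a minimal sketch under stated assumptions: it is neither Pons's dataset-wise selective loss nor the error function of the claimed classifier. The function names are hypothetical, and the "multilabel softmax loss" is assumed here to be the cross-entropy between the softmax of the logits and the target vector normalized to sum to one.

```python
import math


def sigmoid_bce(logits, targets):
    """Mean elementwise sigmoid (binary) cross-entropy over all labels.

    Uses the numerically stable form max(z, 0) - z*t + log(1 + exp(-|z|)).
    """
    total = 0.0
    for z, t in zip(logits, targets):
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def softmax_ce(logits, targets):
    """Softmax cross-entropy against the normalized multi-hot target.

    Assumption: positive labels share the target probability mass equally.
    """
    n_pos = sum(targets)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t / n_pos * (z - log_norm) for z, t in zip(logits, targets))


# Hypothetical two-label example with uninformative logits.
example_sigmoid = sigmoid_bce([0.0, 0.0], [1, 0])
example_softmax = softmax_ce([0.0, 0.0], [1, 0])
```

The sigmoid form scores each label independently, while the softmax form makes the labels compete for probability mass, which is why the two are treated as distinct losses.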

Regarding claim 4. 
El-Saban teaches the method of claim 1,
El-Saban does not teach wherein the visual intent classifier is trained using a [equation image: media_image1.png].
Pons teaches the cross-entropy loss given by E [equation image: media_image1.png] (see page 3, under proposed approach, right column, [equation image: media_image2.png]).
The motivation utilized in the combination of claim 3 applies equally to claim 4.

Claim 12 recites a system to perform the method recited in claim 3. Therefore the rejection of claim 3 above applies equally here.
Claim 13 recites a system to perform the method recited in claim 4. Therefore the rejection of claim 4 above applies equally here.

Claim 9 is rejected under 35 USC 103 as being unpatentable over El-Saban et al. (US 2015/0331929 A1) in view of Liston et al. (US 2018/0189951 A1).

Regarding claim 9. 
El-Saban teaches the method of claim 8,
El-Saban does not teach wherein the trained visual intent detection model is trained using both web images and images collected from imaging devices.
Liston teaches wherein the trained visual intent detection model is trained using both web images and images collected from imaging devices (see ¶ 22, “The image capture devices 114 and 118 may be, for example, digital cameras, video cameras (e.g., security cameras, web cameras), streaming cameras, etc. that capture still or moving images of one or more persons and provides these images to the training data system 112 or the a semantic image segmentation device 116”).
Both El-Saban and Liston pertain to the problem of pre-labeled training data, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine El-Saban and Liston to use both web images and images collected from imaging devices to train the visual intent model. The motivation for doing so would be to increase the training of different objects, “The segmentation data generation system automatically generates a mask of the training image to delineate the object from the background and, based on the mask automatically generates a masked image. The masked image includes only the object present in the training image” (See Liston e.g. Abstract).

Claims 17-18 are rejected under 35 USC 103 as being unpatentable over El-Saban et al. (US 2015/0331929 A1) in view of Howard et al. (“MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”, 2017).

Regarding claim 17. 
El-Saban teaches the system of claim 10,
El-Saban does not teach wherein the visual intent detection model comprises: a first series of convolutional layers that represent a subset of layers of a VGG-16 detection model; a second series of convolutional layers comprising: a 3 x 3 x 1024 convolutional layer; and a 1 x 1 x 1024 convolutional layer; a detection layer; and a non-maximum suppression layer.
Howard teaches a first series of convolutional layers that represent a subset of layers of a VGG-16 detection model (see page 7, section 4.5 face attribution and 4.6 object detection, “The Faster-RCNN model evaluates 300 RPN proposal boxes per image.”); a second series of convolutional layers comprising: a 3 x 3 x 1024 convolutional layer; and a 1 x 1 x 1024 convolutional layer; a detection layer; and a non-maximum suppression layer (see page 4 and table 1, where 3d convolution layers are presented, which teach all the layers claimed above).
Both El-Saban and Howard pertain to the problem of object detection, thus being analogous. It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine El-Saban and Howard to include different size layers of a convolutional neural network as claimed in the above limitation. The motivation for doing so would be to choose the right size of layers in a convolutional neural network, “a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depthwise separable convolutions to build light weight deep neural networks. We introduce two simple global hyperparameters that efficiently tradeoff between latency and accuracy. These hyper-parameters allow the model builder to choose the right sized model for their application based on the constraints of the problem.” (See Howard e.g. Abstract).
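For illustration only, the efficiency of the depthwise separable convolutions mentioned in the quoted abstract can be checked with a short multiply-add count. This is a minimal sketch under stated assumptions: the function names and the example layer sizes (3 x 3 kernel, 512 input channels, 512 output channels, 14 x 14 feature map) are hypothetical and are not taken from the claims. The cost formulas are the standard ones for a full convolution versus a depthwise convolution followed by a 1 x 1 pointwise convolution.

```python
def standard_conv_cost(k, m, n, f):
    """Multiply-adds of a k x k convolution, m input and n output channels,
    applied over an f x f feature map (stride 1, same padding assumed)."""
    return k * k * m * n * f * f


def depthwise_separable_cost(k, m, n, f):
    """Multiply-adds of a k x k depthwise convolution plus a 1 x 1 pointwise
    convolution for the same channel counts and feature map size."""
    depthwise = k * k * m * f * f
    pointwise = m * n * f * f
    return depthwise + pointwise


# Hypothetical layer sizes chosen only to show the order of magnitude.
k, m, n, f = 3, 512, 512, 14
ratio = depthwise_separable_cost(k, m, n, f) / standard_conv_cost(k, m, n, f)
```

Algebraically the ratio reduces to 1/n + 1/k², so for a 3 x 3 kernel the separable form needs roughly one eighth to one ninth of the multiply-adds.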

Regarding claim 18. 
El-Saban teaches the system of claim 17,
El-Saban does not teach wherein the second series of convolutional layers further comprise: a 3 x 3 x 512 convolutional layer; a 1 x 1 x 256 convolutional layer; a 3 x 3 x 256 convolutional layer; and a 1 x 1 x 128 convolutional layer.
Howard teaches a 3 x 3 x 512 convolutional layer; a 1 x 1 x 256 convolutional layer; a 3 x 3 x 256 convolutional layer; and a 1 x 1 x 128 convolutional layer (see page 4 and table 1, where 3d convolution layers are presented, which teach all the layers claimed above).
The motivation utilized in the combination of claim 17 applies equally to claim 18.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IMAD M KASSIM whose telephone number is (571)272-2958. The examiner can normally be reached Mon-Fri, 7:30-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J. Huntley, can be reached at (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/I.K./Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129