DETAILED ACTION
In response to communication filed on 20 September 2022, claims 1, 9 and 18 are amended. Claims 1-20 are pending.
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 20 September 2022 has been entered.

Response to Arguments
Applicant’s amendments regarding claim objections have been considered and the objections have been withdrawn. 

Applicant’s arguments, see “Rejections under 35 U.S.C. § 101” are related to newly added limitations and those limitations are addressed in the rejection below. 

Applicant’s arguments, see “Rejections under 35 U.S.C. § 103” are not persuasive since the arguments are related to newly added limitations and are addressed in the rejection below. 

Claim Interpretation
Claim 2 recites “on the composite query using at least an object identified”. The claim does not recite functionality, but instead recites what the identified object is used for. Examiner suggests amending the claim to recite the functionality performed by the claimed method, instead of reciting what the claim elements are used for.

Claims 3-4, 12-13 and 19-20 recite “uses a trained machine learned model”. The claims do not recite functionality, but instead recite what the trained machine learning model is used for. Examiner suggests amending the claims to recite the functionality performed by the claimed method, instead of reciting what the claim elements are used for.

Claims 5 and 6 recite “on the composite query using at least the object identified”. The claims do not recite functionality, but instead recite what the identified object is used for. Examiner suggests amending the claims to recite the functionality performed by the claimed method, instead of reciting what the claim elements are used for.

Claim 11 recites “uses a multi-pass approach”. The claim does not recite functionality, but instead recites what the multi-pass approach is used for. Examiner suggests amending the claim to recite the functionality performed by the claimed method, instead of reciting what the claim elements are used for.

Claims 15 and 16 recite “search results includes using at least the object identified”. The claims do not recite functionality, but instead recite what the identified object is used for. Examiner suggests amending the claims to recite the functionality performed by the claimed method, instead of reciting what the claim elements are used for.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Step 1:
Claims 1-20 are recited as being directed to a “method”. Thus claims 1-20 have been identified to be directed towards the appropriate statutory category. Below is further analysis related to step 2.

Regarding claim 1, 
Step 2A: Prong One:
Claim 1 recites limitations:
generating text by analyzing an image of the visual input for first semantic information; 
generating,… a composite query based on a combination of the textual query and the text based on the visual input; 
searching a search data structure including video content based on the composite query, the data structure including the video content, and metadata including second semantic information associated with a frame of the video content;
generating,… search results based on the searching of the data structure,  the search results being filtered based on the first semantic information and the second semantic information… 
These claim limitations appear to be reciting a “Mental Process” including evaluation which may be performed in a human mind. 
A human being can mentally evaluate an image to determine semantic information and generate text based on that analysis. A human being can mentally search video content based on the generated composite query and the metadata associated with a frame of the video. A human being can mentally generate a composite query, such as a plan for how to determine search results, and then generate search results based on the composite query by filtering the results based on specific information.
Step 2A: Prong Two:
Claim 1 further recites limitations:
… by the computing device,…	
These claim limitations appear to merely add generic computer components that execute the abstract idea within a computer device (see MPEP 2106.05(b)) and do not appear to integrate the abstract idea into a practical application.
Claim 1 further recites limitations:
receiving at a first time, by a computing device, a textual query; 
receiving at a second time after the first time, by the computing device, a visual input associated with the textual query;
These claim limitations as a whole have been identified as insignificant extra-solution activity. Per MPEP 2106.05(g) “An example of pre-solution activity is a step of gathering data for use in a claimed process, e.g., a step of obtaining information about credit card transactions, which is recited as part of a claimed process of analyzing and manipulating the gathered information by a series of steps in order to detect whether the transactions were fraudulent”. Similarly the claim limitations as a whole above appear to be gathering data in terms of textual and visual input query being received and transmitted and do not appear to integrate the abstract idea into a practical application.
Claim 1 further recites limitations:
… the search results including a plurality of links to content.
communicating the search results in response to the textual query.
These claim limitations as a whole have been identified as post-solution activity. According to MPEP 2106.05(g) “An example of post-solution activity is an element that is not integrated into the claim as a whole, e.g., a printer that is used to output a report of fraudulent transactions, which is recited in a claim to a computer programmed to analyze and manipulate information about credit card transactions in order to detect whether the transactions were fraudulent” and also “Cutting hair after first determining the hair style”. Similarly the claim limitations as a whole above appear to be merely formatting results in a specific format for the purpose of communicating the generated search results and do not appear to integrate the abstract idea into a practical application.
Step 2B:
Claim 1 further recites limitations:
… by the computing device,…	
These claim limitations appear to merely add generic computer components that execute the abstract idea within a computer device (see MPEP 2106.05(b)) and do not appear to amount to significantly more.
Claim 1 further recites limitations:
receiving at a first time, by a computing device, a textual query; 
receiving at a second time after the first time, by the computing device, a visual input associated with the textual query;
These claim limitations as a whole have been identified as insignificant extra-solution activity. Per MPEP 2106.05(g) “An example of pre-solution activity is a step of gathering data for use in a claimed process, e.g., a step of obtaining information about credit card transactions, which is recited as part of a claimed process of analyzing and manipulating the gathered information by a series of steps in order to detect whether the transactions were fraudulent”. Similarly the claim limitations as a whole above appear to be gathering data in terms of input information along with search results being received and transmitted and appear to be conventional computer functionality. Also, MPEP 2106.05(d)(II) has identified “Receiving or transmitting data over a network, e.g., using the Internet to gather data” as conventional computer technology. Similarly, the claim limitations identified above appear to be receiving data. As a result, these claim limitations as a whole do not appear to amount to significantly more than the abstract idea itself.
Claim 1 further recites limitations:
… the search results including a plurality of links to content.
communicating the search results in response to the textual query.
These claim limitations as a whole have been identified as insignificant extra-solution activity. According to MPEP 2106.05(g) “An example of post-solution activity is an element that is not integrated into the claim as a whole, e.g., a printer that is used to output a report of fraudulent transactions, which is recited in a claim to a computer programmed to analyze and manipulate information about credit card transactions in order to detect whether the transactions were fraudulent” and also “Cutting hair after first determining the hair style”. Similarly the claim limitations as a whole above appear to be merely formatting results in a specific format for  communicating the generated search results and appear to be conventional computer functionality. Also, MPEP 2106.05(d)(II) has identified “Receiving or transmitting data over a network” as conventional computer technology. Similarly, the claim limitations identified above appear to be transmitting data. As a result, these claim limitations as a whole do not appear to amount to significantly more than the abstract idea itself.
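For illustration only, the steps recited in claim 1 (combining the textual query with text derived from the visual input, searching on the composite query, and filtering on the first and second semantic information) can be sketched as below. This is a minimal sketch and is not drawn from the application or the record: the names `VideoEntry`, `build_composite_query` and `search_and_filter`, the use of keyword sets for semantic information, and the term-overlap matching rule are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VideoEntry:
    """One record in a hypothetical search data structure."""
    link: str
    # Assumed representation of the "second semantic information"
    # associated with a frame of the video content.
    frame_semantics: set[str] = field(default_factory=set)


def build_composite_query(textual_query: str, image_text: list[str]) -> str:
    """Combine the textual query with text generated from the visual input."""
    return " ".join([textual_query, *image_text])


def search_and_filter(
    entries: list[VideoEntry],
    composite_query: str,
    first_semantics: set[str],
) -> list[str]:
    """Search on the composite query, then filter on both sets of semantics."""
    terms = set(composite_query.lower().split())
    # Search step: keep entries whose frame metadata shares a query term.
    hits = [e for e in entries if terms & e.frame_semantics]
    # Filter step: keep hits whose frame metadata overlaps the image semantics.
    return [e.link for e in hits if first_semantics & e.frame_semantics]
```

Under these assumptions, the search results are a list of links, which corresponds to the recited “plurality of links to content”.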

Regarding claim 9, 
Step 2A: Prong One:
Claim 9 recites limitations:
searching,… a search data structure including video content and metadata including first semantic information associated with a frame of the video content;
generating,… search results based on searching of the search data structure; 
generating,…. textual metadata by analyzing an image of the visual input for second semantic information; 
filtering,… the search results based on the first semantic information and the second semantic information; and 
generating,… filtered search results based on the filtering, the filtered search results…
These claim limitations appear to be reciting a “Mental Process” including evaluation which may be performed in a human mind. 
A human being can mentally search video content based on the metadata of the frames of the video content. A human being can mentally apply evaluation to generate search results. A human being can mentally apply evaluation to generate textual metadata based on analysis of the visual input. A human being can mentally apply evaluation to filter the search results using the textual metadata and generate filtered search results based on the filtering.
Step 2A: Prong Two:
Claim 9 further recites limitations:
receiving, by a computing device, a textual query; 
receiving, by the computing device, a visual input associated with the query;
These claim limitations as a whole have been identified as insignificant extra-solution activity. Per MPEP 2106.05(g) “An example of pre-solution activity is a step of gathering data for use in a claimed process, e.g., a step of obtaining information about credit card transactions, which is recited as part of a claimed process of analyzing and manipulating the gathered information by a series of steps in order to detect whether the transactions were fraudulent”. Similarly the claim limitations as a whole above appear to be gathering data in terms of textual and visual input query being received and transmitted and do not appear to integrate the abstract idea into a practical application.
Claim 9 further recites limitations:
… providing a plurality of links to content; and
communicating,…  the filtered search results in response to the textual query.
These claim limitations as a whole have been identified as post-solution activity. According to MPEP 2106.05(g) “An example of post-solution activity is an element that is not integrated into the claim as a whole, e.g., a printer that is used to output a report of fraudulent transactions, which is recited in a claim to a computer programmed to analyze and manipulate information about credit card transactions in order to detect whether the transactions were fraudulent” and also “Cutting hair after first determining the hair style”. Similarly the claim limitations as a whole above appear to be merely communicating the generated search results and do not appear to integrate the abstract idea into a practical application.
Claim 9 further recites limitations:
… by a computing device,…
… by the computing device,…	
These claim limitations appear to merely add generic computer components that execute the abstract idea within a computer device (see MPEP 2106.05(b)) and do not appear to integrate the abstract idea into a practical application.
Step 2B:
Claim 9 further recites limitations:
receiving, by a computing device, a textual query; 
receiving, by the computing device, a visual input associated with the query;
These claim limitations as a whole have been identified as insignificant extra-solution activity. Per MPEP 2106.05(g) “An example of pre-solution activity is a step of gathering data for use in a claimed process, e.g., a step of obtaining information about credit card transactions, which is recited as part of a claimed process of analyzing and manipulating the gathered information by a series of steps in order to detect whether the transactions were fraudulent”. Similarly the claim limitations as a whole above appear to be gathering data in terms of input information along with search results being received and transmitted and appear to be conventional computer functionality. Also, MPEP 2106.05(d)(II) has identified “Receiving or transmitting data over a network, e.g., using the Internet to gather data” as conventional computer technology. Similarly, the claim limitations identified above appear to be receiving data. As a result, these claim limitations as a whole do not appear to amount to significantly more than the abstract idea itself.
Claim 9 further recites limitations:
… providing a plurality of links to content; and
communicating,…  the filtered search results in response to the textual query.
These claim limitations as a whole have been identified as insignificant extra-solution activity. According to MPEP 2106.05(g) “An example of post-solution activity is an element that is not integrated into the claim as a whole, e.g., a printer that is used to output a report of fraudulent transactions, which is recited in a claim to a computer programmed to analyze and manipulate information about credit card transactions in order to detect whether the transactions were fraudulent” and also “Cutting hair after first determining the hair style”. Similarly the claim limitations as a whole above appear to be merely communicating the generated search results and appear to be conventional computer functionality. Also, MPEP 2106.05(d)(II) has identified “Receiving or transmitting data over a network” as conventional computer technology. Similarly, the claim limitations identified above appear to be transmitting data. As a result, these claim limitations as a whole do not appear to amount to significantly more than the abstract idea itself.
Claim 9 further recites limitations:
… by a computing device,…
… by the computing device,…	
These claim limitations appear to merely add generic computer components that execute the abstract idea within a computer device (see MPEP 2106.05(b)) and do not appear to amount to significantly more.

Regarding claim 18, 
Step 2A: Prong One:
Claim 18 recites limitations:
generating first semantic information associated with a frame of the video content;
performing,…  an object identification on an image of the visual input; 
generating, … second semantic information based on the object identification; and
These claim limitations appear to be reciting a “Mental Process” including evaluation which may be performed in a human mind. 
A human being can mentally apply evaluation to generate semantic information associated with a frame of the video content. A human being can also mentally perform object identification on the image of the visual input and generate semantic information based on the object identification.
Step 2A: Prong Two:
Claim 18 further recites limitations:
… by a computing device,…
… by the computing device,…	
These claim limitations appear to merely add generic computer components that execute the abstract idea within a computer device (see MPEP 2106.05(b)) and do not appear to integrate the abstract idea into a practical application.
Claim 18 further recites limitations:
receiving,… a video content; 
receiving,… a visual input that is associated with the video content; 
These claim limitations as a whole have been identified as insignificant extra-solution activity. Per MPEP 2106.05(g) “An example of pre-solution activity is a step of gathering data for use in a claimed process, e.g., a step of obtaining information about credit card transactions, which is recited as part of a claimed process of analyzing and manipulating the gathered information by a series of steps in order to detect whether the transactions were fraudulent”. Similarly the claim limitations as a whole above appear to be gathering data in terms of content and visual input being received and transmitted and do not appear to integrate the abstract idea into a practical application.
Claim 18 further recites limitations:
storing,… in a search data structure, the video content, the first semantic information, and the second semantic information in association with the video content.
These claim limitations as a whole have been identified as insignificant extra-solution activity, specifically post-solution activity. Per MPEP 2106.05(g) “when determining whether a claim integrates the judicial exception into a practical application in Step 2A Prong Two or recites significantly more in Step 2B is whether the additional elements add more than insignificant extra-solution activity to the judicial exception. The term "extra-solution activity" can be understood as activities incidental to the primary process or product that are merely a nominal or tangential addition to the claim”. MPEP 2106.05(g) also provides examples of activities that the courts have found to be insignificant extra-solution activity, one of which is “Consulting and updating an activity log”. Similarly, the claim limitations recited above as a whole appear to be reciting the process of storing information and do not appear to integrate the abstract idea into a practical application.
Step 2B:
Claim 18 further recites limitations:
… by the computing device,…	
These claim limitations appear to merely add generic computer components that execute the abstract idea within a computer device (see MPEP 2106.05(b)) and do not appear to amount to significantly more.
Claim 18 further recites limitations:
receiving,… a video content; 
receiving,… a visual input that is associated with the video content;
These claim limitations as a whole have been identified as insignificant extra-solution activity. Per MPEP 2106.05(g) “An example of pre-solution activity is a step of gathering data for use in a claimed process, e.g., a step of obtaining information about credit card transactions, which is recited as part of a claimed process of analyzing and manipulating the gathered information by a series of steps in order to detect whether the transactions were fraudulent”. Similarly the claim limitations as a whole above appear to be gathering data in terms of input information along with search results being received and transmitted and appear to be conventional computer functionality. Also, MPEP 2106.05(d)(II) has identified “Receiving or transmitting data over a network, e.g., using the Internet to gather data” as conventional computer technology. Similarly, the claim limitations identified above appear to be receiving data. As a result, these claim limitations as a whole do not appear to amount to significantly more than the abstract idea itself.
Claim 18 further recites limitations:
storing,… in a search data structure, the video content, the first semantic information, and the second semantic information in association with the video content.
These claim limitations as a whole have been identified as insignificant extra-solution activity, specifically post-solution activity. Per MPEP 2106.05(g) “when determining whether a claim integrates the judicial exception into a practical application in Step 2A Prong Two or recites significantly more in Step 2B is whether the additional elements add more than insignificant extra-solution activity to the judicial exception. The term "extra-solution activity" can be understood as activities incidental to the primary process or product that are merely a nominal or tangential addition to the claim”. MPEP 2106.05(g) also provides examples of activities that the courts have found to be insignificant extra-solution activity, one of which is “Consulting and updating an activity log”. Similarly, the claim limitations recited above as a whole appear to be reciting the process of storing information. Also, MPEP 2106.05(d)(II) has identified “Storing and retrieving information in memory” as conventional computer technology. Similarly, the claim limitations identified above appear to be storing the video content and the semantic information. As a result, these claim limitations as a whole do not appear to amount to significantly more than the abstract idea itself.

Regarding claims 2, 5-7, 10-11 and 14-17,
Step 2A: 
Claim 2 further recites limitations:
performing an object identification on the visual input; and 
performing a semantic query addition on the composite query using at least an object identified based on the object identification to generate the first composite query, wherein the search results are based on the first composite query.
Claim 5 further recites limitations:
determining if a first confidence level in the object identification satisfies a first condition; and 
performing the semantic query addition on the composite query using at least the object identified that satisfies the first condition to generate a second composite query, wherein the search results are based on the second composite query.
Claim 6 further recites limitations:
determining if a second confidence level in the object identification satisfies a second condition; and 
performing the semantic query addition on the composite query using at least the object identified that satisfies the second condition to generate a third composite query, wherein the search results are based on the third composite query.
Claim 7 further recites limitations:
wherein the second confidence level is higher than the first confidence level.
Claim 10 further recites limitations:
wherein the textual metadata is generated based on analyzing the visual input for semantic and visual entity information.
Claim 11 further recites limitations:
wherein the analyzing of the visual input uses a multi-pass approach.
Claim 14 further recites limitations:
wherein the search results of the textual query are filtered based on matching the textual metadata with textual metadata of videos of a video visual metadata library.
Claim 15 further recites limitations:
performing an object identification on the visual input; 
determining if a first confidence level in the object identification satisfies a first condition, wherein 
the filtering of the search results includes using at least the object identified that satisfies the first condition to generate a second composite query, and 
the search results are based on the second composite query.
Claim 16 further recites limitations:
the filtering of the search results includes using at least the object identified that satisfies the second condition to generate a third composite query, and 
the search results are based on the third composite query.
Claim 17 further recites limitations:
wherein the second confidence level is higher than the first confidence level.
These claim limitations appear to be reciting a “Mental Process” including evaluation which may be performed in a human mind. 
A human being can mentally apply evaluation to identify objects and perform semantic analysis to determine search results. A human being can mentally apply evaluation to determine a first confidence level in the object identification based on the condition in order to generate a query and determine search results. A human being can mentally apply evaluation to determine a second confidence level in the object identification based on the condition in order to generate a query and determine search results. A human being can mentally determine that the second confidence level is higher than the first confidence level. A human being can apply evaluation to generate textual metadata based on analysis of the visual input. A human being can evaluate the visual input using a multi-pass approach. A human being can also apply evaluation to filter search results based on textual metadata of videos. A human being can mentally perform object identification, determine a first confidence level, and filter search results, along with generating search results based on the determined query. A human being can mentally determine that the second confidence level is higher than the first confidence level.
There are no additional claim limitations that integrate the abstract idea into a practical application or amount to significantly more than the abstract idea.
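For illustration only, the confidence-gated semantic query addition recited in claims 5-7 can be sketched as below. This is a minimal sketch and is not drawn from the application or the record: the function name `add_identified_objects`, the dictionary of label-to-confidence detections, and the reading of each “condition” as a minimum confidence threshold are assumptions made for the example.

```python
def add_identified_objects(
    composite_query: str,
    detections: dict[str, float],
    threshold: float,
) -> str:
    """Append identified objects whose confidence meets the threshold.

    `detections` maps a hypothetical object label to its confidence level.
    Treating the claimed condition as `confidence >= threshold` is an
    assumption; the claims do not define the condition.
    """
    kept = [label for label, conf in detections.items() if conf >= threshold]
    return " ".join([composite_query, *kept])
```

Under this reading, a lower first threshold produces the second composite query and a higher second threshold (claims 7 and 17) produces a narrower third composite query.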

Regarding claims 3, 12 and 19,
Step 2A – Prong One: 
Claim 3 further recites limitations:
wherein the performing of the object identification…
Claim 12 further recites limitations:
wherein the analyzing of the visual input…
Claim 19 further recites limitations:
wherein the object identification…
These claim limitations appear to be reciting a “Mental Process” including evaluation which may be performed in a human mind. 
A human being can mentally apply evaluation to identify objects and analyze visual inputs. 
Step 2A – Prong Two:
Claims 3, 12 and 19 further recite:
… uses a trained machine learning model.
These claim limitations appear to merely add generic computer components that execute the abstract idea within a computer device (see MPEP 2106.05(b)) and do not appear to integrate the abstract idea into a practical application. Using a trained machine learning model is conventional computer technology.
Step 2B:
Claims 3, 12 and 19 further recite:
… uses a trained machine learning model.
These claim limitations appear to merely add generic computer components that execute the abstract idea within a computer device (see MPEP 2106.05(b)) and do not appear to amount to significantly more. These references provide evidence that machine learning models are conventional computer technology: Smith (US 2018/0322660 A1 – Abstract), Chen (US 2018/0314971 A1 – Abstract), Gauci et al. (US 2018/0314925 A1 – Abstract) and Kovacs et al. (US 2018/0314963 A1 – Abstract).

Regarding claims 4 and 13,
Step 2A – Prong One: 
Claim 4 further recites limitations:
the performing of the object identification …
… generates classifiers for objects in the visual input, and 
the performing of the semantic query addition includes generating the text based on the visual input based on the classifiers for the objects.
Claim 13 further recites limitations:
the analyzing of the visual input…
…generates classifiers for objects in the visual input, and
the filtering of the search results includes generating the textual metadata based on the classifiers for the objects. 
Step 2A – Prong Two:
Claims 4 and 13 further recite:
… uses a trained machine learning model.
the trained machine learned model…. 
These claim limitations appear to merely add generic computer components that execute the abstract idea within a computer device (see MPEP 2106.05(b)) and do not appear to integrate the abstract idea into a practical application. Using a trained machine learning model is conventional computer technology.
Step 2B:
Claims 4 and 13 further recite:
… uses a trained machine learning model, 
the trained machine learned model…. 
These claim limitations appear to merely add generic computer components that execute the abstract idea within a computer device (see MPEP 2106.05(b)) and do not appear to amount to significantly more. These references provide evidence that machine learning models are conventional computer technology: Smith (US 2018/0322660 A1 – Abstract), Chen (US 2018/0314971 A1 – Abstract), Gauci et al. (US 2018/0314925 A1 – Abstract) and Kovacs et al. (US 2018/0314963 A1 – Abstract).

Regarding claim 8,
Step 2A: 
Claim 8 further recites limitations:
wherein the first condition and second condition are configured by a user.
These claim limitations appear to be reciting a “Mental Process” including judgement which may be performed in a human mind. 
A human being can mentally apply judgement that conditions are configured by the user.
There are no additional claim limitations that integrate the abstract idea into a practical application or amount to significantly more than the abstract idea.

Regarding claim 20,
Step 2A – Prong One:
Claim 20 further recites limitations:
the performing of the object identification …
… generates classifiers for objects in the visual input 
the generating of the semantic information includes generating text based on the classifiers for the objects and
These claim limitations appear to be reciting a “Mental Process” including evaluation which may be performed in a human mind. 
A human being can mentally apply evaluation to perform object identification, generate classifiers, and perform semantic query addition to generate text based on the visual input using the classifiers. A human being can mentally apply evaluation to analyze visual input, generate classifiers, and filter search results based on the classifiers.
Step 2A – Prong Two:
Claim 20 further recites:
… uses a trained machine learning model.
the trained machine learned model…. 
These claim limitations appear to merely add generic computer components that execute the abstract idea within a computer device (see MPEP 2106.05(b)) and do not appear to integrate the abstract idea into a practical application. Using a trained machine learning model is conventional computer technology.
Claim 20 further recites limitations:
the storing includes storing the image and metadata associated with the image in an image data store.
These claim limitations as a whole have been identified as insignificant extra-solution activity, specifically post-solution activity. Per MPEP 2106.05(g) “when determining whether a claim integrates the judicial exception into a practical application in Step 2A Prong Two or recites significantly more in Step 2B is whether the additional elements add more than insignificant extra-solution activity to the judicial exception. The term "extra-solution activity" can be understood as activities incidental to the primary process or product that are merely a nominal or tangential addition to the claim”. MPEP 2106.05(g) also provides examples of activities that the courts have found to be insignificant extra-solution activity, one of which is “Consulting and updating an activity log”. Similarly, the claim limitations recited above as a whole appear to be reciting the process of storing information and do not appear to integrate the abstract idea into a practical application.
Step 2B:
Claim 20 further recites:
… uses a trained machine learning model, 
the trained machine learned model…. 
These claim limitations appear merely to add the use of generic computer components, which merely execute the abstract idea within a computer device (see MPEP 2106.05(b)), and do not appear to amount to significantly more. These references provide evidence that machine learning models are conventional computer technology: Smith (US 2018/0322660 A1 – Abstract), Chen (US 2018/0314971 A1 – Abstract), Gauci et al. (US 2018/0314925 A1 – Abstract) and Kovacs et al. (US 2018/0314963 A1 – Abstract).
Claim 18 further recites limitations:
the storing includes storing the image and metadata associated with the image in an image data store. 
These claim limitations as a whole have been identified as insignificant extra-solution activity, specifically a post-solution activity. Per MPEP 2106.05(g) “when determining whether a claim integrates the judicial exception into a practical application in Step 2A Prong Two or recites significantly more in Step 2B is whether the additional elements add more than insignificant extra-solution activity to the judicial exception. The term "extra-solution activity" can be understood as activities incidental to the primary process or product that are merely a nominal or tangential addition to the claim”. MPEP 2106.05(g) also provides examples of activities that the courts have found to be insignificant extra-solution activity, one of which is “Consulting and updating an activity log”. Similarly, the above claim limitations as a whole appear to recite the process of storing information. Also, MPEP 2106.05(d)(II) has identified “Storing and retrieving information in memory” as conventional computer technology. Similarly, the claim limitations identified above appear to be storing association between influence scores. As a result, these claim limitations as a whole do not appear to amount to significantly more than the abstract idea itself.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-4, 9-10 and 12-14 are rejected under 35 U.S.C. 103 as being unpatentable over Joo (US 2015/0339348 A1, hereinafter “Joo”) in view of Zhang et al. (US 2007/0255755 A1, hereinafter “Zhang”) further in view of Lester (US 2017/0249339 A1, hereinafter “Lester”).

Regarding claim 1, Joo teaches
A method, comprising: (see Joo, [0073] “a method in which search is performed”).
receiving at a first time, by a computing device, a textual query; (see Joo, [0072] “a query includes a first query component, in which a query type is a text”; [0076] “a computing device”).
receiving at a second time after the first time, by the computing device, a visual input associated with the textual query; (see Joo, [0072] “a query includes… a second query component, in which the query type is an image, the query includes a plurality of query types”; [0226] “after or in response to the user inputting a first query input to the query input window, the controller 4720 may determine to automatically switch from a first ready state (i.e., first state) in which the first query component can be received to a second ready state (i.e., second state) in which the second query component can be received” – Therefore visual input is received after the textual query; [0076] “a computing device”).
generating text by analyzing an image of the visual input for first semantic information; (see Joo, [0262] “when a query type of the second query component 6707 is an image, the second query component understanding component 6712 may be referred to as an image processing component. Features which are extracted from the second query component 6707 by the second query component understanding component 6712 may be allocated a descriptor keyword by a second query component feature component 6722… the second query component feature component 6722 may allocate the description keyword to at least one of image text features and image vision features of the identified second query component 6707”; [0157] “When an image or a video is received as a query component, a keyword may be acquired from the image or video by using image recognition or an OCR operation. A search may be performed by using the acquired keyword”).
generating, by the computing device, a composite query based on a combination of the textual query and the text based on the visual input; (see Joo, [0207] “the query input device may generate a combination query based on a plurality of query components”; [0211] “the combination query may include a keyword or a main feature (for example, a feature included in an image) that is added into a query component. Furthermore, as another example, the combination query may include extension keywords generated from the query components”; [0076] “a computing device”).
searching a search data structure including video content based on the composite query, the data structure including the video content, and other information in the database (see Joo, [0256] “A search method(s) 6620 denotes an algorithm(s) that is used to match a query with a database so as to select documents depending on a suitability of the documents. For example, in a video search system… a thumbnail image of a video may be matched with visual content by a single search method”; [0252] “When the user 1 selects a search button 6130, the query input device 6100 may perform a search based on the accumulated query component( s) and the detected query type (s). For example, referring to FIGS. 64 and 65, the query input device 6100 may display a video 6140, in which Obama gives a speech, as a search result on the basis of an image 6102 of a speaking scene and a text 6113 "Obama”).
generating, by the computing device, search results based on the searching of the data structure,… (see Joo, [0256] “A search method(s) 6620 denotes an algorithm(s) that is used to match a query with a database so as to select documents depending on a suitability of the documents. For example, in a video search system… a thumbnail image of a video may be matched with visual content by a single search method”; [0252] “When the user 1 selects a search button 6130, the query input device 6100 may perform a search based on the accumulated query component( s) and the detected query type (s). For example, referring to FIGS. 64 and 65, the query input device 6100 may display a video 6140, in which Obama gives a speech, as a search result on the basis of an image 6102 of a speaking scene and a text 6113 "Obama”; [0183] “a plurality of search results may be acquired (i.e., determined or obtained) based on a query received through a query input window”; [0076] “a computing device”) the first semantic information and other information (see Joo, [0272] “a descriptor may be determined for each of the normalized patches. The descriptor may be a description of a patch that may be added as a feature used for an image search”; [0256] “in a video search system, while a division search method is processing query text keywords and is matching the query text keywords with voice recognition information, a thumbnail image of a video may be matched with visual content by a single search method”).
and communicating the search results in response to the textual query (see Joo, [0083] “The query input window 210 may receive a first query component 211 (i.e., a first query input) corresponding to a first query type… The result display region 220 may include a list of response results 221 and 222 (i.e., search results)”; [0072] “a query includes a first query component, in which a query type is a text”). 
Joo does not explicitly teach the search data structure including metadata that includes second semantic information associated with a frame of the video content; the search results being filtered based on the first semantic information and the second semantic information; and the search results including a plurality of links to the video content.
	However, Zhang discloses a video search engine and also teaches
	search metadata including second semantic information associated with a frame of the video content; (see Zhang, [0030] “Metadata-based modality 105 begins by obtaining training video metadata 115 (e.g., author information, tag information, domain information, title information, referring URL, abstract, keyword, description, etc.)… The training video metadata 115 for each video clip can be obtained from the video file itself or from various Internet sites linking to the video clip. A text processing component 120 generates text information from the video metadata 115, and forwards the text information”; [0064] “indexing and searching a video database using dual modalities and possibly query profiling… the video clips are categorized using dual modalities and indexed. The categorization may be implemented by a dual modality categorization model 170, e.g., a metadata-based video classification model 160 and a content-based video classification model 165”).  
search based on the second semantic information and (see Zhang, [0060] “training the video classification system to be used in a video search engine… metadata, e.g., metadata 115, is obtained for the training set of video clips. The metadata may be obtained from human subjects, from the Internet, from the video clips themselves, etc. In step 615, a set of categories for categorizing the training set of videos are obtained”; [0064] “indexing and searching a video database using dual modalities and possibly query profiling… the video clips are categorized using dual modalities and indexed. The categorization may be implemented by a dual modality categorization model 170, e.g., a metadata-based video classification model 160 and a content-based video classification model 165”) the search results including a plurality of links to the video content; (see Zhang, [0056] “The search results 260 include the links for selecting from two categories, namely, "tom cruise in News Videos" or "tom cruise in movie videos." In one embodiment, the search component 275 may identify and return the related categories with the video results retrieved”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the functionality of searching metadata information and providing search results including links related to video content, as disclosed and taught by Zhang, in the system taught by Joo to yield the predictable results of improving time/space performance and reducing the over-fitting problem, because feature selection methods may be used for optimal searching (see Zhang, [0030] “generates a metadata-based video categorization model 160, which can be used to categorize video metadata on the Internet. The number of features may be large ( e.g., dozens of thousands). To improve time/space performance and reduce the over-fitting problem, feature selection methods (such as mutual information) may be used and the optimal number of features determined by cross validation may be selected”).
The proposed combination of Joo and Zhang does not explicitly teach the search results being filtered based on the first semantic information.
However, Lester discloses a convolutional neural network to identify features in images and also teaches
the search results being filtered based on different types of metadata (see Lester, [0036] “filter the search results based on different types of metadata”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the functionality of filtering information, performing object identification based on machine learning models, generating classifiers and performing semantic analysis, as disclosed and taught by Lester, in the system taught by the proposed combination of Joo and Zhang to yield the predictable results of improving the functioning of the computer itself because it saves data storage space and reduces network usage (see Lester, [0022] “provides improvements to the functioning of the computer itself because it saves data storage space and reduces network usage. Specifically, the computer hosting the collection of images to be searched is not required to maintain in data storage or repeatedly share over a network with the convolutional neural network classification information based on the trained semantic concepts for the images to be searched because the convolutional neural network, once trained, is configured to predict which features of the images correlated to content of a selected image subset without this information”).
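
For illustration only, the sequence of steps recited in claim 1 as mapped above (textual query, text derived from a visual input, composite query, search over video metadata, filtering, links returned) can be sketched as follows. This is a minimal sketch and is not taken from Joo, Zhang or Lester: every name (`Video`, `frame_tags`, `text_from_image`, `composite_query`, `search`, `filter_results`) is hypothetical, the image analysis is stubbed by a list of labels that is assumed to be already available, and simple keyword overlap stands in for whatever matching an actual search engine would use.

```python
from dataclasses import dataclass, field


@dataclass
class Video:
    link: str                                      # link returned in the search results
    title: str                                     # "other information" about the video
    frame_tags: set = field(default_factory=set)   # hypothetical frame-level metadata


def text_from_image(image_labels):
    # Stand-in for image analysis: the labels are assumed to be given.
    return " ".join(sorted(image_labels))


def composite_query(textual_query, image_text):
    # Combine the textual query with the text derived from the visual input.
    return f"{textual_query} {image_text}".strip()


def search(index, query):
    # Keep a video if any query term appears in its title or frame tags.
    terms = set(query.lower().split())
    return [
        v for v in index
        if terms & (set(v.title.lower().split()) | v.frame_tags)
    ]


def filter_results(results, image_terms):
    # Keep only results whose frame-level metadata shares a term with the image.
    wanted = set(image_terms)
    return [v for v in results if wanted & v.frame_tags]
```

Under these assumptions, a textual query "Obama" combined with an image labelled "speech" yields the composite query "Obama speech"; the search returns every video matching either term, and the filter narrows the results to videos whose frame metadata contains "speech".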

Regarding claim 9, Joo teaches
A method, comprising: (see Joo, [0073] “a method in which search is performed”; [0076] “a computing device”).
receiving, by a computing device, a textual query; (see Joo, [0072] “a query includes a first query component, in which a query type is a text”; [0076] “a computing device”).
receiving, by the computing device, a visual input associated with the textual query; (see Joo, [0072] “a query includes… a second query component, in which the query type is an image, the query includes a plurality of query types”; [0076] “a computing device”).
searching, by the computing device, a search data structure including video content and other information (see Joo, [0256] “A search method(s) 6620 denotes an algorithm(s) that is used to match a query with a database so as to select documents depending on a suitability of the documents. For example, in a video search system… a thumbnail image of a video may be matched with visual content by a single search method”; [0252] “When the user 1 selects a search button 6130, the query input device 6100 may perform a search based on the accumulated query component( s) and the detected query type (s). For example, referring to FIGS. 64 and 65, the query input device 6100 may display a video 6140, in which Obama gives a speech, as a search result on the basis of an image 6102 of a speaking scene and a text 6113 "Obama”; [0076] “a computing device”).
generating, by the computing device, search results based on the searching of the search data structure; (see Joo, [0256] “A search method(s) 6620 denotes an algorithm(s) that is used to match a query with a database so as to select documents depending on a suitability of the documents. For example, in a video search system… a thumbnail image of a video may be matched with visual content by a single search method”; [0252] “When the user 1 selects a search button 6130, the query input device 6100 may perform a search based on the accumulated query component( s) and the detected query type (s). For example, referring to FIGS. 64 and 65, the query input device 6100 may display a video 6140, in which Obama gives a speech, as a search result on the basis of an image 6102 of a speaking scene and a text 6113 "Obama”; [0183] “a plurality of search results may be acquired (i.e., determined or obtained) based on a query received through a query input window”; [0076] “a computing device”).
generating, by the computing device, textual metadata by analyzing an image of the visual input for second semantic information; (see Joo, [0262] “when a query type of the second query component 6707 is an image, the second query component understanding component 6712 may be referred to as an image processing component. Features which are extracted from the second query component 6707 by the second query component understanding component 6712 may be allocated a descriptor keyword by a second query component feature component 6722… the second query component feature component 6722 may allocate the description keyword to at least one of image text features and image vision features of the identified second query component 6707”; [0157] “When an image or a video is received as a query component, a keyword may be acquired from the image or video by using image recognition or an OCR operation. A search may be performed by using the acquired keyword”).
	searching based on other information and the second semantic information; (see Joo, [0272] “a descriptor may be determined for each of the normalized patches. The descriptor may be a description of a patch that may be added as a feature used for an image search”; [0256] “in a video search system, while a division search method is processing query text keywords and is matching the query text keywords with voice recognition information, a thumbnail image of a video may be matched with visual content by a single search method”).
providing a plurality of links to content; and (see Joo, [0084] “The result display region 220 may include a list of response results 221 and 222 (i.e., search results)… may include a thumbnail for an image document, some of text included in a document, a link for a searched document”).
communicating, by the computing device, the search results (see Joo, [0083] “The query input window 210 may receive a first query component 211 (i.e., a first query input) corresponding to a first query type… The result display region 220 may include a list of response results 221 and 222 (i.e., search results)”; [0072] “a query includes a first query component, in which a query type is a text”; [0076] “a computing device”) in response to the textual query (see Joo, [0083] “The query input window 210 may receive a first query component 211 (i.e., a first query input) corresponding to a first query type… The result display region 220 may include a list of response results 221 and 222 (i.e., search results)”; [0072] “a query includes a first query component, in which a query type is a text”; [0076] “a computing device”).
Joo does not explicitly teach the search data structure including metadata that includes second semantic information associated with a frame of the video content; filtering, by the computing device, the search results based on the first semantic information; generating, by the computing device, filtered search results based on the filtering, the filtered search results providing a plurality of links to content; and communicating, by the computing device, the filtered search results. 
	However, Zhang discloses a video search engine and also teaches
	search metadata including second semantic information associated with a frame of the video content; (see Zhang, [0030] “Metadata-based modality 105 begins by obtaining training video metadata 115 (e.g., author information, tag information, domain information, title information, referring URL, abstract, keyword, description, etc.)… The training video metadata 115 for each video clip can be obtained from the video file itself or from various Internet sites linking to the video clip. A text processing component 120 generates text information from the video metadata 115, and forwards the text information”; [0064] “indexing and searching a video database using dual modalities and possibly query profiling… the video clips are categorized using dual modalities and indexed. The categorization may be implemented by a dual modality categorization model 170, e.g., a metadata-based video classification model 160 and a content-based video classification model 165”).  
search based on the first semantic information (see Zhang, [0060] “training the video classification system to be used in a video search engine… metadata, e.g., metadata 115, is obtained for the training set of video clips. The metadata may be obtained from human subjects, from the Internet, from the video clips themselves, etc. In step 615, a set of categories for categorizing the training set of videos are obtained”; [0064] “indexing and searching a video database using dual modalities and possibly query profiling… the video clips are categorized using dual modalities and indexed. The categorization may be implemented by a dual modality categorization model 170, e.g., a metadata-based video classification model 160 and a content-based video classification model 165”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the functionality of searching metadata information and providing search results including links related to video content, as disclosed and taught by Zhang, in the system taught by Joo to yield the predictable results of improving time/space performance and reducing the over-fitting problem, because feature selection methods may be used for optimal searching (see Zhang, [0030] “generates a metadata-based video categorization model 160, which can be used to categorize video metadata on the Internet. The number of features may be large ( e.g., dozens of thousands). To improve time/space performance and reduce the over-fitting problem, feature selection methods (such as mutual information) may be used and the optimal number of features determined by cross validation may be selected”).
The proposed combination of Joo and Zhang does not explicitly teach filtering, by the computing device, the search results based on the first semantic information; generating, by the computing device, filtered search results based on the filtering, the filtered search results providing a plurality of links to content; and communicating, by the computing device, the filtered search results. 
However, Lester discloses a convolutional neural network to identify features in images and also teaches
	filtering, by the computing device, the search results based on different types of metadata (see Lester, [0036] “filter the search results based on different types of metadata”; [0018] “computer system”). 
	generating, by the computing device, filtered search results based on the filtering, the filtered search results (see Lester, [0078] “implemented in a computing system”; [0036] “filter the search results based on different types of metadata… the search results include a listing of”).
	displaying the filtered search results (see Lester, [0078] “implemented in a computing system”; [0036] “filter the search results based on different types of metadata… the search results include a listing of”; [0070] “the search results may be provided for display within the output section 602”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the functionality of filtering information, performing object identification based on machine learning models, generating classifiers and performing semantic analysis, as disclosed and taught by Lester, in the system taught by the proposed combination of Joo and Zhang to yield the predictable results of improving the functioning of the computer itself because it saves data storage space and reduces network usage (see Lester, [0022] “provides improvements to the functioning of the computer itself because it saves data storage space and reduces network usage. Specifically, the computer hosting the collection of images to be searched is not required to maintain in data storage or repeatedly share over a network with the convolutional neural network classification information based on the trained semantic concepts for the images to be searched because the convolutional neural network, once trained, is configured to predict which features of the images correlated to content of a selected image subset without this information”).

Regarding claim 2, the proposed combination of Joo, Zhang and Lester teaches
wherein the composite query is a first composite query, and wherein the generating of the first composite query comprises: (see Joo, [0207] “the query input device may generate a combination query based on a plurality of query components”; [0211] “the combination query may include a keyword or a main feature (for example, a feature included in an image) that is added into a query component. Furthermore, as another example, the combination query may include extension keywords generated from the query components”).
performing an object identification (see Lester, [0025] “to identify features in images containing representations of one or more objects such as foreground objects and background object”) on the visual input; and (see Joo, [0072] “a query includes… a second query component, in which the query type is an image, the query includes a plurality of query types”).
performing a semantic query addition on (see Lester, [0023] “the term “semantic concept” refers to the meaning used for understanding an object and/or environment of things.. The term “semantic concept” can be interchangeably used with the term “visual word” which captures the semantic space of a thing, and may be the target of an image search query”; [0025] “identifying relationships between the images and corresponding visual words, once identified, are likely to indicate that the corresponding image is more likely to be relevant to content identified… as the image search query”) the composite query (see Joo, [0207] “the query input device may generate a combination query based on a plurality of query components”) using at least an object identified based on the object identification (see Lester, [0025] “to identify features in images containing representations of one or more objects such as foreground objects and background object”) to generate the first composite query, (see Joo, [0207] “the query input device may generate a combination query based on a plurality of query components”; [0211] “the combination query may include a keyword or a main feature (for example, a feature included in an image) that is added into a query component. Furthermore, as another example, the combination query may include extension keywords generated from the query components”) wherein the search results are based on the first composite query (see Joo, [0212] “the search engine server 420 may perform a single search or the multimodal search according to the search mode, for processing the received query… the search engine server 420 may transmit a search result”; [0256] “The combination query 6610 may be processed by a plurality of the search methods 6620, thereby acquiring a search result”). The motivation for the proposed combination is maintained. 

Regarding claim 3, the proposed combination of Joo, Zhang and Lester teaches
wherein the performing of the object identification uses a trained machine learned model (see Lester, [0025] “The neural network, which can be a convolutional neural network, is trained to identify features in images containing representations of one or more objects such as foreground objects and background objects”). The motivation for the proposed combination is maintained. 

Regarding claim 4, the proposed combination of Joo, Zhang and Lester teaches
wherein the performing of the object identification uses a trained machine learned model, (see Lester, [0025] “The neural network, which can be a convolutional neural network, is trained to identify features in images containing representations of one or more objects such as foreground objects and background objects”).
the trained machine learned model generates classifiers for objects in (see Lester, [0022] “share over a network with the convolutional neural network classification information based on the trained semantic concepts for the images to be searched because the convolutional neural network, once trained, is configured to predict which features of the images correlated to content of a selected image subset without this information”) the visual input, and (see Joo, [0072] “a query includes… a second query component, in which the query type is an image, the query includes a plurality of query types”).
the performing of the semantic query addition includes generating the text (see Lester, [0023] “the term “semantic concept” refers to the meaning used for understanding an object and/or environment of things.. The term “semantic concept” can be interchangeably used with the term “visual word” which captures the semantic space of a thing, and may be the target of an image search query”; [0025] “identifying relationships between the images and corresponding visual words, once identified, are likely to indicate that the corresponding image is more likely to be relevant to content identified… as the image search query”) based on the visual input (see Joo, [0072] “a query includes… a second query component, in which the query type is an image, the query includes a plurality of query types”) based on the classifiers for the objects (see Lester, [0022] “share over a network with the convolutional neural network classification information based on the trained semantic concepts for the images to be searched because the convolutional neural network, once trained, is configured to predict which features of the images correlated to content of a selected image subset without this information”). The motivation for the proposed combination is maintained. 
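
For illustration only, the relationship recited in claim 4 between object classifiers and the semantic query addition can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Lester or Joo: the classifier output is assumed to be a mapping from object label to confidence score, the 0.5 threshold is arbitrary, and the names `labels_from_scores` and `add_to_query` are hypothetical.

```python
def labels_from_scores(scores, threshold=0.5):
    # Turn hypothetical classifier scores into text labels, highest score first.
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [label for label, score in ranked if score >= threshold]


def add_to_query(query, labels):
    # Semantic query addition: append labels not already present in the query.
    existing = query.lower().split()
    extra = [label for label in labels if label.lower() not in existing]
    return " ".join([query] + extra)
```

Under these assumptions, scores of 0.9 for "podium", 0.6 for "flag" and 0.1 for "cat" produce the labels "podium" and "flag", and adding them to the query "Obama flag" gives "Obama flag podium", since "flag" is already present.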


Regarding claim 10, the proposed combination of Joo, Zhang and Lester teaches
wherein the textual metadata is generated based on analyzing the visual input (see Joo, [0262] “the query type of the second query component 6707 is the image query type”; [0263] “A metadata analysis component 6714 may identify metadata associated with the second query component… The metadata may include a text, which is input for identifying a query component to be used for a search, in an URL path or a relevant text such as a text which is located in a webpage or a text-based document or is located near corresponding information for information (for example, an image or the like) built therein. The second query component feature component 6722 may identify keyword features based on an output of the metadata analysis component”) for semantic and visual entity information (see Lester, [0023] “the term “semantic concept” refers to the meaning used for understanding an object and/or environment of things.. The term “semantic concept” can be interchangeably used with the term “visual word” which captures the semantic space of a thing, and may be the target of an image search query”; [0025] “identifying relationships between the images and corresponding visual words, once identified, are likely to indicate that the corresponding image is more likely to be relevant to content identified… as the image search query”). The motivation for the proposed combination is maintained. 

Regarding claim 12, the proposed combination of Joo, Zhang and Lester teaches
wherein the analyzing of the visual input uses (see Joo, [0262] “the query type of the second query component 6707 is the image query type”; [0263] “A metadata analysis component 6714 may identify metadata associated with the second query component… The metadata may include a text, which is input for identifying a query component to be used for a search, in an URL path or a relevant text such as a text which is located in a webpage or a text-based document or is located near corresponding information for information (for example, an image or the like) built therein. The second query component feature component 6722 may identify keyword features based on an output of the metadata analysis component”) a trained machine learned model (see Lester, [0014] “for training a convolutional neural network to analyze image pixel data to identify features”). The motivation for the proposed combination is maintained. 

Regarding claim 13, the proposed combination of Joo, Zhang and Lester teaches
wherein the analyzing of the visual input uses (see Joo, [0262] “the query type of the second query component 6707 is the image query type”; [0263] “A metadata analysis component 6714 may identify metadata associated with the second query component… The metadata may include a text, which is input for identifying a query component to be used for a search, in an URL path or a relevant text such as a text which is located in a webpage or a text-based document or is located near corresponding information for information (for example, an image or the like) built therein. The second query component feature component 6722 may identify keyword features based on an output of the metadata analysis component”) a trained machine learned model, (see Lester, [0014] “for training a convolutional neural network to analyze image pixel data to identify features”). 
the trained machine learned model generates classifiers for objects in (see Lester, [0022] “share over a network with the convolutional neural network classification information based on the trained semantic concepts for the images to be searched because the convolutional neural network, once trained, is configured to predict which features of the images correlated to content of a selected image subset without this information”) the visual input, and (see Joo, [0072] “a query includes… a second query component, in which the query type is an image, the query includes a plurality of query types”).
the filtering of the search results includes (see Lester, [0036] “filter the search results based on different types of metadata”) generating the textual metadata based on (see Joo, [0262] “the query type of the second query component 6707 is the image query type”; [0263] “A metadata analysis component 6714 may identify metadata associated with the second query component… The metadata may include a text, which is input for identifying a query component to be used for a search, in an URL path or a relevant text such as a text which is located in a webpage or a text-based document or is located near corresponding information for information (for example, an image or the like) built therein. The second query component feature component 6722 may identify keyword features based on an output of the metadata analysis component”) the classifiers for the objects (see Lester, [0022] “share over a network with the convolutional neural network classification information based on the trained semantic concepts for the images to be searched because the convolutional neural network, once trained, is configured to predict which features of the images correlated to content of a selected image subset without this information”). The motivation for the proposed combination is maintained. 

Regarding claim 14, the proposed combination of Joo, Zhang and Lester teaches
wherein the search results of the textual query are filtered (see Lester, [0036] “filter the search results based on different types of metadata”; [0020] “In the text-based approach, the image search initiates a search by parsing keywords from the text-based user query that will drive the search”) based on matching the textual metadata with textual metadata of videos of a video visual metadata library (see Joo, [0256] “in a video search system, while a division search method is processing query text keywords and is matching the query text keywords with voice recognition information, a thumbnail image of a video may be matched with visual content by a single search method. The combination query 6610 may be processed by a plurality of the search methods 6620, thereby acquiring a search result”). The motivation for the proposed combination is maintained. 

Claims 5-7 and 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over Joo, Zhang and Lester in view of Angel et al. (US 10,902,263 B1, hereinafter “Angel”).

Regarding claim 5, the proposed combination of Joo, Zhang and Lester teaches
performing the semantic query addition on (see Lester, [0023] “the term “semantic concept” refers to the meaning used for understanding an object and/or environment of things... The term “semantic concept” can be interchangeably used with the term “visual word” which captures the semantic space of a thing, and may be the target of an image search query”; [0025] “identifying relationships between the images and corresponding visual words, once identified, are likely to indicate that the corresponding image is more likely to be relevant to content identified… as the image search query”) the composite query (see Joo, [0207] “the query input device may generate a combination query based on a plurality of query components”) using at least the object identified… (see Lester, [0025] “to identify features in images containing representations of one or more objects such as foreground objects and background object”) to generate a second composite query, (see Joo, [0211] “the combination query may include extension keywords generated from the query components”; [0207] “the query input device may generate a combination query based on a plurality of query components” – there are plurality of combination queries) wherein the search results (see Joo, [0212] “the search engine server 420 may perform a single search or the multimodal search according to the search mode, for processing the received query… the search engine server 420 may transmit a search result”; [0256] “The combination query 6610 may be processed by a plurality of the search methods 6620, thereby acquiring a search result”) are based on the second composite query (see Joo, [0211] “the combination query may include extension keywords generated from the query components”; [0207] “the query input device may generate a combination query based on a plurality of query components” – there are plurality of combination queries).
The proposed combination of Joo, Zhang and Lester does not explicitly teach determining if a first confidence level in the object identification satisfies a first condition; and object identified that satisfies the first condition. 
However, Angel discloses confidence scores for object identification and also teaches
determining if a first confidence level in the object identification satisfies a first condition; and (see Angel, [col14 lines1-6] “image that may be processed by remote image processing system… object representation 502 is identified as a "car" object with first confidence score”; [col22 lines52-62] “the local image processing system identifies a dog object in a particular image at a first confidence score… the confidence scores reported for the object identifications by the local and remote image processing systems”; [col5 lines55-57] “Objects that are identified as faces but that are not known to the user may not be preferentially reported to the user unless certain conditions are met” – there are plurality of conditions). 
object identified that satisfies the first condition (see Angel, [col5 lines55-57] “Objects that are identified as faces but that are not known to the user may not be preferentially reported to the user unless certain conditions are met” – there are plurality of conditions). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include functionality of confidence levels related to object identification based on conditions as being disclosed and taught by Angel in the system taught by the proposed combination of Joo, Zhang and Lester to yield the predictable results of improving the efficiency of the object detection module (see Angel, [col10 lines2-8] “To further improve the efficiency of object detection module 230 as compared to other image processing systems, object detection module 230 may be implemented using a smaller amount of initial training data, which may, in turn, return the size of the neural network implemented by the computer vision system of object detection module 230”). 


Regarding claim 6, the proposed combination of Joo, Zhang, Lester and Angel teaches
determining if a second confidence level in the object identification satisfies a second condition; and (see Angel, [col14 lines1-8] “image that may be processed by remote image processing system… object representation 504 is identified as a "traffic light" object with a second confidence score”; [col22 lines54-62] “but the remote image processing system identifies a Pug object in the same image… … the confidence scores reported for the object identifications by the local and remote image processing systems”). 
performing the semantic query addition on (see Lester, [0023] “the term “semantic concept” refers to the meaning used for understanding an object and/or environment of things... The term “semantic concept” can be interchangeably used with the term “visual word” which captures the semantic space of a thing, and may be the target of an image search query”; [0025] “identifying relationships between the images and corresponding visual words, once identified, are likely to indicate that the corresponding image is more likely to be relevant to content identified… as the image search query”) the composite query (see Joo, [0207] “the query input device may generate a combination query based on a plurality of query components”) using at least the object identified (see Lester, [0025] “to identify features in images containing representations of one or more objects such as foreground objects and background object”) that satisfies the second condition (see Angel, [col5 lines55-57] “Objects that are identified as faces but that are not known to the user may not be preferentially reported to the user unless certain conditions are met” – there are plurality of conditions) to generate a third composite query, (see Joo, [0211] “the combination query may include extension keywords generated from the query components”; [0207] “the query input device may generate a combination query based on a plurality of query components” – there are plurality of combination queries) wherein the search results (see Joo, [0212] “the search engine server 420 may perform a single search or the multimodal search according to the search mode, for processing the received query… the search engine server 420 may transmit a search result”; [0256] “The combination query 6610 may be processed by a plurality of the search methods 6620, thereby acquiring a search result”) are based on the third composite query (see Joo, [0211] “the combination query may include extension keywords generated from the query components”; [0207] “the query input device may generate a combination query based on a plurality of query components” – there are plurality of combination queries). The motivation for the proposed combination is maintained. 

Regarding claim 7, the proposed combination of Joo, Zhang, Lester and Angel teaches
wherein the second confidence level is higher than the first confidence level (see Angel, [col21 lines55-60] “the object identifications generated by local image processing system 406 may be more generic (e.g., identifying a "vehicle") and typically at a higher confidence score, while the object identifications generated by remote image processing system 404 may be more specific (e.g., identifying a "red truck") but at a lower confidence score”). The motivation for the proposed combination is maintained.

Regarding claim 15, the proposed combination of Joo, Zhang and Lester teaches
wherein the analyzing the visual input (see Joo, [0262] “the query type of the second query component 6707 is the image query type”; [0263] “A metadata analysis component 6714 may identify metadata associated with the second query component… The metadata may include a text, which is input for identifying a query component to be used for a search, in an URL path or a relevant text such as a text which is located in a webpage or a text-based document or is located near corresponding information for information (for example, an image or the like) built therein. The second query component feature component 6722 may identify keyword features based on an output of the metadata analysis component”) for semantic and visual entity information includes: (see Lester, [0023] “the term “semantic concept” refers to the meaning used for understanding an object and/or environment of things... The term “semantic concept” can be interchangeably used with the term “visual word” which captures the semantic space of a thing, and may be the target of an image search query”; [0025] “identifying relationships between the images and corresponding visual words, once identified, are likely to indicate that the corresponding image is more likely to be relevant to content identified… as the image search query”).
performing an object identification on (see Lester, [0025] “to identify features in images containing representations of one or more objects such as foreground objects and background object”) the visual input; (see Joo, [0072] “a query includes… a second query component, in which the query type is an image, the query includes a plurality of query types”).
the filtering of the search results includes (see Lester, [0036] “filter the search results based on different types of metadata”) using at least the object identified… (see Lester, [0025] “to identify features in images containing representations of one or more objects such as foreground objects and background object”) to generate a second composite query, and (see Joo, [0211] “the combination query may include extension keywords generated from the query components”; [0207] “the query input device may generate a combination query based on a plurality of query components” – there are plurality of combination queries).
the search results (see Joo, [0212] “the search engine server 420 may perform a single search or the multimodal search according to the search mode, for processing the received query… the search engine server 420 may transmit a search result”; [0256] “The combination query 6610 may be processed by a plurality of the search methods 6620, thereby acquiring a search result”) are based on the second composite query (see Joo, [0211] “the combination query may include extension keywords generated from the query components”; [0207] “the query input device may generate a combination query based on a plurality of query components” – there are plurality of combination queries).
The proposed combination of Joo, Zhang and Lester does not explicitly teach determining if a first confidence level in the object identification satisfies a first condition, wherein; object identified that satisfies the first condition. 
However, Angel discloses confidence scores for object identification and also teaches
determining if a first confidence level in the object identification satisfies a first condition, wherein (see Angel, [col14 lines1-6] “image that may be processed by remote image processing system… object representation 502 is identified as a "car" object with first confidence score”; [col22 lines52-62] “the local image processing system identifies a dog object in a particular image at a first confidence score… the confidence scores reported for the object identifications by the local and remote image processing systems”; [col5 lines55-57] “Objects that are identified as faces but that are not known to the user may not be preferentially reported to the user unless certain conditions are met” – there are plurality of conditions). 
object identified that satisfies the first condition (see Angel, [col5 lines55-57] “Objects that are identified as faces but that are not known to the user may not be preferentially reported to the user unless certain conditions are met” – there are plurality of conditions). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include functionality of confidence levels related to object identification based on conditions as being disclosed and taught by Angel in the system taught by the proposed combination of Joo, Zhang and Lester to yield the predictable results of improving the efficiency of the object detection module (see Angel, [col10 lines2-8] “To further improve the efficiency of object detection module 230 as compared to other image processing systems, object detection module 230 may be implemented using a smaller amount of initial training data, which may, in turn, return the size of the neural network implemented by the computer vision system of object detection module 230”).


Regarding claim 16, the proposed combination of Joo, Zhang, Lester and Angel teaches
further comprising determining if a second confidence level in the object identification satisfies a second condition, wherein (see Angel, [col14 lines1-8] “image that may be processed by remote image processing system… object representation 504 is identified as a "traffic light" object with a second confidence score”; [col22 lines54-62] “but the remote image processing system identifies a Pug object in the same image… … the confidence scores reported for the object identifications by the local and remote image processing systems”).
the filtering of the search results includes (see Lester, [0036] “filter the search results based on different types of metadata”) using at least the object identified (see Lester, [0025] “to identify features in images containing representations of one or more objects such as foreground objects and background object”) that satisfies the second condition (see Angel, [col5 lines55-57] “Objects that are identified as faces but that are not known to the user may not be preferentially reported to the user unless certain conditions are met” – there are plurality of conditions) to generate a third composite query, and (see Joo, [0211] “the combination query may include extension keywords generated from the query components”; [0207] “the query input device may generate a combination query based on a plurality of query components” – there are plurality of combination queries).
the search results (see Joo, [0212] “the search engine server 420 may perform a single search or the multimodal search according to the search mode, for processing the received query… the search engine server 420 may transmit a search result”; [0256] “The combination query 6610 may be processed by a plurality of the search methods 6620, thereby acquiring a search result”) are based on the third composite query (see Joo, [0211] “the combination query may include extension keywords generated from the query components”; [0207] “the query input device may generate a combination query based on a plurality of query components” – there are plurality of combination queries). The motivation for the proposed combination is maintained.

Regarding claim 17, the proposed combination of Joo, Zhang, Lester and Angel teaches
wherein the second confidence level is higher than the first confidence level (see Angel, [col21 lines55-60] “the object identifications generated by local image processing system 406 may be more generic (e.g., identifying a "vehicle") and typically at a higher confidence score, while the object identifications generated by remote image processing system 404 may be more specific (e.g., identifying a "red truck") but at a lower confidence score”). The motivation for the proposed combination is maintained.

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Joo, Zhang, Lester and Angel in view of Aminzadeh et al. (US 2016/0055426 A1, hereinafter “Aminzadeh”).

Regarding claim 8, the proposed combination of Joo, Zhang, Lester and Angel teaches
wherein the first condition and second condition are (see Angel, [col5 lines55-57] “Objects that are identified as faces but that are not known to the user may not be preferentially reported to the user unless certain conditions are met” – there are plurality of conditions).
The proposed combination of Joo, Zhang, Lester and Angel does not explicitly teach that the conditions are configured by a user. 
However, Aminzadeh discloses training classifiers and also teaches
configured by a user (see Aminzadeh, [0026] “a user's specified criteria and produce a corresponding model, regardless of the user's expectations regarding the input and its relationships with outcomes”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include functionality of user specified criteria as being disclosed and taught by Aminzadeh in the system taught by the proposed combination of Joo, Zhang, Lester and Angel to yield the predictable results of improving computational efficiency, thereby making more efficient use of computing resources and improving the efficiency of existing resources (see Aminzadeh, [0049] “improve computational efficiency, thereby making more efficient use of computing resources and improving the efficiency of existing resources”).

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Joo, Zhang and Lester in view of Wolf et al. (US 2004/0044958 A1, hereinafter “Wolf”).

Regarding claim 11, the proposed combination of Joo, Zhang and Lester teaches
wherein the analyzing of the visual input uses metadata information (see Joo, [0262] “the query type of the second query component 6707 is the image query type”; [0263] “A metadata analysis component 6714 may identify metadata associated with the second query component… The metadata may include a text, which is input for identifying a query component to be used for a search, in an URL path or a relevant text such as a text which is located in a webpage or a text-based document or is located near corresponding information for information (for example, an image or the like) built therein. The second query component feature component 6722 may identify keyword features based on an output of the metadata analysis component”).
The proposed combination of Joo, Zhang and Lester does not explicitly teach that the analyzing of the visual input uses a multi-pass approach.
However, Wolf discloses multi-pass image analysis and also teaches
analyze images based on a multi-pass approach (see Wolf, [0039] “a multi-pass image analysis is performed wherein one or more portions of the electronic document are selected”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include functionality of multi-pass image analysis as being disclosed and taught by Wolf in the system taught by the proposed combination of Joo, Zhang and Lester to yield the predictable results of efficiently searching results (see Wolf, [0004] “when a computerized search engine is directed to search for documents that meet certain requirements, the search engine can more efficiently search the documents by scanning only the metadata tags associated with the documents instead of the entire documents”).

Claims 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Li (US 11,282,509 B1, hereinafter “Li”) in view of Zhang et al. (US 2007/0255755 A1, hereinafter “Zhang”) further in view of Iorio et al. (US 2014/0324823 A1, hereinafter “Iorio”).

Regarding claim 18, Li teaches
A method, comprising: (see Li, [col4 lines66-67] “Various embodiments of the present disclosure include systems, methods”). 
receiving, by the computing device, a video content; (see Li, [col5 lines2-7] “may receive a video comprising multiple image frames, and select a subset of the multiple image frames, where the multiple image frames are sequential frames of the video… may input the subset of the multiple image frames into a machine-learned model trained”; [col8 line5] “the computing devices 104”). 
receiving, by the computing device, a visual input that is associated with the video content; (see Li, [col5 lines2-7] “may receive a video comprising multiple image frames, and select a subset of the multiple image frames, where the multiple image frames are sequential frames of the video… may input the subset of the multiple image frames into a machine-learned model trained”; [col8 line5] “the computing devices 104”).
performing, by the computing device, an object identification on an image of the visual input; (see Li, [col5 lines5-12] “may input the subset of the multiple image frames into a machine-learned model trained to detect and classify objects in videos, and receive, from the machine-learned model, an object detected in the subset of the multiple image frames and a classification of the object across the subset of the multiple image frames”; [col8 line5] “the computing devices 104”). 
generating, by the computing device, second semantic information based on the object identification; and (see Li, [col3 lines4-15] “What the content is “about” may be based on a semantic representation of who and/or what is included in the content (e.g., a noun-related entity such as a person, animal, object, etc.), and what is being done in the content (e.g., a verb-related activity such as talking, running, fighting, sitting still, etc.), and in some cases may include descriptors of what is included and/or what is being done (e.g., an adjective- or adverb-related descriptor such as a color, size, shape, duration, etc.). An item of content may include multiple semantic representations, even for relatively simple items of content such as an image”; [col8 line5] “the computing devices 104”). 
storing, by the computing device, in a search data structure, (see Li, [col12 lines11-13] “an object or item of content may be stored in any suitable manner, such as, for example, in association with the object”; [col8 line5] “the computing devices 104”) the video content,… (see Li, [col1 lines58-59] “content can take a number of forms (e.g., text, image, video, audio, or a combination of these)”) and the second semantic information… (see Li, [col3 lines4-15] “What the content is “about” may be based on a semantic representation of who and/or what is included in the content (e.g., a noun-related entity such as a person, animal, object, etc.), and what is being done in the content (e.g., a verb-related activity such as talking, running, fighting, sitting still, etc.), and in some cases may include descriptors of what is included and/or what is being done (e.g., an adjective- or adverb-related descriptor such as a color, size, shape, duration, etc.). An item of content may include multiple semantic representations, even for relatively simple items of content such as an image”) the video content (see Li, [col1 lines58-59] “content can take a number of forms (e.g., text, image, video, audio, or a combination of these)”). 
Li does not explicitly teach generating first semantic information associated with a frame of the video content; storing the first semantic information, in association with the video content. 
However, Zhang discloses a video search engine and also teaches
generating first semantic information associated with a frame of the video content; (see Zhang, [0030] “Metadata-based modality 105 begins by obtaining training video metadata 115 (e.g., author information, tag information, domain information, title information, referring URL, abstract, keyword, description, etc.)… The training video metadata 115 for each video clip can be obtained from the video file itself or from various Internet sites linking to the video clip. A text processing component 120 generates text information from the video metadata 115, and forwards the text information”; [0064] “indexing and searching a video database using dual modalities and possibly query profiling… the video clips are categorized using dual modalities and indexed. The categorization may be implemented by a dual modality categorization model 170, e.g., a metadata-based video classification model 160 and a content-based video classification model 165”).
storing the first semantic information, (see Zhang, [0030] “Metadata-based modality 105 begins by obtaining training video metadata 115 (e.g., author information, tag information, domain information, title information, referring URL, abstract, keyword, description, etc.)… The training video metadata 115 for each video clip can be obtained from the video file itself or from various Internet sites linking to the video clip. A text processing component 120 generates text information from the video metadata 115, and forwards the text information”; [0064] “indexing and searching a video database using dual modalities and possibly query profiling… the video clips are categorized using dual modalities and indexed. The categorization may be implemented by a dual modality categorization model 170, e.g., a metadata-based video classification model 160 and a content-based video classification model 165” – video database stores the information).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include functionality of searching metadata information and search results including links related to video content as being disclosed and taught by Zhang in the system taught by Li to yield the predictable results of improving time/space performance and reducing the over-fitting problem by using feature selection methods for optimal searching (see Zhang, [0030] “generates a metadata-based video categorization model 160, which can be used to categorize video metadata on the Internet. The number of features may be large (e.g., dozens of thousands). To improve time/space performance and reduce the over-fitting problem, feature selection methods (such as mutual information) may be used and the optimal number of features determined by cross validation may be selected”).
The proposed combination of Li and Zhang does not explicitly teach storing the first semantic information, in association with the video content. 
However, Iorio discloses a search application and also teaches
stores videos in association with semantic information (see Iorio, [0046] “The stored semantic correlation information includes word-based descriptions of visual features as well as other metadata associated with the digital images and/or digital videos that are stored in image content database 141”). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include functionality of storing video information in association with semantic information along with image and metadata information as being disclosed and taught by Iorio in the system taught by the proposed combination of Li and Zhang to yield the predictable results of enabling fast and accurate location of a desired item in a collection of visual subject matter stored in a database (see Iorio, [0007] “One advantage of the disclosed technique is that it enables fast and accurate location of a desired item in a collection of visual subject matter stored in a database, such as an image or video. Because browsing for the desired item is based on visual features selected by a user, the user is not required to remember metadata associated with the item in order to efficiently locate the item”).

Regarding claim 19, the proposed combination of Li, Zhang and Iorio teaches
wherein the object identification uses a trained machine learned model (see Li, [col5 lines5-12] “may input the subset of the multiple image frames into a machine-learned model trained to detect and classify objects in videos, and receive, from the machine-learned model, an object detected in the subset of the multiple image frames and a classification of the object across the subset of the multiple image frames”).

Regarding claim 20, the proposed combination of Li, Zhang and Iorio teaches
wherein the performing of the object identification uses a trained machine learned model, (see Li, [col5 lines5-12] “may input the subset of the multiple image frames into a machine-learned model trained to detect and classify objects in videos, and receive, from the machine-learned model, an object detected in the subset of the multiple image frames and a classification of the object across the subset of the multiple image frames”).
the trained machine learned model generates classifiers for objects in the visual input, (see Li, [col20 lines5-11] “The video classifier 212 may, in some cases, also detect and/or classify objects in a video based on one or more colors in one or more pixels in the image frames of the video. For instance, the video classifier 212 may evaluate RGB (or other type of color model) values for individual pixels of the image frames of the video to detect and/or classify objects in a video”).
the generating of the semantic information includes generating text based on the classifiers for the objects and (see Li, [col3 lines4-15] “What the content is “about” may be based on a semantic representation of who and/or what is included in the content (e.g., a noun-related entity such as a person, animal, object, etc.), and what is being done in the content (e.g., a verb-related activity such as talking, running, fighting, sitting still, etc.), and in some cases may include descriptors of what is included and/or what is being done (e.g., an adjective- or adverb-related descriptor such as a color, size, shape, duration, etc.). An item of content may include multiple semantic representations, even for relatively simple items of content such as an image”; the text-based information includes descriptions such as nouns, verbs, adjectives, or adverbs).
the storing includes storing (see Li, [col12 lines11-13] “an object or item of content may be stored in any suitable manner, such as, for example, in association with the object”; [col8 line5] “the computing devices 104”) the image and metadata associated with the image in an image data store (see Iorio, [0046] “The stored semantic correlation information includes word-based descriptions of visual features as well as other metadata associated with the digital images and/or digital videos that are stored in image content database 141”). The motivation for the proposed combination is maintained. 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VAISHALI SHAH whose telephone number is (571)272-8532. The examiner can normally be reached Monday - Friday (7:30 AM to 4:00 PM).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, TAMARA KYLE, can be reached at (571)272-4241. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/VAISHALI SHAH/
Primary Examiner, Art Unit 2156