DETAILED ACTION

Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments/Amendments
2.	With respect to the Claim Objection to claim 12, the improper dependency has been corrected. The claim objection is withdrawn. 
 	With respect to the Claim Rejection under 35 U.S.C. § 101 of claim 19, amended claim 19 overcomes the 101 rejection. Consequently, the 101 rejection is withdrawn. 
 	With respect to the Claim Rejections under 35 U.S.C. § 103, Applicant’s arguments filed on 03/18/2021 have been fully considered but are moot in view of the new ground(s) of rejection.

Information Disclosure Statement
3.	The information disclosure statement (IDS) submitted on 02/23/2021 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 103
4.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

5.	Claims 1, 6, 7, 10, 15, 16, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Kale et al. (US 2018/0107685 A1) in view of Park et al. (US 2016/0098138 A1). 

	With respect to Claim 1, Kale et al. disclose
message to user of a device (Kale et al. [0010] provide intelligent, personalized answers in predictive turns of communication between a human user and an intelligent online personal assistant), 
 	the method comprising: 
 	capturing, via a camera of the device, an image including at least one object (Kale et al. [0088] The input query image may comprise a photograph, a video frame, a sketch, or a diagram, for example. The input query image is typically a digital image file such as may be produced by a portable camera or smartphone, or such as may be copied from a web site or an electronic message, Fig. 10 element 1002); 
 	receiving, via the microphone, a user's voice input indicating a question about an object in the image (Kale et al. [0104] The visual search service 800 performs a visual search based on the image signature it generates for input query image 1002 of a dress, [0105] a user also provides a natural language utterance (e.g., text or voice converted to text) input question, [0106] The parsed input question “How about this for a formal dinner party?”, [0074] audio input component (e.g., a microphone), Fig. 10 elements 1002 and 1004); 
 	identifying the object corresponding to the received user's voice input from among the at least one object in the image, based on the image and the user's voice input (Kale et al. [0106] The parsed input question “How about this for a formal dinner party?” contains the term “formal” that may be recognized as an aspect value 820 that also appears in the knowledge graph, perhaps for the aspect “style”. The term "dinner party" may also be recognized, perhaps as an aspect value for the aspect “occasion” in the knowledge graph. The “bot” therefore identifies the user's intent from its processing of the natural language utterance. It further recognizes that the visual search results do not strongly correlate with candidate product images for “formal” and “dinner party” dresses. This mismatch would lead to a great deal of filtering, perhaps such that no acceptable candidate products may be found); 
 	obtaining at least one of user intonation information and user emotion information by analyzing the user's voice input (Kale et al. [0060] The speaker adaptation components allows the speech recognition component 210 (and consequently the artificial intelligence framework 128) to be robust to speaker variations by continuously learning the idiosyncrasies of a user’s intonation, pronunciation, accent, and other speech factor, and apply these to the speech-dependent components, e.g., feature extraction component, and the acoustic model component.); 
 	determining an intention of the user with respect to the identified object based on the obtained at least one of user intonation information and user emotion information (Kale et al. [0042] Input modalities for the AI orchestrator 206 may be derived from a computer vision component 208, a speech recognition component 210, and a text normalization component which may form part of the speech recognition component 210, for example. The computer vision component 208 may identify objects and attributes from visual input (e.g., a photo). The speech recognition component 210 may convert audio signals (e.g., spoken utterances) into text, [0060] The speaker adaptation components allows the speech recognition component 210 (and consequently the artificial intelligence framework 128) to be robust to speaker variations by continuously learning the idiosyncrasies of a user’s intonation, pronunciation, accent, and other speech factor, and apply these to the speech-dependent components, e.g., feature extraction component, and the acoustic model component); and 
 	providing a response regarding the identified object, based on the determined intention of the user and the identified object (Kale et al. Fig. 10 element 1006 “That looks casual. How about these more elegant choices?”, [0107] The knowledge graph service 822 therefore elects to generate both a statement type user prompt and a question type user prompt at 824 to solicit further user input. The statement type user prompt “That looks casual.” therefore denotes the conflicting results between the visual search and the knowledge graph search.)
	Kale et al. fail to explicitly teach 
activating a microphone of the device while the image is being captured;
	However, Park et al. teach 
 	activating a microphone of the device while the image is being captured (Park et al. [0534] when a picture image is captured by the wearable device 200...the microphone 202 b of the wearable device 200 may be activated.);
 	Kale et al. and Park et al. are analogous art because they are from a similar field of endeavor in the Signal Processing algorithm and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of providing the answer to the user by identifying the object in the input query image and the audio input from the user as taught by Kale et al., using teaching of activating the microphone as taught by Park et al. for the benefit of processing external sound signal (Park et al. [0183] The microphone 202 b processes a user’s voice signal input during a call process as electrical voice data.)

 	With respect to Claim 6, Kale et al. in view of Park et al. teach 
 	further comprising: 
 	extracting text data included in the received voice input (Kale et al. [0042] The speech recognition component 210 may convert audio signal (e.g., spoken utterances) into text),
 	wherein the providing of the response regarding the identified object comprises providing a response regarding the identified object, based on the extracted text data and the intention of the user (Kale et al. Fig. 10 elements 1002, 1004, 1006.)

 	With respect to Claim 7, Kale et al. in view of Park et al. teach 
 	wherein the providing of the response regarding the identified object comprises: 
 	generating a search word based on the intention of the user (Kale et al. Fig. 10 element 1006), and 
 	providing a search result obtained by performing a search using the search word together with the response regarding the identified object (Kale et al. Fig. 10 elements 1006, 1012).  

 	With respect to Claim 10, Kale et al. disclose
A device for providing a response to a user's voice input, the device comprising: 
 	an input unit configured to receive an image including at least one object captured through a camera of the device (Kale et al. [0088] The input query image may comprise a photograph, a video frame, a sketch, or a diagram, for example. The input query image is typically a digital image file such as may be produced by a portable camera or smartphone, or such as may be copied from a web site or an electronic message, Fig. 10 element 1002) and receive the user's voice input for the object inputted through a microphone of the device, wherein the user's voice input indicates a question about an object in the image (Kale et al. [0104] The visual search service 800 performs a visual search based on the image signature it generates for input query image 1002 of a dress, [0105] a user also provides a natural language utterance (e.g., text or voice converted to text) input question, [0106] The parsed input question “How about this for a formal dinner party?”, [0074] audio input component (e.g., a microphone), Fig. 10 elements 1002 and 1004); 
 	a memory storing at least one instruction (Kale et al. [0030] random-access memory (RAM), read-only memory (ROM)); and 
 	at least one processor configured to execute the at least one instruction stored in the memory, wherein the at least one processor is further configured to execute the at least one instruction to (Kale et al. [0030] “MACHINE-READABLE MEDIUM” in this context refers to a component, device or other tangible media able to store instructions and data... (RAM), read-only memory (ROM)...when executed by one or more processors of the machine):  
 	identify the object corresponding to the received user's voice input from among the at least one object in the image, based on the image and the user's voice input (Kale et al. [0106] The parsed input question “How about this for a formal dinner party?” contains the term “formal” that may be recognized as an aspect value 820 that also appears in the knowledge graph, perhaps for the aspect “style”. The term "dinner party" may also be recognized, perhaps as an aspect value for the aspect “occasion” in the knowledge graph. The “bot” therefore identifies the user's intent from its processing of the natural language utterance. It further recognizes that the visual search results do not strongly correlate with candidate product images for “formal” and “dinner party” dresses. This mismatch would lead to a great deal of filtering, perhaps such that no acceptable candidate products may be found), 
 	obtain at least one of user intonation information and user emotion information by analyzing the user's voice input (Kale et al. [0060] The speaker adaptation components allows the speech recognition component 210 (and consequently the artificial intelligence framework 128) to be robust to speaker variations by continuously learning the idiosyncrasies of a user’s intonation, pronunciation, accent, and other speech factor, and apply these to the speech-dependent components, e.g., feature extraction component, and the acoustic model component.), 
 	determine an intention of the user with respect to the identified object based on the obtained at least one of user intonation information and user emotion information (Kale et al. [0042] Input modalities for the AI orchestrator 206 may be derived from a computer vision component 208, a speech recognition component 210, and a text normalization component which may form part of the speech recognition component 210, for example. The computer vision component 208 may identify objects and attributes from visual input (e.g., a photo). The speech recognition component 210 may convert audio signals (e.g., spoken utterances) into text, [0060] The speaker adaptation components allows the speech recognition component 210 (and consequently the artificial intelligence framework 128) to be robust to speaker variations by continuously learning the idiosyncrasies of a user’s intonation, pronunciation, accent, and other speech factor, and apply these to the speech-dependent components, e.g., feature extraction component, and the acoustic model component), and 
 	provide a response regarding the identified object based on the determined intention of the user and the identified object (Kale et al. Fig. 10 element 1006 “That looks casual. How about these more elegant choices?”, [0107] The knowledge graph service 822 therefore elects to generate both a statement type user prompt and a question type user prompt at 824 to solicit further user input. The statement type user prompt “That looks casual.” therefore denotes the conflicting results between the visual search and the knowledge graph search.)
	Kale et al. fail to explicitly teach 
activate the microphone of the device while the image is being captured, 
However, Park et al. teach 
 	activate the microphone of the device while the image is being captured, (Park et al. [0534] when a picture image is captured by the wearable device 200...the microphone 202 b of the wearable device 200 may be activated.)
 	Kale et al. and Park et al. are analogous art because they are from a similar field of endeavor in the Signal Processing algorithm and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of providing the answer to the user by identifying the object in the input query image and the audio input from the user as taught by Kale et al., using teaching of activating the microphone as taught by Park et al. for the benefit of processing external sound signal (Park et al. [0183] The microphone 202 b processes a user’s voice signal input during a call process as electrical voice data.)

 	With respect to Claim 15, Kale et al. in view of Park et al. teach 
 	wherein the at least one processor is further configured to execute the at least one instruction to: 
 	extract text data included in the received voice input (Kale et al. [0042] The speech recognition component 210 may convert audio signal (e.g., spoken utterances) into text), and 
 	provide a response regarding the identified object based on the extracted text data and the intention of the user (Kale et al. Fig. 10 elements 1002, 1004, 1006.)

 	With respect to Claim 16, Kale et al. in view of Park et al. teach 
 	wherein the at least one processor is further configured to execute the at least one instruction to: 
 	generate a search word based on the intention of the user (Kale et al. Fig. 10 element 1006); and 
 	provide a search result obtained by performing a search using the search word together with the response regarding the identified object (Kale et al. Fig. 10 elements 1006, 1012).  

With respect to Claim 19, claim 19 recites “A non-transitory computer-readable recording medium having recorded thereon a program for executing the method of claim 1 in a computer.” Thus, claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Kale et al. (US 2018/0107685 A1) in view of Park et al. (US 2016/0098138 A1) on the same ground as claim 1. 

6.	Claims 2 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Kale et al. (US 2018/0107685 A1) in view of Park et al. (US 2016/0098138 A1) as applied to Claims 1 and 10 respectively, and further in view of Tanaka (US 2016/0180833 A1). 

	With respect to Claim 2, Kale et al. in view of Park et al. teach 
 	wherein the determining of the intention of the user comprises determining the intention of the user with respect to the identified object by using the generated intonation information (Kale et al. [0042] Input modalities for the AI orchestrator 206 may be derived from a computer vision component 208, a speech recognition component 210, and a text normalization component which may form part of the speech recognition component 210, for example. The computer vision component 208 may identify objects and attributes from visual input (e.g., a photo). The speech recognition component 210 may convert audio signals (e.g., spoken utterances) into text, [0060] The speaker adaptation components allows the speech recognition component 210 (and consequently the artificial intelligence framework 128) to be robust to speaker variations by continuously learning the idiosyncrasies of a user’s intonation, pronunciation, accent, and other speech factor, and apply these to the speech-dependent components, e.g., feature extraction component, and the acoustic model component.)
	Kale et al. in view of Park et al. fail to explicitly teach 
 	wherein the obtaining of the at least one of user intonation information and user emotion information comprises generating user intonation information of the user by analyzing at least one of voice energy (dB), sound pitch (Hz), shimmer of a voice waveform, and a change rate (zitter) of vocal fold vibration, and 
	However, Tanaka teaches 
 	wherein the obtaining of the at least one of user intonation information and user emotion information comprises generating user intonation information of the user by analyzing at least one of voice energy (dB), sound pitch (Hz), shimmer of a voice waveform, and a change rate (zitter) of vocal fold vibration, and (Tanaka [0057] the intonation information included in the target prosody is generated by extracting gradual changes in the power and pitch from the target prosody.)
 	Kale et al., Park et al. and Tanaka are analogous art because they are from a similar field of endeavor in the Signal Processing algorithm and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of providing the answer to the user by identifying the object in the input query image and the audio input from the user as taught by Kale et al., using teaching of activating the microphone as taught by Park et al. for the benefit of processing external sound signal, using teaching of extracting gradual changes in the power and pitch as taught by Tanaka for the benefit of generating the intonation information (Tanaka [0057] the intonation information included in the target prosody is generated by extracting gradual changes in the power and pitch from the target prosody.)

With respect to Claim 11, Kale et al. in view of Park et al. teach 
 	wherein the at least one processor is further configured to execute the at least one instruction to:  
 	determine the intention of the user with respect to the identified object by using the generated intonation information (Kale et al. [0042] Input modalities for the AI orchestrator 206 may be derived from a computer vision component 208, a speech recognition component 210, and a text normalization component which may form part of the speech recognition component 210, for example. The computer vision component 208 may identify objects and attributes from visual input (e.g., a photo). The speech recognition component 210 may convert audio signals (e.g., spoken utterances) into text, [0060] The speaker adaptation components allows the speech recognition component 210 (and consequently the artificial intelligence framework 128) to be robust to speaker variations by continuously learning the idiosyncrasies of a user’s intonation, pronunciation, accent, and other speech factor, and apply these to the speech-dependent components, e.g., feature extraction component, and the acoustic model component.)
	Kale et al. in view of Park et al. fail to explicitly teach 
 	generate intonation information of the user by analyzing at least one of voice energy (dB), sound pitch (Hz), shimmer of a voice waveform, and a change rate (zitter) of vocal fold vibration, and 
 	However, Tanaka teaches
 	generate intonation information of the user by analyzing at least one of voice energy (dB), sound pitch (Hz), shimmer of a voice waveform, and a change rate (zitter) of vocal fold vibration, and (Tanaka [0057] the intonation information included in the target prosody is generated by extracting gradual changes in the power and pitch from the target prosody.)
 	Kale et al., Park et al. and Tanaka are analogous art because they are from a similar field of endeavor in the Signal Processing algorithm and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of providing the answer to the user by identifying the object in the input query image and the audio input from the user as taught by Kale et al., using teaching of activating the microphone as taught by Park et al. for the benefit of processing external sound signal, using teaching of extracting gradual changes in the power and pitch as taught by Tanaka for the benefit of generating the intonation information (Tanaka [0057] the intonation information included in the target prosody is generated by extracting gradual changes in the power and pitch from the target prosody.)

7.	Claims 3 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Kale et al. (US 2018/0107685 A1) in view of Park et al. (US 2016/0098138 A1) and Tanaka (US 2016/0180833 A1) as applied to Claims 2 and 11 respectively, and further in view of Napolitano (US 2017/0068423 A1). 

	With respect to Claim 3, Kale et al. in view of Park et al. and Tanaka teach
	wherein the determining of the intention of the user comprises determining the intention of the user with respect to the identified object based on the generated intonation information and the generated emotion information (Kale et al. [0060] The speaker adaptation components allows the speech recognition component 210 (and consequently the artificial intelligence framework 128) to be robust to speaker variations by continuously learning the idiosyncrasies of a user’s intonation, pronunciation, accent, and other speech factor, and apply these to the speech-dependent components, e.g., feature extraction component, and the acoustic model component, [0042] The speech recognition component 210 may convert audio signal (e.g., spoken utterances) into text. A text normalization component may operate to make input normalization, such as language normalization by rendering emoticons into text, for example. Examiner notes that Kale et al. produces a response based on the identified object in the image, the intonation and the emotion of the user);
	Kale et al. in view of Park et al. and Tanaka fail to explicitly teach
 	wherein the obtaining of the at least one of user intonation information and user emotion information comprises generating emotion information of the user by analyzing the generated intonation information, and 
	However, Napolitano teaches 
 	wherein the obtaining of the at least one of user intonation information and user emotion information comprises generating emotion information of the user by analyzing the generated intonation information, and  (Napolitano et al. [0212] the user intent can be determined based on prosody information derived from the user utterance in the sample audio data. In particular, prosody information (e.g., tonality, rhythm, volume, stress, intonation, speech, etc.) can be derived from the user utterance to determine the attitude, mood, emotion, or sentiment of the user.)
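 	Kale et al., Park et al., Tanaka and Napolitano are analogous art because they are from a similar field of endeavor in the Signal Processing algorithm and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of providing the answer to the user by identifying the object in the input query image and the audio input from the user as taught by Kale et al., using teaching of activating the microphone as taught by Park et al. for the benefit of processing external sound signal, using teaching of extracting gradual changes in the power and pitch as taught by Tanaka for the benefit of generating the intonation information, using teaching of the intonation as taught by Napolitano for the benefit of determining the emotion of the user (Napolitano et al. [0212] the user intent can be determined based on prosody information derived from the user utterance in the sample audio data. In particular, prosody information (e.g., tonality, rhythm, volume, stress, intonation, speech, etc.) can be derived from the user utterance to determine the attitude, mood, emotion, or sentiment of the user.)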

	With respect to Claim 12, Kale et al. in view of Park et al. and Tanaka teach
 	wherein the at least one processor is further configured to execute the at least one instruction to:  
 	determine the intention of the user with respect to the identified object, based on the generated intonation information and the generated emotion information (Kale et al. [0060] The speaker adaptation components allows the speech recognition component 210 (and consequently the artificial intelligence framework 128) to be robust to speaker variations by continuously learning the idiosyncrasies of a user’s intonation, pronunciation, accent, and other speech factor, and apply these to the speech-dependent components, e.g., feature extraction component, and the acoustic model component, [0042] The speech recognition component 210 may convert audio signal (e.g., spoken utterances) into text. A text normalization component may operate to make input normalization, such as language normalization by rendering emoticons into text, for example. Examiner notes that Kale et al. produces a response based on the identified object in the image, the intonation and the emotion of the user);
	Kale et al. in view of Park et al. and Tanaka fail to explicitly teach
 	generate emotion information of the user by analyzing the generated intonation information, and 
	However, Napolitano teaches 
 	generate emotion information of the user by analyzing the generated intonation information, and (Napolitano et al. [0212] the user intent can be determined based on prosody information derived from the user utterance in the sample audio data. In particular, prosody information (e.g., tonality, rhythm, volume, stress, intonation, speech, etc.) can be derived from the user utterance to determine the attitude, mood, emotion, or sentiment of the user.)
 	Kale et al., Park et al., Tanaka and Napolitano are analogous art because they are from a similar field of endeavor in the Signal Processing algorithm and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of providing the answer to the user by identifying the object in the input query image and the audio input from the user as taught by Kale et al., using teaching of activating the microphone as taught by Park et al. for the benefit of processing external sound signal, using teaching of extracting gradual changes in the power and pitch as taught by Tanaka for the benefit of generating the intonation information, using teaching of the intonation as taught by Napolitano for the benefit of determining the emotion of the user (Napolitano et al. [0212] the user intent can be determined based on prosody information derived from the user utterance in the sample audio data. In particular, prosody information (e.g., tonality, rhythm, volume, stress, intonation, speech, etc.) can be derived from the user utterance to determine the attitude, mood, emotion, or sentiment of the user.)

8.	Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Kale et al. (US 2018/0107685 A1) in view of Park et al. (US 2016/0098138 A1) as applied to Claims 1 and 10 respectively, and further in view of Kim (US 2014/0142953 A1). 

	With respect to Claim 8, Kale et al. in view of Park et al. teach all the limitations of Claim 1 upon which Claim 8 depends. Kale et al. in view of Park et al. fail to explicitly teach 
 	further comprising: 
 	displaying the image including the at least one object, and 
 	wherein the activating of the microphone of the device comprises activating the microphone of the device while the image is displayed. 
	However, Kim et al. teach 

 	displaying the image including the at least one object (Kim et al. Fig. 6c, [0132] Even if the display object is displayed on the home screen, if the corresponding object is selected or while the corresponding object is displayed, the controller 180 activates the microphone, recognizes a user voice in addition, and then resumes the interrupted voice recognition task), and 
 wherein the activating of the microphone of the device comprises activating the microphone of the device while the image is displayed (Kim et al. Fig. 6c, [0132] Even if the display object is displayed on the home screen, if the corresponding object is selected or while the corresponding object is displayed, the controller 180 activates the microphone, recognizes a user voice in addition, and then resumes the interrupted voice recognition task.)  
Kale et al., Park et al. and Kim et al. are analogous art because they are from a similar field of endeavor in the Signal Processing algorithm and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of providing the answer to the user by identifying the object in the input query image and the audio input from the user as taught by Kale et al., using teaching of activating the microphone as taught by Park et al. for the benefit of processing external sound signal, using teaching of activating the microphone while the object is displayed on the screen as taught by Kim et al. for the benefit of recognizing the user voice in addition and resuming the interrupted voice recognition task (Kim et al. Fig. 6c, [0132] Even if the display object is displayed on the home screen, if the corresponding object is selected or while the corresponding object is displayed, the controller 180 activates the microphone, recognizes a user voice in addition, and then resumes the interrupted voice recognition task.)  

 	With respect to Claim 17, Kale et al. in view of Park et al. teach all the limitations of Claim 10 upon which Claim 17 depends. Kale et al. in view of Park et al. fail to explicitly teach 
 	further comprising: 
 	a display displaying the image including the at least one object, and 
 	wherein the at least one processor is further configured to execute the at least one instruction to activate the microphone of the device while the image is displayed. 
	However, Kim et al. teach 

 	further comprising: 
 	a display displaying the image including the at least one object (Kim et al. Fig. 6c, [0132] Even if the display object is displayed on the home screen, if the corresponding object is selected or while the corresponding object is displayed, the controller 180 activates the microphone, recognizes a user voice in addition, and then resumes the interrupted voice recognition task), and 
 wherein the at least one processor is further configured to execute the at least one instruction to activate the microphone of the device while the image is displayed (Kim et al. Fig. 6c, [0132] Even if the display object is displayed on the home screen, if the corresponding object is selected or while the corresponding object is displayed, the controller 180 activates the microphone, recognizes a user voice in addition, and then resumes the interrupted voice recognition task.)  
Kale et al., Park et al. and Kim et al. are analogous art because they are from a similar field of endeavor in the Signal Processing algorithm and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of providing the answer to the user by identifying the object in the input query image and the audio input from the user as taught by Kale et al., using teaching of activating the microphone as taught by Park et al. for the benefit of processing external sound signal, using teaching of activating the microphone while the object is displayed on the screen as taught by Kim et al. for the benefit of recognizing the user voice in addition and resuming the interrupted voice recognition task (Kim et al. Fig. 6c, [0132] Even if the display object is displayed on the home screen, if the corresponding object is selected or while the corresponding object is displayed, the controller 180 activates the microphone, recognizes a user voice in addition, and then resumes the interrupted voice recognition task.)  

9.	Claims 9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Kale et al. (US 2018/0107685 A1) in view of Park et al. (US 2016/0098138 A1) and Kim (US 2014/0142953 A1) as applied to Claims 8 and 17 respectively, and further in view of Rifkin et al. (US 2017/0041523 A1). 

With respect to Claim 9, Kale et al. in view of Park et al. and Kim et al. teach all the limitations of Claim 8 upon which Claim 9 depends. Kale et al. in view of Park et al. and Kim et al. fail to explicitly teach 
 	further comprising: 
receiving the user's input for selecting a portion of the image that is displayed, and
wherein the identifying of the object comprises identifying the object in the selected portion. 
However, Rifkin et al. teach 
 	further comprising: 
receiving the user's input for selecting a portion of the image that is displayed (Rifkin et al. [0006] obtaining a transcription of the audio data, [0027] In Fig. 1D, the subject 104 is a landmark 105c and single person 105d, and the speech command 106 is “Take pictures of me and the Eiffel Tower.” In this example, the camera may identify that the speaker is the person 105d and that the landmark 105c is the Eiffel Tower. The example interpretation 110 shows that the camera 102 has identified the person 105d as “John” and the landmark 105c is the Eiffel Tower.), and
 	wherein the identifying of the object comprises identifying the object in the selected portion (Rifkin et al. [0006] controlling a future operation on the device based at least on (i) the one or more objects identified in the image data, and (ii) the transcription of the audio data.)
Kale et al., Park et al., Kim et al. and Rifkin et al. are analogous art because they are from a similar field of endeavor in signal processing algorithms and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of providing the answer to the user by identifying the object in the input query image and the audio input from the user as taught by Kale et al., using the teaching of activating the microphone as taught by Park et al. for the benefit of processing an external sound signal, using the teaching of activating the microphone while the object is displayed on the screen as taught by Kim et al. for the benefit of recognizing the user voice in addition and resuming the interrupted voice recognition task, and using the teaching of identifying one or more objects in the image data as taught by Rifkin et al. for the benefit of controlling an operation in response to detecting one or more objects in the image data and the transcription of the voice data (Rifkin et al. [0006] obtaining, by a device, (i) image data and (ii) audio data; identifying one or more objects in the image data; obtaining a transcription of the audio data; and controlling a future operation of the device based at least on (i) the one or more objects identified in the image data, and (ii) the transcription of the audio data.)  
 	
 With respect to Claim 18, Kale et al. in view of Park et al. and Kim et al. teach all the limitations of Claim 17 upon which Claim 18 depends. Kale et al. in view of Park et al. and Kim et al. fail to explicitly teach 
  	wherein the at least one processor is further configured to execute the at least one instruction to: 
 	receive the user's input for selecting a portion of the image that is displayed; and 
 	identify the object in the selected portion. 
	However, Rifkin et al. teach 
 	wherein the at least one processor is further configured to execute the at least one instruction to: 
 	receive the user's input for selecting a portion of the image that is displayed (Rifkin et al. [0006] obtaining a transcription of the audio data, [0027] In Fig. 1D, the subject 104 is a landmark 105c and single person 105d, and the speech command 106 is “Take pictures of me and the Eiffel Tower.” In this example, the camera may identify that the speaker is the person 105d and that the landmark 105c is the Eiffel Tower. The example interpretation 110 shows that the camera 102 has identified the person 105d as “John” and the landmark 105c is the Eiffel Tower); and 
 	identify the object in the selected portion (Rifkin et al. [0006] controlling a future operation on the device based at least on (i) the one or more objects identified in the image data, and (ii) the transcription of the audio data.)
 	Kale et al., Park et al., Kim et al. and Rifkin et al. are analogous art because they are from a similar field of endeavor in signal processing algorithms and applications. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the steps of providing the answer to the user by identifying the object in the input query image and the audio input from the user as taught by Kale et al., using the teaching of activating the microphone as taught by Park et al. for the benefit of processing an external sound signal, using the teaching of activating the microphone while the object is displayed on the screen as taught by Kim et al. for the benefit of recognizing the user voice in addition and resuming the interrupted voice recognition task, and using the teaching of identifying one or more objects in the image data as taught by Rifkin et al. for the benefit of controlling an operation in response to detecting one or more objects in the image data and the transcription of the voice data (Rifkin et al. [0006] obtaining, by a device, (i) image data and (ii) audio data; identifying one or more objects in the image data; obtaining a transcription of the audio data; and controlling a future operation of the device based at least on (i) the one or more objects identified in the image data, and (ii) the transcription of the audio data.)  

Allowable Subject Matter
10.	Claims 4, 13 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.  
The following is an examiner’s statement of reasons for allowance: The prior art(s) taken alone or in combination fail(s) to teach the following element(s) in combination with the other recited elements in the claim(s).
 	“determining a background knowledge level of the user corresponding to the determined type of the language, 
 	wherein the providing of the response regarding the identified object comprises providing a response regarding the identified object based on the determined background knowledge level of the user, by using the determined type of the language.” as recited in Claim 4. 
	“determine a background knowledge level of the user corresponding to the determined type of the language, and 
 	provide a response regarding the identified object based on the determined background knowledge level of the user, by using the determined type of the language.” as recited in Claim 13. 

 	Claims 5, 14 are objected to as being dependent upon objected claims. 

Conclusion
11.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
a.	Shimota et al. (US 2013/0282360 A1). Shimota et al. disclose a method/a system for visual searching. 
b.	Solem et al. (US 2013/0346068 A1). Solem et al. disclose a method/a system for providing a digital photograph of a real-world scene. 
c.	Schott et al. (US 2003/0033266 A1). Schott et al. disclose a method/a system for providing a human answer in response to a human question. 

12.	Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

13.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to THUYKHANH LE whose telephone number is (571)272-6429.  The examiner can normally be reached on Mon-Fri: 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, RICHEMOND DORVIL can be reached on 571-272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.





/THUYKHANH LE/Primary Examiner, Art Unit 2658