DETAILED ACTION	
In Applicant’s Response (RCE) dated 7/7/2021, Applicant amended claims 1-2, 4-12 and 14-20; and argued against all rejections previously set forth in the Office action dated 5/7/2021.
	Claims 1-2, 4-12 and 14-20 are pending in this case. 

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 7/7/2021 has been entered.

Response to Arguments
Applicant’s arguments were considered, but are moot in view of the new ground(s) of rejection.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 6, 8, 10, 11, 12, 16, 18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al., Pub. No.: 2016/0077794, in view of Filev et al., Pub. No.: 2008/0269958A1.
With regard to claim 1:
Kim discloses a method of detecting an instruction from a user, the method comprising: receiving, from the user of a user device, an audio input (paragraph 41: “FIG. 3 illustrates exemplary process 300 for dynamically adjusting a speech trigger threshold and triggering a virtual assistant. Process 300 can, for example, be executed on processor 204 of user device 102 discussed above with reference to FIG. 2. In other examples, processing modules 118 of server system 110 and processor 204 of user device 102 can be used together to execute some or all of process 300. At block 302, audio input can be received via a microphone, such as microphone 230 of user device 102. At block 304, the received audio input can be sampled. In one example, received audio input can be sampled at regular intervals, such as sampling a portion of the received audio every ten milliseconds. In other examples, received audio input can be continuously sampled or analyzed at various other intervals. It should be understood that always listening can include receiving and analyzing audio continuously or can include receiving and analyzing audio at regular intervals or at particular times.”); extracting at least one of a pitch of the audio input, an abrupt change in an intensity of the audio input (paragraph 69: “In still other examples, which of multiple devices triggers in response to detecting a trigger phrase can be determined based on comparing confidence levels among the various devices, comparing volume levels of speech received at various devices, comparing various other types of sensor data among the various devices, or the like. For example, if two or more devices could trigger based on a single utterance, before triggering, the devices could share sensor data to determine which of the devices should trigger and which should not.”);
calculating a confidence score based on the at least one of a pitch of the audio input, an abrupt change in an intensity of the audio input (paragraph 69: “In still other examples, which of multiple devices triggers in response to detecting a trigger phrase can be determined based on comparing confidence levels among the various devices, comparing volume levels of speech received at various devices, comparing various other types of sensor data among the various devices, or the like. For example, if two or more devices could trigger based on a single utterance, before triggering, the devices could share sensor data to determine which of the devices should trigger and which should not.”); and detecting the audio input as the instruction based on the confidence score exceeding a predetermined value (paragraph 48: “If the confidence level from block 306 does exceed the threshold as determined at decision block 308 (e.g., the "yes" branch), process 300 can proceed to block 310. At block 310, a virtual assistant can be triggered to receive a user command from the audio input. For example, with a confidence level above the threshold indicating that a speech trigger was likely uttered and received, a virtual assistant can be triggered to receive a user command from the audio input. In some examples, triggering the virtual assistant can include initiating a virtual assistant session with a user. A virtual assistant session can, for example, include a prompt (e.g., a played sound, displayed image, displayed text, illuminated indicator light, etc.) to notify the user that a trigger was recognized and the system is ready to receive a command. A user can then utter a command or request for the virtual assistant. As discussed above, user intent can then be determined based on the received user command or request, and the command associated with the determined user intent can be executed or a response to the request can be provided.”).
	Kim does not disclose the aspect of extracting a pitch of the audio input and an abrupt change in an intensity of the audio input. However, Filev discloses the aspect of extracting a pitch of the audio input and an abrupt change in an intensity of the audio input to make a determination on a received user speech (paragraph 88: “The prosodic analysis module 134 of FIG. 9B may use multi-parametric speech analysis algorithms to determine the occupant's affective state. For example, the specific features of the speech input, such as speech rate, pitch, pitch change rate, pitch variation, Teager energy operator, intensity, intensity change, articulation, phonology, voice quality, harmonics to noise ratio, or other speech characteristics, are computed. The change in these values compared with baseline values is used as input into a classifier algorithm which determines the emotion on either a continuous scale or as speech categories.”). It would have been obvious to one of ordinary skill in the art, at the time the filing was made, to apply Filev to Kim so the system can more accurately determine user voice input by considering more criteria and their attributes to determine that instruction has been 

With regard to claims 2 and 12:
Kim and Filev disclose the method of claim 1, wherein the extracting comprises extracting the intensity of the audio input, and wherein the calculating comprises calculating the confidence score based on the intensity of the audio input (Kim paragraph 69: “In still other examples, which of multiple devices triggers in response to detecting a trigger phrase can be determined based on comparing confidence levels among the various devices, comparing volume levels of speech received at various devices, comparing various other types of sensor data among the various devices, or the like. For example, if two or more devices could trigger based on a single utterance, before triggering, the devices could share sensor data to determine which of the devices should trigger and which should not.”).

With regard to claims 6 and 16:
Kim and Filev disclose the method of claim 1, further comprising: executing a task corresponding to the instruction (Kim paragraph 48: “If the confidence level from block 306 does exceed the threshold as determined at decision block 308 (e.g., the "yes" branch), process 300 can proceed to block 310. At block 310, a virtual assistant can be triggered to receive a user command from the audio input. For example, with a confidence level above the threshold indicating that a speech trigger was likely uttered and received, a virtual assistant can be triggered to receive a user command from the audio input. In some examples, triggering the virtual assistant can include initiating a virtual assistant session with a user. A virtual assistant session can, for example, include a prompt (e.g., a played sound, displayed image, displayed text, illuminated indicator light, etc.) to notify the user that a trigger was recognized and the system is ready to receive a command. A user can then utter a command or request for the virtual assistant. As discussed above, user intent can then be determined based on the received user command or request, and the command associated with the determined user intent can be executed or a response to the request can be provided.”). 

With regard to claims 8 and 18:
Kim and Filev disclose the aspect of detecting a plurality of audio inputs from a plurality of users; extracting a plurality of verbal audio cues or a plurality of non-verbal audio data corresponding to each of the plurality of users; calculating a plurality of confidence scores corresponding to the each of the plurality of users; and detecting a plurality of instructions corresponding to the plurality of confidence scores. (Kim paragraph 63: “In other examples, multiple devices that are nearby and in communication (e.g., via Bluetooth, Wi-Fi, or other communication channels) can share sensor data or trigger threshold information to adjust speech trigger thresholds for the various devices. FIG. 5 illustrates multi-device system 540, which can include interactions among the various devices to dynamically adjust a speech trigger threshold for one or more of the devices. In the illustrated example, a variety of devices and multiple users can be present, and speech trigger thresholds for various speech-enabled devices can be adjusted based on sensor information from the various devices. Multi-device system 540, for instance, can include multiple speech-enabled devices, such as TV set-top box 546, security system control panel 554, user device 102, and tablet computer 564. The various devices can be associated with a variety of different sensors and displays. TV set-top box 546, for example, can be connected to display 542 and camera 544, which can be used for facial recognition, gesture detection, presence detection, photography, filming, or the like. Security system control panel 554 can be associated with camera 556, which can likewise be used for facial recognition, gesture detection, presence detection, or the like. User device 102 and tablet computer 564 can include a variety of sensors that can likewise be used to detect user presence.”). 

With regard to claims 10 and 20:
Kim and Filev disclose the method of claim 1, further comprising: extracting verbal information from the audio input (Kim paragraph 41: “FIG. 3 illustrates exemplary process 300 for dynamically adjusting a speech trigger threshold and triggering a virtual assistant. Process 300 can, for example, be executed on processor 204 of user device 102 discussed above with reference to FIG. 2. In other examples, processing modules 118 of server system 110 and processor 204 of user device 102 can be used together to execute some or all of process 300. At block 302, audio input can be received via a microphone, such as microphone 230 of user device 102. At block 304, the received audio input can be sampled. In one example, received audio input can be sampled at regular intervals, such as sampling a portion of the received audio every ten milliseconds. In other examples, received audio input can be continuously sampled or analyzed at various other intervals. It should be understood that always listening can include receiving and analyzing audio continuously or can include receiving and analyzing audio at regular intervals or at particular times.”); determining a context of the verbal information (Kim paragraph 42: “At block 306, a confidence level can be determined that the sampled audio input comprises a portion of a spoken trigger. In one example, sampled portions of audio can be analyzed to determine whether they include portions of a speech trigger. Speech triggers can include any of a variety of spoken words or phrases that a user can utter to trigger an action (e.g., initiating a virtual assistant session). For example, a user can utter "hey Siri" to initiate a session with a virtual assistant referred to as "Siri." In other examples, a user can utter a designated assistant name or reference, such as "Assistant," "Siri," "Secretary," "Hi Assistant," "Hello Helper," or any other names or references.”); and calculating the confidence score based on the context of the verbal information (Kim paragraph 69: “In still other examples, which of multiple devices triggers in response to detecting a trigger phrase can be determined based on comparing confidence levels among the various devices, comparing volume levels of speech received at various devices, comparing various other types of sensor data among the various devices, or the like. For example, if two or more devices could trigger based on a single utterance, before triggering, the devices could share sensor data to determine which of the devices should trigger and which should not.”).

Claim 11 is rejected for the same reason as claim 1. 



Claims 4, 5, 14 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Kim, in view of Filev, and further in view of Kim et al. (hereinafter Kim2), Pub. No.: 2019/0019515A1.
With regard to claims 4 and 14:
Kim and Filev do not disclose the method of claim 1, further comprising: receiving a video input; extracting a video cue based on the video input; calculating the confidence score based on the video cue; and detecting the audio input or the video input as the instruction based on the confidence score exceeding the predetermined value.
However, Kim2 discloses the aspect of receiving a video input; extracting a video cue based on the video input; calculating the confidence score based on the video cue; and detecting the audio input or the video input as the instruction based on the confidence score exceeding the predetermined value (fig. 12, paragraph 79: “FIG. 12 is a flowchart of an optimum control method based on a gesture-speech multi-mode command according to one embodiment of the invention. Referring to FIG. 12, the electronic device 100 acquires an image of a user in the vicinity through the attached camera 112. When a reference body part (e.g., eye, nose, mouth, or the like) of the user is captured in the image, a body coordinate point thereof is determined, and a connection vector S extending from the body coordinate point to the user's finger and a motion vector M along which the finger is moved are determined. Here, when magnitudes of the connection vector S and the motion vector M, a time (tm) for which the user moves the finger, an angle (a) spatially formed by the connection vector S and the motion vector M, and a reference time point t0 satisfy a threshold condition stored in the memory system 150, it is judged that the user has requested a trigger for a voice command, and a speech trigger is generated to enable voice command recognition. When the user issues a voice command after switching to a speech recognition mode by the speech trigger, the speech of the user is recognized through the attached microphone 114 and the voice command is executed according to a result of the speech recognition. When the user intends to transmit a voice command to a specific electronic device, noises caused by other peripheral electronic devices may disturb the transmission. Therefore, it is possible to generate a command for muting or reducing sound of electronic devices other than the electronic device to which the voice command is to be transmitted.”).
It would have been obvious to one of ordinary skill in the art, at the time the filing was made, to apply Kim2 to Kim and Filev so the system can more precisely determine the confidence score based on video data, wherein the system can use the user’s gesture along with the user’s voice command to accurately determine the user’s command and intentions.

With regard to claims 5 and 15:
(Kim2 fig. 12, paragraph 79: “FIG. 12 is a flowchart of an optimum control method based on a gesture-speech multi-mode command according to one embodiment of the invention. Referring to FIG. 12, the electronic device 100 acquires an image of a user in the vicinity through the attached camera 112. When a reference body part (e.g., eye, nose, mouth, or the like) of the user is captured in the image, a body coordinate point thereof is determined, and a connection vector S extending from the body coordinate point to the user's finger and a motion vector M along which the finger is moved are determined. Here, when magnitudes of the connection vector S and the motion vector M, a time (tm) for which the user moves the finger, an angle (a) spatially formed by the connection vector S and the motion vector M, and a reference time point t0 satisfy a threshold condition stored in the memory system 150, it is judged that the user has requested a trigger for a voice command, and a speech trigger is generated to enable voice command recognition. When the user issues a voice command after switching to a speech recognition mode by the speech trigger, the speech of the user is recognized through the attached microphone 114 and the voice command is executed according to a result of the speech recognition. When the user intends to transmit a voice command to a specific electronic device, noises caused by other peripheral electronic devices may disturb the transmission. Therefore, it is possible to generate a command for muting or reducing sound of electronic devices other than the electronic device to which the voice command is to be transmitted.”).


Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Kim, in view of Filev, and further in view of Panainte et al., Pub. No.: 2015/0379987A1.
With regard to claims 7 and 17:
Kim and Filev do not disclose the method of claim 6, further comprising: receiving a second audio input during execution of the task; extracting a non-verbal audio cue or a verbal audio cue based on the second audio input; determining that the instruction is an intentional instruction based on the non-verbal audio cue or the verbal audio cue; and updating the confidence score based on determining that the instruction is the intentional instruction.
However, Panainte discloses the aspect of receiving a second audio input during execution of the task; extracting a non-verbal audio cue or a verbal audio cue based on the second audio input; determining that the instruction is an intentional instruction based on the non-verbal audio cue or the verbal audio cue; and updating the confidence score based on determining that the instruction is the intentional instruction (paragraph 17: “In an exemplary embodiment, a second confidence level is used to compare the representation of the received speech to the complete set of commands or names. The first confidence level and the second confidence level may be different. The first confidence level may be higher than the second confidence level. The method may include adaptively updating the first confidence level based on user feedback received at the microphone or received at a user input device. The adaptive updating may include raising the first confidence level in response to the user feedback indicating that results of first passes were not correct. The adaptive updating may further include lowering the first confidence level in response to the user feedback indicating that results of the first passes were correct. The method may further include maintaining a different partial set of commands or names for each of a plurality of different users or user devices of the voice recognition system. The partial set of commands or names for voice recognition include at least one of: (a) a set of most frequently used commands, (b) a set of most frequently used voice tags, and (c) a set of most frequently used contacts or phonebook names. The method may further include updating the partial set of commands or names for voice recognition in the first pass as the frequency of use changes for the commands or names.”).
It would have been obvious to one of ordinary skill in the art, at the time the filing was made, to apply Panainte to Kim and Filev so the system can more accurately predict the user command based on user feedback and adjust the threshold level to better determine user intention using the user feedback.

Claims 9 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Kim, in view of Filev, and further in view of Yoon, Pub. No.: 2018/0358013A1.
With regard to claims 9 and 19:
Kim and Filev do not disclose the aspect of allocating respective priorities to the plurality of instructions; and executing a plurality of tasks corresponding to the plurality of instructions based on the respective priorities.
However, Yoon discloses the aspect of allocating respective priorities to the plurality of instructions; and executing a plurality of tasks corresponding to the plurality of instructions based on the respective priorities (paragraph 207: “The above-mentioned method for selecting the task based of the voice command may decide the task and the task processing order according to the priority of each task and information indicating whether the uttering person is the current driver or the previous driver. Therefore, even when voice commands uttered by the plurality of users are input to the apparatus, a relatively high -priority task may be primarily processed.”). It would have been obvious to one of ordinary skill in the art, at the time the filing was made, to apply Yoon to Kim and Filev so the system can prioritize user commands, perform the most important commands first, and accurately perform tasks based on multiple users’ differing degrees of need or authority levels.

Pertinent Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Khosravy, Pub. No.: 2014/0104197A1: An expression of voice and speech can be sensed and detected for one or more interactions, and analyzed for user intensity based on volume (loudness), change in volume, speed of speech, pitch of voice (e.g., high, low) as levels of stress, shortness of vocal tones, sustained vocal tones, and so on.

	
Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to DI XIAO whose telephone number is (571)270-1758.  The examiner can normally be reached 9 AM-5 PM EST, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Renee Chavez, can be reached at (571)270-1104.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access 






/DI XIAO/Primary Examiner, Art Unit 2179