DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This Office Action is in response to correspondence filed 08 October 2020 in reference to application 17/066,228.  Claims 1-30 are pending and have been examined.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.

Claims 1-30 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-28, 32, 39, and 43 of U.S. Patent No. 10,818,288. Although the claims at issue are not identical, they are not patentably distinct from each other because the claims of Patent 10,818,288 anticipate the instant claims as laid out in the chart below.
Instant Application
US Patent 10,818,288
Claim 1: An electronic device, comprising: 
Claim 1: An electronic device, comprising:
one or more processors; 
one or more processors; 
a microphone; and 
a microphone; and 
memory storing one or more programs configured to be executed by the one or more processors, the one or more programs including instructions for: 
memory storing one or more programs configured to be executed by the one or more processors, the one or more programs including instructions for:
receiving, via the microphone, a first audio stream including one or more utterances; 
receiving, via the microphone, a first audio stream including one or more utterances;
determining whether the first audio stream includes a lexical trigger; 
determining whether the first audio stream includes a lexical trigger;
in accordance with a determination that the first audio stream includes the lexical trigger, generating one or more candidate text representations of the one or more utterances; 
in accordance with a determination that the first audio stream includes the lexical trigger, generating one or more candidate text representations of the one or more utterances;
determining whether at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant based on sensory data obtained from one or more sensors of the electronic device; 
determining whether at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant; Claim 13: estimating, based on sensory data, the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant
in accordance with a determination that at least one candidate text representation is to be disregarded by the virtual assistant, generating one or more candidate intents based on candidate text representations of the one or more candidate text representations other than the to be disregarded at least one candidate text representation; 
Claim 1: in accordance with a determination that at least one candidate text representation is to be disregarded by the virtual assistant, generating one or more candidate intents based on candidate text representations of the one or more candidate text representations other than the to be disregarded at least one candidate text representation,…
determining whether the one or more candidate intents include at least one actionable intent; 
determining whether the one or more candidate intents include at least one actionable intent;
in accordance with a determination that the one or more candidate intents include at least one actionable intent, executing the at least one actionable intent; 
in accordance with a determination that the one or more candidate intents include at least one actionable intent, executing the at least one actionable intent;
outputting a result of the execution of the at least one actionable intent.
outputting a result of the execution of the at least one actionable intent.
Claim 2: The electronic device of claim 1, wherein the lexical trigger is a single-word lexical trigger.
Claim 2: The electronic device of claim 1, wherein the lexical trigger is a single-word lexical trigger.
Claim 3: The electronic device of claim 2, wherein the first audio stream includes a first utterance, and wherein the single-word lexical trigger is positioned in a portion of the first utterance other than the beginning portion of the first utterance.
Claim 3: The electronic device of claim 2, wherein the first audio stream includes a first utterance, and wherein the single-word lexical trigger is positioned in a portion of the first utterance other than the beginning portion of the first utterance.	
Claim 4: The electronic device of claim 1, wherein determining whether the first audio stream includes a lexical trigger comprises: 
Claim 4: The electronic device of claim 1, wherein determining whether the first audio stream includes a lexical trigger comprises: 
detecting a beginning point of the first audio stream; 
detecting a beginning point of the first audio stream; 
detecting an end point of the first audio stream; and 
detecting an end point of the first audio stream; and 
determining whether a lexical trigger is included between the beginning point and the end point of the first audio stream.
determining whether a lexical trigger is included between the beginning point and the end point of the first audio stream.
Claim 5: The electronic device of claim 4, wherein detecting the beginning point of the first audio stream comprises: 
Claim 5: The electronic device of claim 4, wherein detecting the beginning point of the first audio stream comprises: 
detecting, via the microphone, an absence of voice activity before receiving the first audio stream; 
detecting, via the microphone, an absence of voice activity before receiving the first audio stream; 
determining whether the absence of voice activity before receiving the first audio stream exceeds a first threshold period of time; and 
determining whether the absence of voice activity before receiving the first audio stream exceeds a first threshold period of time; and 
in accordance with a determination that the absence of voice activity exceeds the first threshold period of time, determining the beginning point of the first audio stream based on the absence of voice activity before receiving the first audio stream.
in accordance with a determination that the absence of voice activity exceeds the first threshold period of time, determining the beginning point of the first audio stream based on the absence of voice activity before receiving the first audio stream.
Claim 6: The electronic device of claim 4, wherein detecting the end point of the first audio stream comprises: 
Claim 6: The electronic device of claim 4, wherein detecting the end point of the first audio stream comprises: 
detecting, via the microphone, an absence of voice activity after receiving the one or more utterances of the first audio stream; 
detecting, via the microphone, an absence of voice activity after receiving the one or more utterances of the first audio stream;

determining whether the absence of voice activity after receiving the one or more utterances of the first audio stream exceeds a second threshold period of time; and 
in accordance with a determination that the absence of voice activity after receiving the one or more utterances of the first audio stream exceeds the second threshold period of time, determining the end point of the first audio stream based on the absence of voice activity after receiving the one or more utterances of the first audio stream.
in accordance with a determination that the absence of voice activity after receiving the one or more utterances of the first audio stream exceeds the second threshold period of time, determining the end point of the first audio stream based on the absence of voice activity after receiving the one or more utterances of the first audio stream.
Claim 7: The electronic device of claim 4, wherein detecting the end point of the first audio stream comprises: 107681599100Attorney Docket No.: P37148USC1/77870000293201 
Claim 7: The electronic device of claim 4, wherein detecting the end point of the first audio stream comprises: 
obtaining a pre-configured duration that the electronic device is configured to receive the first audio stream; and 
obtaining a pre-configured duration that the electronic device is configured to receive the first audio stream; and 
determining the end point of the first audio stream based on the detected beginning point of the first audio stream and the pre-configured duration.
determining the end point of the first audio stream based on the detected beginning point of the first audio stream and the pre-configured duration.
Claim 8: The electronic device of claim 4, wherein detecting the end point of the first audio stream comprises: 
Claim 8: The electronic device of claim 4, wherein detecting the end point of the first audio stream comprises: 
determining a size of an audio file representing the received one or more utterances of the first audio stream; 
determining a size of an audio file representing the received one or more utterances of the first audio stream; 
comparing the size of the audio file with a capacity of a buffer storing the audio file; and 
comparing the size of the audio file with a capacity of a buffer storing the audio file; and 
determining the end point of the first audio stream based on a result of comparing the size of the audio file with the capacity of the buffer storing the audio file.
determining the end point of the first audio stream based on a result of comparing the size of the audio file with the capacity of the buffer storing the audio file.
Claim 9: The electronic device of claim 1, wherein the one or more utterances of the first audio stream include at least one utterance that is not directed to the virtual assistant.
Claim 9: The electronic device of claim 1, wherein the one or more utterances of the first audio stream include at least one utterance that is not directed to the virtual assistant.
Claim 10: The electronic device of claim 1, wherein generating one or more 
Claim 10: The electronic device of claim 1, wherein generating one or more 

performing speech-to-text conversion of each of the one or more utterances of the first audio stream to generate the one or more candidate text representations; and 
determining confidence levels corresponding to the one or more candidate text representations.
determining confidence levels corresponding to the one or more candidate text representations.
Claim 11: The electronic device of claim 1, wherein determining whether the at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant comprises: 
Claim 11: The electronic device of claim 1, wherein determining whether the at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant comprises: 
determining whether the at least one candidate text representation includes the lexical trigger; and 
determining whether the at least one candidate text representation includes the lexical trigger; and 
in accordance with a determination that the at least one candidate text representation does not include the lexical trigger, estimating a likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant; and 
in accordance with a determination that the at least one candidate text representation does not include the lexical trigger, estimating a likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant; and 
determining, based on the estimated likelihood, whether the at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant.
determining, based on the estimated likelihood, whether the at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant.
Claim 12: The electronic device of claim 11, wherein estimating the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant comprises: 
Claim 12: The electronic device of claim 11, wherein estimating the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant comprises: 
obtaining context information associated with a usage pattern of the virtual assistant; and 
obtaining context information associated with a usage pattern of the virtual assistant; and 
estimating, based on the context information associated with the usage pattern of the virtual assistant, the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant.
estimating, based on the context information associated with the usage pattern of the virtual assistant, the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant.
Claim 13: The electronic device of claim 11, wherein estimating the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant comprises: 
Claim 13: The electronic device of claim 11, wherein estimating the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant comprises: 
estimating, based on the sensory data, the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant.
…estimating, based on sensory data, the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant.
Claim 14: The electronic device of claim 1, wherein generating the one or more candidate intents based on the candidate text representations of the one or more candidate text representations other than the to be disregarded at least one candidate text representation comprises: 
Claim 1: wherein generating the one or more candidate intents comprises: 
obtaining one or more pre-mitigation intents corresponding to the one or more candidate text representations of the one or more utterances; and 
obtaining one or more pre-mitigation intents corresponding to the one or more candidate text representations of the one or more utterances, including obtaining a pre-mitigation intent corresponding to the to be disregarded at least one candidate text representation; and
selecting, from the one or more pre-mitigation intents, the one or more candidate intents corresponding to the one or more candidate text representations other than the to be disregarded at least one candidate text representation.
selecting, from the one or more pre-mitigation intents, the one or more candidate intents corresponding to the one or more candidate text representations other than the to be disregarded at least one candidate text representation, 
Claim 15: The electronic device of claim 1, wherein determining whether the one or more candidate intents include at least one actionable intent comprises: 
Claim 14: The electronic device of claim 1, wherein determining whether the one or more candidate intents include at least one actionable intent comprises: 
determining, for each of the one or more candidate intents, whether a task can be performed; and 
determining, for each of the one or more candidate intents, whether a task can be performed; and 
in accordance with a determination that the task can be performed, determining that the one or more candidate intents include at least one actionable intent.
in accordance with a determination that the task can be performed, determining that the one or more candidate intents include at least one actionable intent.
Claim 16: The electronic device of claim 15, wherein determining whether the task can be performed comprises: 
Claim 15: The electronic device of claim 14, wherein determining whether the task can be performed comprises: 

obtaining context information associated with a usage pattern of the virtual assistant; and 
determining, based on the context information associated with the usage pattern of the virtual assistant, whether the task can be performed.
determining, based on the context information associated with the usage pattern of the virtual assistant, whether the task can be performed.
Claim 17: The electronic device of claim 15, wherein determining whether the task can be performed comprises: 
Claim 16: The electronic device of claim 14, wherein determining whether the task can be performed comprises: 
obtaining context information associated with a previous task performed by the virtual assistant; and 
obtaining context information associated with a previous task performed by the virtual assistant; and 
determining, based on the context information associated with the previous task performed by the virtual assistant, whether the task can be performed.
determining, based on the context information associated with the previous task performed by the virtual assistant, whether the task can be performed.
Claim 18: The electronic device of claim 15, wherein determining whether the task can be performed comprises: 
Claim 17: The electronic device of claim 14, wherein determining whether the task can be performed comprises: 
determining one or more relations among the one or more candidate text representations; 
determining one or more relations among the one or more candidate text representations; 
determining whether the task can be performed based on the one or more relations among the one or more candidate text representations.
determining whether the task can be performed based on the one or more relations among the one or more candidate text representations.
Claim 19: The electronic device of claim 15, wherein determining whether the task can be performed comprises: 
Claim 18: The electronic device of claim 14, wherein determining whether the task can be performed comprises: 
obtaining sensory data from one or more sensors communicatively coupled to the electronic device; and 
obtaining sensory data from one or more sensors communicatively coupled to the electronic device; and 
determining whether the task can be performed based on the sensory data.
determining whether the task can be performed based on the sensory data.
Claim 20: The electronic device of claim 15, wherein determining whether the task can be performed comprises: 
Claim 19: The electronic device of claim 14, wherein determining whether the task can be performed comprises: 
estimating a confidence level associated with performing the task; 
estimating a confidence level associated with performing the task; 
determining whether the confidence level associated with performing the task satisfies a threshold confidence level; and 
determining whether the confidence level associated with performing the task satisfies a threshold confidence level; and 
in accordance with a determination that the confidence level associated with performing the task satisfies the threshold 

Claim 21: The electronic device of claim 1, wherein executing the at least one actionable intent comprises: 
Claim 20: The electronic device of claim 1, wherein executing the at least one actionable intent comprises: 
performing one or more tasks according to the at least one actionable intent.
performing one or more tasks according to the at least one actionable intent.
Claim 22: The electronic device of claim 1, wherein the one or more candidate intents includes a plurality of actionable intents, and where executing the at least one actionable intent comprises: 
Claim 21: The electronic device of claim 1, wherein the one or more candidate intents includes a plurality of actionable intents, and where executing the at least one actionable intent comprises: 
selecting, from a plurality of tasks associated with the plurality of actionable intents, a first task for execution; and 
selecting, from a plurality of tasks associated with the plurality of actionable intents, a first task for execution; and 
performing the selected first task.
performing the selected first task.
Claim 23: The electronic device of claim 22, wherein selecting, from the plurality of tasks associated with the plurality of actionable intents, the first task for execution comprises: 
Claim 22: The electronic device of claim 21, wherein selecting, from the plurality of tasks associated with the plurality of actionable intents, the first task for execution comprises: 
obtaining context information associated with a most-recent task initiated by the virtual assistant; and
obtaining context information associated with a most-recent task initiated by the virtual assistant; and 
selecting the first task based on the context information associated with a previous task performed by the virtual assistant.
selecting the first task based on the context information associated with a previous task performed by the virtual assistant.
Claim 24: The electronic device of claim 22, wherein selecting, from the plurality of tasks associated with the plurality of actionable intents, the first task for execution comprises: 
Claim 23: The electronic device of claim 21, wherein selecting, from the plurality of tasks associated with the plurality of actionable intents, the first task for execution comprises: 
outputting a plurality of task options corresponding to the plurality of tasks associated with the plurality of actionable intents;
outputting a plurality of task options corresponding to the plurality of tasks associated with the plurality of actionable intents; 
receiving a user selection from the plurality of task options; and 
receiving a user selection from the plurality of task options; and 
selecting the first task based on the user selection.
selecting the first task based on the user selection.
Claim 25: The electronic device of claim 22, wherein selecting, from the plurality of tasks associated with the plurality of actionable intents, the first task for execution comprises: 
Claim 24: The electronic device of claim 21, wherein selecting, from the plurality of tasks associated with the plurality of actionable intents, the first task for execution comprises: 

determining a priority associated with each of the plurality of tasks; and 
selecting the first task for execution based on the priority associated with each of the plurality of tasks.
selecting the first task for execution based on the priority associated with each of the plurality of tasks.
Claim 26: The electronic device of claim 1, wherein the one or more programs comprise further instructions for: 
Claim 25: The electronic device of claim 1, wherein the one or more programs comprise further instructions for: 
upon executing the at least one actionable intent, receiving, via the microphone, a second audio stream; 
upon executing the at least one actionable intent, receiving, via the microphone, a second audio stream; 
generating one or more second candidate text representations to represent the second audio stream; 
generating one or more second candidate text representations to represent the second audio stream; 
determining, based on the one or more second candidate text representations, whether the second audio stream is a part of an audio session that includes the first audio stream; 
determining, based on the one or more second candidate text representations, whether the second audio stream is a part of an audio session that includes the first audio stream; 
in accordance with a determination that the second audio stream is a part of the audio session that includes the first audio stream, generating, based on the one or more second candidate text representations, one or more second candidate intents; 
in accordance with a determination that the second audio stream is a part of the audio session that includes the first audio stream, generating, based on the one or more second candidate text representations, one or more second candidate intents; 
determining whether the one or more second candidate intents include at least one second actionable intent; 
determining whether the one or more second candidate intents include at least one second actionable intent; 
in accordance with a determination that the one or more second candidate intents include at least one second actionable intent, executing the at least one second actionable intent; and 
in accordance with a determination that the one or more second candidate intents include at least one second actionable intent, executing the at least one second actionable intent; and 
outputting a result of the execution of the at least one second actionable intent.
outputting a result of the execution of the at least one second actionable intent.
Claim 27: The electronic device of claim 26, wherein determining whether the second audio stream is a part of the audio session that includes the first audio stream comprises: 
Claim 26: The electronic device of claim 25, wherein determining whether the second audio stream is a part of the audio session that includes the first audio stream comprises: 
obtaining context information associated with executing the at least one actionable intent; and 
obtaining context information associated with executing the at least one actionable intent; and 
determining, based on the context information associated with executing the 

Claim 28: The electronic device of claim 26, wherein determining whether the second audio stream is a part of the audio session that includes the first audio stream comprises: 
Claim 27: The electronic device of claim 25, wherein determining whether the second audio stream is a part of the audio session that includes the first audio stream comprises: 
determining a relation among respective candidate text representations of the first audio stream and the second audio stream; and 
determining a relation among respective candidate text representations of the first audio stream and the second audio stream; and 
determining, based on the relation among respective candidate text representations of the first audio stream and the second audio stream, whether the second audio stream is a part of the audio session that includes the first audio stream.
determining, based on the relation among respective candidate text representations of the first audio stream and the second audio stream, whether the second audio stream is a part of the audio session that includes the first audio stream.
Claim 29: A method for providing natural language interaction by a virtual assistant, the method comprising: 
Claim 28: A method for providing natural language interaction by a virtual assistant, the method comprising:
at an electronic device with one or more processors, memory, and a microphone: 
at an electronic device with one or more processors, memory, and a microphone:
receiving, via a microphone, a first audio stream including one or more utterances; 
receiving, via a microphone, a first audio stream including one or more utterances;
determining whether the first audio stream includes a lexical trigger; 
determining whether the first audio stream includes a lexical trigger; 
in accordance with a determination that the first audio stream includes the lexical trigger, generating one or more candidate text representations of the one or more utterances; 
 in accordance with a determination that the first audio stream includes the lexical trigger, generating one or more candidate text representations of the one or more utterances;
determining whether at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant based on sensory data obtained from one or more sensors of the electronic device; 
determining whether at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant; Claim 32:  estimating, based on sensory data, the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant
in accordance with a determination that at least one candidate text representation is to be disregarded by the virtual assistant, generating one or more candidate intents 
Claim 28: in accordance with a determination that at least one candidate text representation is to be disregarded by the virtual assistant, generating one or 

determining whether the one or more candidate intents include at least one actionable intent;
in accordance with a determination that the one or more candidate intents include at least one actionable intent, executing the at least one actionable intent; 
in accordance with a determination that the one or more candidate intents include at least one actionable intent, executing the at least one actionable intent; 
outputting a result of the execution of the at least one actionable intent.
outputting a result of the execution of the at least one actionable intent.
Claim 30: A non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processors of an electronic device, the one or more programs including instructions for: 
Claim 39: A non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processors of an electronic device, the one or more programs including instructions for:
receiving, via a microphone, a first audio stream including one or more utterances; 
receiving, via a microphone, a first audio stream including one or more utterances;
determining whether the first audio stream includes a lexical trigger; 
determining whether the first audio stream includes a lexical trigger;
in accordance with a determination that the first audio stream includes the lexical trigger, generating one or more candidate text representations of the one or more utterances; 
in accordance with a determination that the first audio stream includes the lexical trigger, generating one or more candidate text representations of the one or more utterances;
determining whether at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant based on sensory data obtained from one or more sensors of the electronic device; 
 determining whether at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant; Claim 43: estimating, based on sensory data, the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant.
in accordance with a determination that at least one candidate text representation is to be disregarded by the virtual assistant, generating one or more candidate intents based on candidate text representations 
Claim 39: in accordance with a determination that at least one candidate text representation is to be disregarded by the virtual assistant, generating one or more candidate intents based on 

determining whether the one or more candidate intents include at least one actionable intent; 
in accordance with a determination that the one or more candidate intents include at least one actionable intent, executing the at least one actionable intent; 
in accordance with a determination that the one or more candidate intents include at least one actionable intent, executing the at least one actionable intent; 
outputting a result of the execution of the at least one actionable intent.
outputting a result of the execution of the at least one actionable intent



Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-3, 9-21, and 26-30 are rejected under 35 U.S.C. 103 as being unpatentable over Koshida (US PAP 2018/0233140) in view of Konig (US PAP 2005/00216271).

Consider claim 1, Koshida teaches an electronic device (abstract), comprising: 
one or more processors (0042, processors); 
a microphone (0039, microphone); and 

receiving, via the microphone, a first audio stream including one or more utterances (0040, receiving speech); 
generating one or more candidate text representations of the one or more utterances (0040, translate speech to text via speech recognition); 
determining whether at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant (0206, subfragment may be disregarded if speaker is different) based on sensory data obtained from one or more sensors of the electronic device (0033, sensor data to detect when a user engaged with another device, also see 0081, 0146-48, entity tracker uses sensors to identify entities and speakers); 
in accordance with a determination that at least one candidate text representation is to be disregarded by the virtual assistant, generating one or more candidate intents based on candidate text representations of the one or more candidate text representations other than the to be disregarded at least one candidate text representation (0206, other fragments may be used to generate user intention); 
determining whether the one or more candidate intents include at least one actionable intent (0206, determine user intention, 0041, intent handler determines if not ambiguous (actionable)); 
in accordance with a determination that the one or more candidate intents include at least one actionable intent, executing the at least one actionable intent (0041, executing intent); 

outputting a result of the execution of the at least one actionable intent (0041, outputting execution result through speaker, video, or other device).
Koshida does not specifically teach 
determining whether the first audio stream includes a lexical trigger; 
in accordance with a determination that the first audio stream includes the lexical trigger, generating one or more candidate text representations of the one or more utterances.
In the same field of processing commands, Konig teaches determining whether the first audio stream includes a lexical trigger (0022, identify waking key word i.e. “car”); 
in accordance with a determination that the first audio stream includes the lexical trigger, generating one or more candidate text representations of the one or more utterances (0027-28, if keyword identified, system attempts to identify a command).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use a lexical trigger as taught by Konig in the system of Koshida in order to make the process more user friendly to activate the command system (Konig 0008-09).
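
For illustration only, the combination applied to claim 1 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not code from Koshida or Konig. The names Candidate, contains_trigger, and texts_for_intent_generation are hypothetical. The trigger check is run on recognized text rather than on audio for simplicity, and the directed_at_assistant flag stands in for whatever sensor-based determination is made.

```python
import re
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical candidate text representation of one utterance."""
    text: str
    # Assumption: a boolean summarizing the sensor-based decision on
    # whether the utterance is directed to the assistant.
    directed_at_assistant: bool


def contains_trigger(candidates: List[Candidate], trigger: str) -> bool:
    """Return True if any candidate contains the trigger as a whole word."""
    wanted = trigger.lower()
    return any(
        wanted in re.findall(r"[a-z']+", c.text.lower()) for c in candidates
    )


def texts_for_intent_generation(
    candidates: List[Candidate], trigger: str
) -> List[str]:
    """Gate on the trigger, then drop the candidates to be disregarded."""
    if not contains_trigger(candidates, trigger):
        return []  # no lexical trigger: nothing is passed on
    return [c.text for c in candidates if c.directed_at_assistant]
```

In this sketch, an utterance flagged as not directed to the assistant is filtered out before any intent is generated from the remaining text.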

Consider claim 2, Konig teaches the electronic device of claim 1, wherein the lexical trigger is a single-word lexical trigger (0022, single key word i.e. “car”).

Consider claim 3, Konig teaches the electronic device of claim 2, wherein the first audio stream includes a first utterance, and wherein the single-word lexical trigger is positioned in a portion of the first utterance other than the beginning portion of the first utterance (i.e. example 2, 0039-40).

Consider claim 9, Konig teaches the electronic device of claim 1, wherein the one or more utterances of the first audio stream include at least one utterance that is not directed to the virtual assistant (0022, “I bought a new car,” not directed to assistant).

Consider claim 10, Koshida teaches the electronic device of claim 1, wherein generating one or more candidate text representations of the one or more utterances comprises: 
performing speech-to-text conversion of each of the one or more utterances of the first audio stream to generate the one or more candidate text representations (0040, translate speech to text); and 
determining confidence levels corresponding to the one or more candidate text representations (0040, assign confidence values to recognition texts).

Consider claim 11, Koshida and Konig teach the electronic device of claim 1, wherein determining whether the at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant comprises: 

in accordance with a determination that the at least one candidate text representation does not include the lexical trigger, estimating a likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant (Konig 0041-44, even without lexical trigger, system may determine user is speaking command); and 
determining, based on the estimated likelihood, whether the at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant (0041-44, even without lexical trigger, system may determine user is speaking command).

Consider claim 12, Koshida teaches the electronic device of claim 11, wherein estimating the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant comprises: 
obtaining context information associated with a usage pattern of the virtual assistant (0066-68, using context); and 
estimating, based on the context information associated with the usage pattern of the virtual assistant, the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant (0066-68 using context to determine user intent and thus user command).

Consider claim 13, Koshida teaches the electronic device of claim 11, wherein estimating the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant comprises: 
estimating, based on sensory data, the likelihood that the utterance corresponding to the at least one candidate text representation is not directed to the virtual assistant (0081, intent handler will use sensor data to determine user intent, 0033, sensor data to detect when a user engaged with another device, also see 0146-48, entity tracker uses sensors to identify entities and speakers).

Consider claim 14, Koshida teaches the electronic device of claim 1, wherein generating the one or more candidate intents based on the candidate text representations of the one or more candidate text representations other than the to be disregarded at least one candidate text representation comprises: 
obtaining one or more pre-mitigation intents corresponding to the one or more candidate text representations of the one or more utterances (0206, sentence fragments of the audio stream); and 
selecting, from the one or more pre-mitigation intents, the one or more candidate intents corresponding to the one or more candidate text representations other than the to be disregarded at least one candidate text representation (0206, sentence fragments other than the disregarded fragments used to generate intent.).

Consider claim 15, Koshida teaches the electronic device of claim 1, wherein determining whether the one or more candidate intents include at least one actionable intent comprises: 
determining, for each of the one or more candidate intents, whether a task can be performed (0041, determine presence of missing or ambiguous data); and 
in accordance with a determination that the task can be performed, determining that the one or more candidate intents include at least one actionable intent (0041, if missing or ambiguous data not present or can be determined, intent is actionable and is acted upon).

Consider claim 16, Koshida teaches the electronic device of claim 15, wherein determining whether the task can be performed comprises: 
obtaining context information associated with a usage pattern of the virtual assistant (0116-18, previous intent content may be used); and 
determining, based on the context information associated with the usage pattern of the virtual assistant, whether the task can be performed (0116-18, determine intent from previous content).

Consider claim 17, Koshida teaches the electronic device of claim 15, wherein determining whether the task can be performed comprises: 
obtaining context information associated with a previous task performed by the virtual assistant (0116-18, previous intent content may be used); and 


Consider claim 18, Koshida teaches the electronic device of claim 15, wherein determining whether the task can be performed comprises: 
determining one or more relations among the one or more candidate text representations (0206, determine segments that came from same speaker); 
determining whether the task can be performed based on the one or more relations among the one or more candidate text representations (0206, using segments from same speaker to determine intent).

Consider claim 19, Koshida teaches the electronic device of claim 15, wherein determining whether the task can be performed comprises: 
obtaining sensory data from one or more sensors communicatively coupled to the electronic device (0081, sensor data received from sensors); and 
determining whether the task can be performed based on the sensory data (0081, intent handler will use sensor data to determine user intent.).

Consider claim 20, Koshida teaches the electronic device of claim 15, wherein determining whether the task can be performed comprises: 
estimating a confidence level associated with performing the task (0053, confidence level that the text is accurate); 

in accordance with a determination that the confidence level associated with performing the task satisfies the threshold confidence level, determining that the task can be performed (0058, confidence values used to determine user intent).

Consider claim 21, Koshida teaches the electronic device of claim 1, wherein executing the at least one actionable intent comprises: 
performing one or more tasks according to the at least one actionable intent (0041, if missing or ambiguous data not present or can be determined, intent is actionable and is acted upon).

Consider claim 26, Koshida teaches the electronic device of claim 1, wherein the one or more programs comprise further instructions for: 
upon executing the at least one actionable intent, receiving, via the microphone, a second audio stream (0040, receiving speech, 0080, after prior utterances in user history); 
generating one or more second candidate text representations to represent the second audio stream (0040, translate speech to text via speech recognition); 
determining, based on the one or more second candidate text representations, whether the second audio stream is a part of an audio session that includes the first audio stream (0080, determining if ambiguities can be resolved from prior utterances); 

determining whether the one or more second candidate intents include at least one second actionable intent (0206, determine user intention, 0041, intent handler determines if not ambiguous (actionable)); 
in accordance with a determination that the one or more second candidate intents include at least one second actionable intent, executing the at least one second actionable intent (0041, executing intent); and 
outputting a result of the execution of the at least one second actionable intent (0041, outputting execution result through speaker, video, or other device).

Consider claim 27, Koshida teaches the electronic device of claim 26, wherein determining whether the second audio stream is a part of the audio session that includes the first audio stream comprises: 
obtaining context information associated with executing the at least one actionable intent (0116-18, previous intent content may be used); and 
determining, based on the context information associated with executing the at least one actionable intent, whether the second audio stream is a part of the audio session that includes the first audio stream (0116-18, previous intent content may be used to select new intent).

Consider claim 28, Koshida teaches the electronic device of claim 26, wherein determining whether the second audio stream is a part of the audio session that includes the first audio stream comprises: 
determining a relation among respective candidate text representations of the first audio stream and the second audio stream (0206, determine segments that came from same speaker); and 
determining, based on the relation among respective candidate text representations of the first audio stream and the second audio stream, whether the second audio stream is a part of the audio session that includes the first audio stream (0206, using segments from same speaker to determine intent).

Consider claim 29, Koshida teaches a method for providing natural language interaction by a virtual assistant (abstract), the method comprising: 
at an electronic device with one or more processors (0042, processors), memory (0042, memory), and a microphone (0039, microphone): 
receiving, via the microphone, a first audio stream including one or more utterances (0040, receiving speech); 
generating one or more candidate text representations of the one or more utterances (0040, translate speech to text via speech recognition); 
determining whether at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant (0206, subfragment may be disregarded if speaker is different) based on sensory data obtained from one or more sensors of the electronic device (0033, sensor data to detect when a user engaged with another device, also see 0081, 0146-48, entity tracker uses sensors to identify entities and speakers); 
in accordance with a determination that at least one candidate text representation is to be disregarded by the virtual assistant, generating one or more candidate intents based on candidate text representations of the one or more candidate text representations other than the to be disregarded at least one candidate text representation (0206, other fragments may be used to generate user intention); 
determining whether the one or more candidate intents include at least one actionable intent (0206, determine user intention, 0041, intent handler determines if not ambiguous (actionable)); 
in accordance with a determination that the one or more candidate intents include at least one actionable intent, executing the at least one actionable intent (0041, executing intent); 
outputting a result of the execution of the at least one actionable intent (0041, outputting execution result through speaker, video, or other device).
Koshida does not specifically teach 
determining whether the first audio stream includes a lexical trigger; 
in accordance with a determination that the first audio stream includes the lexical trigger, generating one or more candidate text representations of the one or more utterances.
In the same field of processing commands, Konig teaches determining whether the first audio stream includes a lexical trigger (0022, identify waking key word i.e. “car”); 
in accordance with a determination that the first audio stream includes the lexical trigger, generating one or more candidate text representations of the one or more utterances (0027-28, if keyword identified, system attempts to identify a command).

Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use a lexical trigger as taught by Konig in the system of Koshida in order to make the process more user friendly to activate the command system (Konig 0008-09).

Consider claim 30, Koshida teaches a non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processors of an electronic device (0042, memory), the one or more programs including instructions for:
receiving, via the microphone, a first audio stream including one or more utterances (0040, receiving speech); 
generating one or more candidate text representations of the one or more utterances (0040, translate speech to text via speech recognition); 
determining whether at least one candidate text representation of the one or more candidate text representations is to be disregarded by the virtual assistant (0206, subfragment may be disregarded if speaker is different) based on sensory data obtained from one or more sensors of the electronic device (0033, sensor data to detect when a user engaged with another device, also see 0081, 0146-48, entity tracker uses sensors to identify entities and speakers); 
in accordance with a determination that at least one candidate text representation is to be disregarded by the virtual assistant, generating one or more candidate intents based on candidate text representations of the one or more candidate text representations other than the to be disregarded at least one candidate text representation (0206, other fragments may be used to generate user intention); 

determining whether the one or more candidate intents include at least one actionable intent (0206, determine user intention, 0041, intent handler determines if not ambiguous (actionable)); 
in accordance with a determination that the one or more candidate intents include at least one actionable intent, executing the at least one actionable intent (0041, executing intent); 
outputting a result of the execution of the at least one actionable intent (0041, outputting execution result through speaker, video, or other device).
Koshida does not specifically teach 
determining whether the first audio stream includes a lexical trigger; 
in accordance with a determination that the first audio stream includes the lexical trigger, generating one or more candidate text representations of the one or more utterances.
In the same field of processing commands, Konig teaches determining whether the first audio stream includes a lexical trigger (0022, identify waking key word i.e. “car”); 
in accordance with a determination that the first audio stream includes the lexical trigger, generating one or more candidate text representations of the one or more utterances (0027-28, if keyword identified, system attempts to identify a command).

Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use a lexical trigger as taught by Konig in the system of Koshida in order to make the process more user friendly to activate the command system (Konig 0008-09).

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Koshida and Konig as applied to claim 1 above, and further in view of Bou-Ghazale et al. (Hands Free Voice Activation of Personal Communication Devices).

Consider claim 4, Koshida and Konig teach the electronic device of claim 1, but do not specifically teach wherein determining whether the first audio stream includes a lexical trigger comprises: 
detecting a beginning point of the first audio stream; 
detecting an end point of the first audio stream; and 
determining whether a lexical trigger is included between the beginning point and the end point of the first audio stream.
In the same field of voice commands, Bou-Ghazale teaches wherein determining whether the first audio stream includes a lexical trigger comprises: 
detecting a beginning point of the first audio stream (section 2.1, endpoint detection detects beginning and end of speech segments); 

determining whether a lexical trigger is included between the beginning point and the end point of the first audio stream (section 2, and 2.2, activation word searched between endpoints).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use input detection as taught by Bou-Ghazale in the system of Koshida and Konig to reduce false activations (Bou-Ghazale section 2.1).

Claims 5 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over Koshida, Konig, and Bou-Ghazale as applied to claim 4 above, and further in view of Liu et al. (Accurate Endpointing with Expected Pause Duration).

Consider claim 5, Koshida, Konig, and Bou-Ghazale teach the electronic device of claim 4, but do not specifically teach wherein detecting the beginning point of the first audio stream comprises: 
detecting, via the microphone, an absence of voice activity before receiving the first audio stream; 
determining whether the absence of voice activity before receiving the first audio stream exceeds a first threshold period of time; and 
in accordance with a determination that the absence of voice activity exceeds the first threshold period of time, determining the beginning point of the first audio stream based on the absence of voice activity before receiving the first audio stream.
In the same field of audio endpointing, Liu teaches 

detecting, via the microphone, an absence of voice activity before receiving the first audio stream (section 1, detecting non speech regions); 
determining whether the absence of voice activity before receiving the first audio stream exceeds a first threshold period of time (section 1, determining if length of non-speech region exceeds threshold); and 
in accordance with a determination that the absence of voice activity exceeds the first threshold period of time, determining the beginning point of the first audio stream based on the absence of voice activity before receiving the first audio stream (section 1, mark endpoints based on pause duration exceeding threshold).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use time thresholds to determine endpoints as taught by Liu in the system of Koshida and Konig and Bou-Ghazale in order to use a classic and well known method of determining utterance boundaries (Liu section 1).

Consider claim 6, Koshida, Konig, and Bou-Ghazale teach the electronic device of claim 4, but do not specifically teach wherein detecting the end point of the first audio stream comprises: 
detecting, via the microphone, an absence of voice activity after receiving the one or more utterances of the first audio stream; 
determining whether the absence of voice activity after receiving the one or more utterances of the first audio stream exceeds a second threshold period of time; and 
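in accordance with a determination that the absence of voice activity after receiving the one or more utterances of the first audio stream exceeds the second threshold period of time, determining the end point of the first audio stream based on the absence of voice activity after receiving the one or more utterances of the first audio stream.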
In the same field of audio endpointing, Liu teaches 
detecting, via the microphone, an absence of voice activity after receiving the one or more utterances of the first audio stream (section 1, detecting non speech regions); 
determining whether the absence of voice activity after receiving the one or more utterances of the first audio stream exceeds a second threshold period of time (section 1, determining if length of non-speech region exceeds threshold); and 
in accordance with a determination that the absence of voice activity after receiving the one or more utterances of the first audio stream exceeds the second threshold period of time, determining the end point of the first audio stream based on the absence of voice activity after receiving the one or more utterances of the first audio stream (section 1, mark endpoints based on pause duration exceeding threshold).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use time thresholds to determine endpoints as taught by Liu in the system of Koshida and Konig and Bou-Ghazale in order to use a classic and well known method of determining utterance boundaries (Liu section 1).
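
For illustration only, threshold-based endpointing of the kind recited in claims 5 and 6 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the method of Liu or of the application. The function name find_utterances and its parameters are hypothetical, and the input is assumed to be a per-frame voice-activity flag produced by some upstream detector.

```python
from typing import List, Tuple


def find_utterances(
    voiced: List[bool], frame_ms: int, pause_threshold_ms: int
) -> List[Tuple[int, int]]:
    """Split frames into (begin, end) utterances, end index exclusive.

    An utterance is closed once the silence that follows it exceeds
    pause_threshold_ms; shorter pauses stay inside the utterance.
    """
    segments: List[Tuple[int, int]] = []
    start = None        # first voiced frame of the open utterance
    last_voiced = None  # most recent voiced frame of the open utterance
    silence_ms = 0      # length of the current run of unvoiced frames
    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = i  # beginning point
            last_voiced = i
            silence_ms = 0
        elif start is not None:
            silence_ms += frame_ms
            if silence_ms > pause_threshold_ms:
                segments.append((start, last_voiced + 1))  # end point
                start = None
    if start is not None:
        # Input ended before a long enough pause was observed.
        segments.append((start, last_voiced + 1))
    return segments
```

In this sketch a single threshold serves for both the beginning point and the end point, whereas the claims recite separate first and second threshold periods.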

Claims 7 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Koshida, Konig, and Bou-Ghazale as applied to claim 4 above, and further in view of Zhou et al. (US PAP 2018/0033436).

Consider claim 7, Koshida, Konig, and Bou-Ghazale teach the electronic device of claim 4, but do not specifically teach wherein detecting the end point of the first audio stream comprises: 
obtaining a pre-configured duration that the electronic device is configured to receive the first audio stream; and 
determining the end point of the first audio stream based on the detected beginning point of the first audio stream and the pre-configured duration.
In the same field of trigger detection, Zhou teaches wherein detecting the end point of the first audio stream comprises: 
obtaining a pre-configured duration that the electronic device is configured to receive the first audio stream (0085-86, preconfigured buffer of a set duration); and 
determining the end point of the first audio stream based on the detected beginning point of the first audio stream and the pre-configured duration (buffer of predetermined size is filled and passed to recognizer, thus “ending” the speech information).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to determine speech length with a buffer size as taught by Zhou in the system of Koshida and Konig and Bou-Ghazale in order to allow the system to process speech that might have otherwise been lost while the recognizer wakes (Zhou 0070).

Consider claim 8, Koshida and Konig and Bou-Ghazale teach the electronic device of claim 4, but do not specifically teach wherein detecting the end point of the first audio stream comprises: 
determining a size of an audio file representing the received one or more utterances of the first audio stream; 
comparing the size of the audio file with a capacity of a buffer storing the audio file; and 
determining the end point of the first audio stream based on a result of comparing the size of the audio file with the capacity of the buffer storing the audio file.
In the same field of trigger detection, Zhou teaches wherein detecting the end point of the first audio stream comprises: 
determining a size of an audio file representing the received one or more utterances of the first audio stream (0085-86, preconfigured buffer of a set duration); 
comparing the size of the audio file with a capacity of a buffer storing the audio file (0085-86, preconfigured buffer of a set duration); and 
determining the end point of the first audio stream based on a result of comparing the size of the audio file with the capacity of the buffer storing the audio file (buffer of predetermined size is filled and passed to recognizer, thus “ending” the speech information).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to determine speech length with a buffer size as taught by Zhou in the system of Koshida and Konig and Bou-Ghazale in order to allow the system to process speech that might have otherwise been lost while the recognizer wakes (Zhou 0070).
Claims 22-24 are rejected under 35 U.S.C. 103 as being unpatentable over Koshida and Konig as applied to claim 1 above, and further in view of Newendorp et al. (US PAP 2016/0260431).

Consider claim 22, Koshida and Konig teach the electronic device of claim 1, but do not specifically teach wherein the one or more candidate intents includes a plurality of actionable intents, and where executing the at least one actionable intent comprises: 
selecting, from a plurality of tasks associated with the plurality of actionable intents, a first task for execution; and 
performing the selected first task.
In the same field of voice commands, Newendorp teaches wherein the one or more candidate intents includes a plurality of actionable intents (0222, multiple actionable intents), and where executing the at least one actionable intent comprises: 
selecting, from a plurality of tasks associated with the plurality of actionable intents, a first task for execution (0232, actionable intent selected); and 
performing the selected first task (0232, actionable intent selected for performance).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to select an intent from multiple possible intents as taught by Newendorp in the system of Koshida and Konig in order to improve accuracy of command processing (Newendorp 0006-08). 

Consider claim 23, Koshida teaches the electronic device of claim 22, wherein selecting, from the plurality of tasks associated with the plurality of actionable intents, the first task for execution comprises: 
obtaining context information associated with a most-recent task initiated by the virtual assistant (0116-18, previous intent content may be used to determine intent); and 
selecting the first task based on the context information associated with a previous task performed by the virtual assistant (0116-18, previous intent content may be used to determine intent task).

Consider claim 24, Koshida teaches the electronic device of claim 22, wherein selecting, from the plurality of tasks associated with the plurality of actionable intents, the first task for execution comprises: 
outputting a plurality of task options corresponding to the plurality of tasks associated with the plurality of actionable intents (0134, system may output different options when selection is ambiguous); 
receiving a user selection from the plurality of task options (0134, user may select from outputted options); and 
selecting the first task based on the user selection (0134, user may select from outputted options for execution).

Claim 25 is rejected under 35 U.S.C. 103 as being unpatentable over Koshida, Konig, and Newendorp as applied to claim 22 above, and further in view of Thrangarathnam et al. (US PAP 2019/0179607).

Consider claim 25, Koshida, Konig, and Newendorp teach the electronic device of claim 22, but do not specifically teach wherein selecting, from the plurality of tasks associated with the plurality of actionable intents, the first task for execution comprises: 
determining a priority associated with each of the plurality of tasks; and 
selecting the first task for execution based on the priority associated with each of the plurality of tasks.
In the same field of speech command processing, Thrangarathnam teaches wherein selecting, from the plurality of tasks associated with the plurality of actionable intents, the first task for execution comprises: 
determining a priority associated with each of the plurality of tasks (0118, different tasks may have different priorities); and 
selecting the first task for execution based on the priority associated with each of the plurality of tasks (0118, task with higher priority may be selected first).
Therefore it would have been obvious to one of ordinary skill in the art at the time of effective filing to use priority as taught by Thrangarathnam in the system of Koshida, Konig, and Newendorp in order to ensure more important tasks are executed (Thrangarathnam 0118).
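
For illustration only, the priority-based selection recited in claim 25 can be sketched as follows. This is a hypothetical minimal sketch in Python and is not taken from Thrangarathnam, Koshida, Konig, Newendorp, or the application. The names Task, priority, and select_first_task, the sample task names, and the convention that a larger number means a higher priority are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Task:
    """Hypothetical task tied to one actionable intent."""
    name: str
    priority: int  # assumed convention: larger value = higher priority


def select_first_task(tasks: Sequence[Task]) -> Task:
    """Pick the task to execute first, based on each task's priority.

    Ties are resolved in favor of the earliest candidate, because
    max() returns the first maximal element it encounters.
    """
    if not tasks:
        raise ValueError("no candidate tasks to select from")
    return max(tasks, key=lambda task: task.priority)


# Hypothetical candidates, one per actionable intent.
candidates = [
    Task("play_music", priority=1),
    Task("call_contact", priority=3),
    Task("set_timer", priority=2),
]
first_task = select_first_task(candidates)
print(first_task.name)
```

In this sketch the priority is stored with each task ahead of time; a system could equally compute it at selection time, and nothing in the cited paragraph is asserted to require either approach.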

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DOUGLAS C GODBOLD whose telephone number is (571)270-1451. The examiner can normally be reached 6:30am-5pm Monday-Thursday.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

DOUGLAS GODBOLD
Examiner
Art Unit 2655



/DOUGLAS GODBOLD/Primary Examiner, Art Unit 2655