Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-21 are pending. Claims 1 and 20-21 are independent.
This Application was published as U.S. 20220293125.
            Apparent priority: 11 March 2021.
	This Application has 17 sheets of drawings.  Figures 8-10, 11A, and 11B pertain to the Claims with the Written Description beginning on page 60 of the Specification as filed and [0238] of the published Application.  
[0183] As used here, the term “affordance” refers to a user-interactive graphical user interface object that is, for example, displayed on the display screen of devices 200, 400, and/or 600 (FIGS. 2A, 4, and 6A-6B). For example, an image (e.g., icon), a button, and text (e.g., hyperlink) each constitutes an affordance.
Claims
The independent Claims are objected to.
Claim 1 includes:
in response to receiving the first speech input, providing a response based on the first speech input; and
providing a first output corresponding to a digital assistant in a first state;

Based on the language, the punctuation, and the indentation, it is not clear whether the second “providing” is or is not also “in response to receiving the first speech input”:

    [Image: media_image1.png]

Note the contrast with the last limitation, where a colon “:” and the indentation make clear that the last two lines are both in response to the confidence level exceeding a threshold.

Claims 20 and 21 suffer from the same issue.	
Appropriate correction is required.

Suggestion:
in response to receiving the first speech input, 
providing a response based on the first speech input [[;]] , and
providing a first output corresponding to a digital assistant in a first state;

Reserve the semicolon “;” for separating complete limitations.  Use a comma “,” as long as the words are still within the same limitation.
Drawings
The Drawings are objected to.
Figures 8 and 9 convey the idea of the Claims.  Figures 8 and 9 can benefit from an appropriate level of detail.
Figures 8 and 9 of the drawings are objected to under 37 CFR 1.83(a) because they fail to show the names of the blocks/modules as described in the specification. Any structural detail that is essential for a proper understanding of the disclosed invention should be shown in the drawing. MPEP § 608.02(d). 
Figures 8 and 9 include the term “OSD,” which is not a term of art and does not appear anywhere in the Specification.  Examiner presumes that this abbreviation is intended to refer to the “on-device speech detector” mentioned only once, in [0244] of the published Application ([0204] of the Specification as filed).  Write out what this term is intended to signify.
 “Value” is a term used in the Claims and refers to 806, 814, 906, and 908 of Figures 8 and 9.  This term should be shown on the Drawings.
“Time” appears to be represented by the horizontal drags, but neither the Drawings nor the Specification indicates so.
Figures 8 and 9 include a series of numbered blocks and another series of numbers that do not include any suitable legend for understanding the drawing. Accordingly, these drawings fail to convey the part of the invention to which they are intended to pertain without referring to the Specification. 
37 C.F.R. 1.84 Standards for drawings:
(o) Legends. Suitable descriptive legends may be used subject to approval by the Office, or may be required by the examiner where necessary for understanding of the drawing. They should contain as few words as possible.
To overcome the objection, refer to Figure 2A, which is a good example of the amount of information that is considered "suitable descriptive legends" and that a drawing should include.
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 6-8, and 19-21 are rejected under 35 U.S.C. 103 as being unpatentable over Piernot (U.S. 20180012596) in view of Guday (U.S. 20190087205).
Regarding Claim 1, Piernot teaches:
1. An electronic device, comprising: [Piernot, Figures 2 and 5 teach the hardware of “user device 102” and “electronic device 500.”]
one or more processors; [Piernot, Figure 2, “processors 204,” Figure 5, “processing unit 508.”]
a memory; and [Piernot, Figure 2, “memory 250.”]
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, [Piernot, Figure 2, “memory 250.”  “[0024] … In some examples, a non-transitory computer-readable storage medium of memory 250 can be used to store instructions (e.g., for performing process 300 and/or 400, described below) for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. In other examples, the instructions (e.g., for performing process 300 and/or 400, described below) can be stored on a non-transitory computer-readable storage medium of server system 110, or can be divided between the non-transitory computer-readable storage medium of memory 250 and the non-transitory computer-readable storage medium of server system 110….”] the one or more programs including instructions for: 
receiving, from a user, a first speech input; [Piernot, Figure 2, “microphone 230.”  Figures 3 and 4, “receive an audio input 302/404” and 304, 306 to Yes or “identify an initial spoken user input from the audio input 406.” ]
in response to receiving the first speech input, 
providing a response based on the first speech input; and [Piernot, Figures 3 and 4, “Respond? 310” to YES and “generate a response to the spoken user input 312” or “generate a response to the first spoken user input 408.”]
providing a first output corresponding to a digital assistant in a first state; 
receiving, from the user, a second speech input; [Piernot, Figure 3 shows a loop back from 312 to 304 such that second, third, etc. speech inputs may be received, Figure 4, “monitor the audio input for a spoken user input 304” to “spoken user input identified? 306” to YES. ]
obtaining a first plurality of values; [Piernot, Figure 2, shows a series of sensors 216, 210, 212, 214, 220, and a “touch screen 246” which can yield the “first plurality of values” of the Claim.]
obtaining, based on the first plurality of values, a first confidence level corresponding to the second speech input; [Piernot, in [0038] teaches that it obtains “a likelihood or confidence score that the user intended for the spoken user input to be directed at the virtual assistant.”  “[0039] The likelihood or confidence score can be determined in any number of ways. … For example, the likelihood or confidence score can be calculated using the general formula of P=C.sub.1+C.sub.2+C.sub.3+ . . . +C.sub.N, where P represents the likelihood or confidence score that the spoken user input was intended for the user device and C.sub.1 . . . C.sub.N can be positive, negative, or zero values representing the positive, negative, or neutral contributions to the likelihood or confidence score from the N different types of contextual information. …”]
in accordance with a determination that the first confidence level exceeds a first threshold confidence level: [Piernot, Figures 3 and 4: 308: “[0037] At block 308, it can be determined whether or not the virtual assistant should respond to the spoken user input by determining whether or not the spoken user input identified at block 304 was intended for the virtual assistant (e.g., the user directed the spoken user input at the virtual assistant and expects the virtual assistant to perform a task or provide a response based on the spoken user input) based on contextual information….”  “[0038] … The calculated likelihood or confidence score can then be compared to a threshold value to determine whether or not the virtual assistant should respond to the spoken user input. For example, if the calculated likelihood or confidence score is greater than the threshold value, it can be determined that the spoken user input was intended for the virtual assistant. If, however, the calculated likelihood or confidence score is not greater than the threshold value, it can be determined that the spoken user input was not intended for the virtual assistant.”]
providing a second output corresponding to the digital assistant in a second state; and [Piernot, Figure 3, “Generate a Response to the Spoken User input 312” in the second round around the loop.  Figure 4, “Generate a Response to the Spoken User input 312.”  The “second state” is taught by the awake state of the virtual assistant that is taught by the use of a trigger phrase (Hey Siri) at 402 in Figure 4.  Piernot is directed to using sensors to do away with the need to use trigger phrases or “start-point identifiers” such as Hey Siri.  However, it does teach that they may be used, e.g., in Figure 4 at 402.  “[0048] At block 402, a start-point identifier can be received. The start-point identifier can include a trigger phrase spoken by the user ….”]
continuing to receive the second speech input. [Piernot, Figures 3 and 4 both include a loop back for receiving more speech input at the end of 312 back to 304.]
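
For illustration only, the scoring and threshold comparison quoted above from Piernot [0038]-[0039] can be sketched as follows. The function names and the numeric values are hypothetical and do not appear in Piernot; only the additive form P = C1 + C2 + ... + CN and the "greater than the threshold" comparison are taken from the quoted passages.

```python
from typing import Sequence


def confidence_score(contributions: Sequence[float]) -> float:
    """Sum the per-signal contributions: P = C1 + C2 + ... + CN.

    Each contribution may be positive, negative, or zero, as in the
    passage quoted from Piernot [0039]. The function name is hypothetical.
    """
    return sum(contributions)


def should_respond(contributions: Sequence[float], threshold: float) -> bool:
    """Return True only when the score is strictly greater than the threshold.

    A score equal to the threshold is treated as "not greater", matching
    the wording quoted from Piernot [0038].
    """
    return confidence_score(contributions) > threshold


if __name__ == "__main__":
    # Hypothetical contributions, e.g. gaze, timing, and location signals.
    print(should_respond([0.5, 0.25, -0.25], threshold=0.25))
    print(should_respond([0.25, 0.0], threshold=0.25))
```

In this sketch a score exactly equal to the threshold does not trigger a response, which follows the "not greater than" wording of the quoted passage.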

	The Claim does not define its first and second states.  These states may correspond to asleep and awake in virtual assistant devices or they may correspond to transitions between different applications that are invoked in response to different commands.
Piernot does not teach “providing a first output corresponding to a digital assistant in a first state.” 

    [Image: media_image2.png]

    [Image: media_image3.png]

    [Image: media_image4.png]

Guday teaches:
in response to receiving the first speech input, [Guday teaches that the user interface is modified and tailored to the changed context of the user.  “[0097] … The device 110 can support one or more input devices 3130—such as a touchscreen 3132; microphone 3134 for implementation of voice input for voice recognition, voice commands and the like;…”  “[0053] FIGS. 9 and 10 show an illustrative use scenario in which the digital assistant 305 provides a modality to a user based on context data and interactions with the user. For example, FIG. 9 provides illustrative user interactions 900 and processes at the digital assistant. In this scenario, the digital assistant observes, with the user's consent, the user's calendar data which shows that the user has a scheduled flight at the airport. In addition, the user requested information regarding what food is available at the airport. After the digital assistant provides the food information to the user, the digital assistant adjusts the modality of the GUI to include the food information.”]
providing a response based on the first speech input; and [ “[0097] … and one or more output devices 3150- such as a speaker 3152 and one or more displays 3154….”]
providing a first output corresponding to a digital assistant in a first state; [Guday, “[0054] For example, as user 130 is a passenger at an airport, the digital assistant creates a particular modality with various travel icons in an active region 505 of the display, as shown in FIG. 10. The digital assistant also includes food information with the travel icons in light of the user interactions illustrated in FIG. 9. In this scenario, the food icons include a food and dining application and websites associated with restaurants at the airport. The various travel icons include a travel application, ticket, and files that are related to the flight or trip. The classic region 510 can be located behind the active region 505.”   “A digital assistant supported on a local device and/or a remote digital assistant service is configured to track contextual data associated with a user and dynamically load or pre-load various modalities to provide increased ease of use for the user. Various modalities can include adjustments to the graphical icons displayed on the user's device, such as the type, shape, color, size, orientation, and position of the icons. The digital assistant may track context data such as the user's location, upcoming schedule in the user's calendar, user interactions with the digital assistant, and the like to determine the best modality for the user. In one exemplary embodiment, the digital assistant may pre-load a modality with travel applications when the digital assistant learns that the user has scheduled a flight. The digital assistant may render the pre-loaded modality when the user arrives at the airport.”  Abstract.]
…
in accordance with a determination that the first confidence level exceeds a first threshold confidence level: [Guday teaches the change of modality based on contextual data but does not discuss that a certain level of confidence needs to be reached.  This is taught by the primary reference.]
providing a second output corresponding to the digital assistant in a second state; and [Guday, “…  In one exemplary embodiment, the digital assistant may pre-load a modality with travel applications when the digital assistant learns that the user has scheduled a flight. The digital assistant may render the pre-loaded modality when the user arrives at the airport.”  Abstract.  “17 … the mobile computing device to iteratively change presentation of the subset of icons according to changes in the developed contextual data, in which each change of presentation of the subset of icons comprises a new modality.”]
continuing to receive the second speech input. [Guday teaches receiving spoken commands.]

Piernot and Guday pertain to digital assistants that respond to spoken commands and it would have been obvious to modify Piernot with Guday to include the change in state/modality associated with different applications as more speech comes in and the determined intent of the speaker changes.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 3, Piernot teaches:
3. The device of claim 1, the one or more programs further comprising instructions for: 
in accordance with a determination that the second speech input is associated with a minimum threshold duration: [Piernot teaches that it uses the length of time between two consecutive speech inputs as a context value indicator of whether the speech is directed to the virtual assistant:  “[0058] In one example rule-based system, one rule that can be used (alone, in combination with other rules, or as one of multiple conditions in other rules) is that if the length of time between consecutive spoken user inputs is less than a threshold duration, then it can be determined that the user intended for the current spoken user input to be directed at the virtual assistant….”]
determining, for each value of the first plurality of values, whether a respective value satisfies at least one rule; and [Piernot uses the length of time between two consecutive speech inputs in a rule-based determination of intent:  “[0058] In one example rule-based system, one rule that can be used (alone, in combination with other rules, or as one of multiple conditions in other rules) is that if the length of time between consecutive spoken user inputs ….”]  
in accordance with a determination that a respective value satisfies at least one rule, increasing the first confidence level. [Piernot teaches a rule that the confidence is increased if the length of time is less than a threshold:  “[0059] … . For example, a length of time less than a threshold duration can contribute a positive value to the final likelihood or confidence score, where the magnitude of the positive value can be greater for shorter lengths of time. Similarly, a length of time greater than or equal to the threshold duration can contribute a zero or negative value to the final likelihood or confidence score, where the magnitude of the negative value can be greater for longer lengths of time….”]
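
As a non-authoritative sketch, the timing rule quoted above from Piernot [0058]-[0059] can be expressed as a single contribution to the confidence score. The linear shape, the function name, and the `scale` parameter are assumptions made for illustration; Piernot only states that shorter gaps contribute larger positive values and that gaps at or beyond the threshold contribute zero or increasingly negative values.

```python
def gap_contribution(gap_seconds: float,
                     threshold_seconds: float,
                     scale: float = 1.0) -> float:
    """Contribution of the time gap between consecutive spoken inputs.

    Assumed linear mapping (hypothetical, not from Piernot):
      gap <  threshold -> positive, larger for shorter gaps
      gap == threshold -> zero
      gap >  threshold -> negative, larger in magnitude for longer gaps
    """
    if threshold_seconds <= 0:
        raise ValueError("threshold_seconds must be positive")
    if gap_seconds < 0:
        raise ValueError("gap_seconds must be non-negative")
    return scale * (threshold_seconds - gap_seconds) / threshold_seconds


if __name__ == "__main__":
    # Hypothetical 4-second threshold.
    for gap in (2.0, 4.0, 8.0):
        print(gap, gap_contribution(gap, threshold_seconds=4.0))
```

The returned value would be one of the terms summed into the overall likelihood or confidence score.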

Regarding Claim 6, Piernot teaches:
6. The device of claim 1, wherein obtaining a first plurality of values comprises: 
detecting a direction associated with a user gaze; and [Piernot teaches eye-tracking: “[0082] In other examples, the image data can be analyzed (e.g., using known eye-tracking techniques) to determine whether or not the user is looking at or facing the user device when the spoken user input was received….”]
obtaining a respective value of the first plurality values based on the determined direction. [Piernot teaches contextual values which include the gaze direction:  “1…. wherein the contextual information comprises a direction of the user's gaze when the first spoken user input was received;…”]

Regarding Claim 7, Piernot teaches:
7. The device of claim 1, wherein obtaining a first plurality of values comprises: 
detecting positional information associated with the electronic device; and [Piernot, Figure 2, the sensors include a GPS: “[0021] For example, user device 102 can include a motion sensor 210, a light sensor 212, and a proximity sensor 214 coupled to peripherals interface 206 to facilitate orientation, light, and proximity sensing functions. One or more other sensors 216, such as a positioning system (e.g., a GPS receiver), a temperature sensor, a biometric sensor, a gyroscope, a compass, an accelerometer, and the like, are also connected to peripherals interface 206, to facilitate related functionalities.”]
obtaining a respective value of the first plurality values based on the positional information. [Piernot, the contextual values that are used in determining whether or not the user is addressing the Assistant include the sensor values such as GPS.  “[0091] In some examples, the contextual information can include location data from a GPS receiver from other sensors 216 of user device 102. The location data can represent a geographical location of the user device. In some examples, receiving a spoken user input while the user device is in certain locations (e.g., at home, in an office, or the like) can be indicative that the user was more likely to have intended for the current spoken user input to be directed at the virtual assistant, while receiving the spoken user input while the user device is in certain other locations (e.g., at a movie theatre, in a conference room, or the like) can be indicative that the user was less likely to have intended for the current spoken user input to be directed at the virtual assistant.”]

Regarding Claim 8, Piernot teaches:
8. The device of claim 1, wherein obtaining a first plurality of values comprises: 
determining whether speech is detected at the electronic device; and [Piernot, Figures 3 and 4 at 304 and 306.  First the device is monitoring the collected audio for speech and determines if there is speech in the input.  “[0034] At block 304, the audio input received at block 302 can be monitored to identify a segment of the audio input that includes or potentially includes a spoken user input….”]
obtaining a respective value of the first plurality values based on the determination that speech is detected at the electronic device. [Piernot, Figures 3 and 4 308. Then if speech is detected, the device determines the likelihood/confidence that the speech was in fact directed at the Assistant device.  The confidence is determined from the context values obtained from the various sensors.  “[0037] At block 308, it can be determined whether or not the virtual assistant should respond to the spoken user input by determining whether or not the spoken user input identified at block 304 was intended for the virtual assistant (e.g., the user directed the spoken user input at the virtual assistant and expects the virtual assistant to perform a task or provide a response based on the spoken user input) based on contextual information….”]

Regarding Claim 19, Piernot teaches:
19. The device of claim 1, 
wherein providing a first output corresponding to a digital assistant in a first state comprises at least one of displaying a digital assistant object in a first state and providing an audible output. [Piernot, Figure 2, speaker 228.  “[0003] … The tasks can then be performed by executing one or more functions of the electronic device and a relevant output can be returned to the user in natural language form.”  “[0013] …generating output responses to the user in an audible (e.g., speech) and/or visual form.”]

Claim 20 is a method claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale.
20. A computer-implemented method, comprising: 
at an electronic device with one or more processors and memory: 
receiving, from a user, a first speech input; 
in response to receiving the first speech input, providing a response based on the first speech input; and 
providing a first output corresponding to a digital assistant in a first state; 
receiving, from the user, a second speech input; 
obtaining a first plurality of values; 
obtaining, based on the first plurality of values, a first confidence level corresponding to the second speech input; 
in accordance with a determination that the first confidence level exceeds a first threshold confidence level: 
providing a second output corresponding to the digital assistant in a second state; and 
continuing to receive the second speech input.

Claim 21 is a computer program product claim with limitations corresponding to the limitations of device Claim 1 and is rejected under similar rationale.
21. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of a first electronic device, cause the first electronic device to: 
receive, from a user, a first speech input; 
in response to receiving the first speech input, provide a response based on the first speech input; and 
provide a first output corresponding to a digital assistant in a first state; 
receive, from the user, a second speech input; 
obtain a first plurality of values; 
obtain, based on the first plurality of values, a first confidence level corresponding to the second speech input; 
in accordance with a determination that the first confidence level exceeds a first threshold confidence level: 
provide a second output corresponding to the digital assistant in a second state; and 
continue to receive the second speech input.


Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Piernot and Guday in view of Weigand (U.S. 5528666).
Support for echo cancellation in the instant Application:  “[0242] Various values and other signals may be utilized in order to determine whether additional speech is directed to the digital assistant. In particular, a first plurality of values 806 is obtained once the digital assistant transitions to the first state. The first plurality of values 806 may be used to determine a first confidence level corresponding to second speech input 808, as described herein. In general, obtaining the first plurality of values may be based on whether the electronic device is configured for echo cancellation. In particular, a device configured for echo cancellation may include functionality to begin collecting and analyzing the first plurality of values (and/or other values) immediately after a user finishes providing a request. With this functionality enabled, the user may interrupt a digital assistant response once the response begins to be provided. For example, the user may utter a first phrase “What is the weather in Cupertino?” In response, the digital assistant may begin to output a response “It's sunny and . . . .” The response may include a text-to-speech (TTS) output, for example, such that the user could interrupt the digital assistant while the response is being provided. In particular, the user may interrupt the digital assistant response with a follow-up utterance “Sorry, I meant San Francisco.” With an echo cancellation enabled device, the first plurality of values 806 are immediately collected and analyzed once the user finishes uttering the first phrase. Accordingly, with an echo cancellation enabled device, any audible response from the digital assistant may be detected and accounted for utilizing echo cancellation, and at the same time, the device may immediately begin analyzing any follow-up speech and related signals after the user finishes providing an initial input. 
Accordingly, in accordance with a determination that the electronic device is configured for echo cancellation, the obtaining of the first plurality of values is initiated in response to a detected end of first speech input 802. Alternatively, in accordance with a determination that the electronic device is not configured for echo cancellation, the obtaining of the first plurality of values is initiated in response to a detected end of provided response 804.”
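
The two branches described in the quoted paragraph can be summarized, purely as an illustrative sketch, as a selection of the event that starts the collection of the first plurality of values. The enum and function names are hypothetical and are not taken from the Application; only the two conditions and the two triggering events come from the quoted text.

```python
from enum import Enum


class CollectionTrigger(Enum):
    """Event that starts collection of the first plurality of values."""
    END_OF_FIRST_SPEECH_INPUT = "detected end of first speech input"
    END_OF_PROVIDED_RESPONSE = "detected end of provided response"


def collection_trigger(echo_cancellation_configured: bool) -> CollectionTrigger:
    """Pick the triggering event per the quoted paragraph of the Application.

    With echo cancellation, collection starts as soon as the user finishes
    the first speech input, so the user may interrupt the response.
    Without it, collection waits until the response has been provided.
    """
    if echo_cancellation_configured:
        return CollectionTrigger.END_OF_FIRST_SPEECH_INPUT
    return CollectionTrigger.END_OF_PROVIDED_RESPONSE


if __name__ == "__main__":
    print(collection_trigger(True).value)
    print(collection_trigger(False).value)
```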

Regarding Claim 2, Piernot teaches:
2. The device of claim 1, the one or more programs further comprising instructions for: 
in accordance with a determination that the electronic device is configured for echo cancellation: initiating the obtaining of the first plurality of values in response to a detected end of the first speech input; and [Piernot teaches speech end-pointing.  Further, the “values” of the Claim are context values collected from the various sensors in order to interpret the speech input.  In Figure 3, the “values” / context is collected after the first speech at 308 in order to determine the intent of the speaker. “[0004] In order for a virtual assistant to properly process and respond to a spoken user input, the virtual assistant can first identify the beginning and end of the spoken user input within a stream of audio input using processes typically referred to as start-pointing and end-pointing, respectively. Conventional virtual assistants can identify these points based on energy levels and/or acoustic characteristics of the received audio stream or manual identification by the user….”  “…  The spoken user input can be identified from the audio input by identifying start and end-points of the spoken user input….”  Abstract.]
in accordance with a determination that the electronic device is not configured for echo cancellation: initiating the obtaining of the first plurality of values in response to a detected end of the provided response. [Piernot, Figure 3, the step of collecting and using context occurs at 308 after the second user input and after the response.]

Piernot does not mention echo cancellation.  
Neither does Guday.
Also, the relevance of echo cancellation to the rest of the Claim is not clear.  This “if, then” appears to be an arbitrary design choice.  Echo cancellation is used in telephone conversations, where the voice of the near-end user comes back from the far end and is mixed with the response of the far-end person.  What the first user said is known and can be canceled (subtracted) from the response.  This is not the case with the scenario described in the instant Application.  All the voices are on the same side here.  What is the echo here?
Weigand teaches:
in accordance with a determination that the electronic device is configured for echo cancellation: initiating the obtaining of the first plurality of values in response to a detected end of the first speech input; and [Weigand, “The echo canceling and duplexing circuit 23 also provides full duplexing capability for the speaker phone 24. Normally, speaker phones are half duplex, which only allows one person to talk at a time. This is necessary to prevent feedback between the speaker and the microphone of the speaker phone 24. The echo canceling and duplexing circuit 23 detects any feedback and removes it before it is received by the other party. This allows for full duplex capability like that provided by a standard telephone handset.”  Col. 2, lines 47-55.]
Piernot/Guday and Weigand pertain to digital assistants and it would have been obvious to modify the system of Piernot/Guday which does not mention echo cancelling because it is directed to spoken commands of a user to his PDA with the system of Weigand which includes echo cancellation as a standard feature in devices that are used as a telephone.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Piernot and Guday in view of Prokofieva (U.S. 20160091967).
Regarding Claim 4, Piernot teaches:
4. The device of claim 1, wherein obtaining a first plurality of values comprises: 
detecting a user gaze directed at a display of the electronic device; [Piernot, Figure 2, teaches a number of sensors that can detect the direction of gaze of the user and teaches that gaze direction is one of the contextual values used to determine whether the speech is directed at the Virtual Assistant Device:  “1…. determining whether to respond to the first spoken user input based on contextual information associated with the first spoken user input, wherein the contextual information comprises a direction of the user's gaze when the first spoken user input was received;…”  “[0082] In other examples, the image data can be analyzed (e.g., using known eye-tracking techniques) to determine whether or not the user is looking at or facing the user device when the spoken user input was received. In these examples, a determination that the user was looking at the user device when the spoken user input was received can be indicative that the user is more likely to have intended for the current spoken user input to be directed at the virtual assistant, while a determination that the user was not looking at the user device when the spoken user input was received can be indicative that the user was less likely to have intended for the current spoken user input to be directed at the virtual assistant or can be neutral regarding the likelihood that the user intended for the current spoken user input to be directed at the virtual assistant.”]
determining whether the user gaze is directed at a displayed digital assistant object; and  [Piernot teaches that it determines whether the user is looking at the device or not but not at particular objects on the display of the device:  “[0045] … In this example, it can be determined (using either the rule-based or probabilistic system) that the virtual assistant should respond to the user's question because the contextual information indicates that the user was looking at the user device while speaking the question and that the volume of the user's voice was above a threshold volume….”]
obtaining a respective value of the first plurality values based on the determination whether the user gaze is directed at the displayed digital assistant object. [Piernot: “12… increasing the likelihood score in response to the direction of the user's gaze being pointed at the electronic device; and decreasing the likelihood score in response to the direction of the user's gaze being pointed away from the electronic device.”]
Piernot teaches that it determines whether the user is looking at the device or not but not at particular objects on the display of the device.
Guday, teaches gaze detection and also determining the object that the user is gazing at: “[0116] The display system 3300 may further include a gaze detection subsystem 3310 configured for detecting a direction of gaze of each eye of a user or a direction or location of focus, as described above….”  “[0117] In addition, a location at which gaze lines projected from the user's eyes intersect the external display may be used to determine an object at which the user is gazing (e.g. a displayed virtual object and/or real background object)….”
Guday is almost sufficient for teaching of the Claim but does not describe how the value of the object that is gazed at is used.
Prokofieva teaches:
determining whether the user gaze is directed at a displayed digital assistant object; and [Prokofieva, Figures 1, 3, and 5.  As shown in Figure 1, the “tracking component 106” detects the particular line/flight/object on the screen at which the user 102 is gazing.]
obtaining a respective value of the first plurality values based on the determination whether the user gaze is directed at the displayed digital assistant object. [Prokofieva, Figures 3 and 5.  “Extract gaze features 510” and “determine particular visual element 512.”  Further, the value of this object is used to complement the command.]
Piernot/Guday and Prokofieva pertain to digital assistants that respond to spoken commands, and both include eye/gaze tracking to provide context to the spoken command.  It would have been obvious to modify the system of Piernot/Guday, which uses eye tracking and detects which object on the screen the speaker is looking at, with the system of Prokofieva, which teaches that the object gazed at while issuing the spoken command is used to provide a more detailed level of context to the NLU system.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
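For illustration only, the gaze-based value discussed for Claim 4 may be sketched as follows.  The sketch assumes a gaze point expressed in screen coordinates and a rectangular on-screen assistant object; the names `Rect` and `gaze_value`, the step size, and the neutral treatment of a gaze that is on the display but off the object are assumptions and are not taken from Piernot, Guday, or Prokofieva.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Rect:
    """Axis-aligned screen rectangle in pixels (hypothetical layout type)."""
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        """Return True when the point lies inside or on the rectangle border."""
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


def gaze_value(gaze_point: Optional[Tuple[float, float]],
               assistant_object: Rect,
               step: float = 0.2) -> float:
    """Return a signed contribution to a likelihood score.

    gaze_point is an (x, y) screen coordinate, or None when the user is
    not looking at the display at all (assumed encoding).
    """
    if gaze_point is None:
        # Gaze directed away from the device: decrease the score.
        return -step
    px, py = gaze_point
    if assistant_object.contains(px, py):
        # Gaze directed at the displayed assistant object: increase the score.
        return step
    # Gaze on the display but not on the object: treated as neutral here.
    return 0.0
```

For example, `gaze_value((50, 50), Rect(0, 0, 100, 100))` yields a positive contribution, whereas `gaze_value(None, Rect(0, 0, 100, 100))` yields a negative one.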

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Piernot and Guday in view of George-Svahn (U.S. 20210256980).
Regarding Claim 5, Piernot teaches capturing a video image of the user:  “[0081] In some examples, the contextual information can include image data from camera subsystem 220 of user device 102. The image data can represent an image or video captured by camera subsystem 220….”  It does not teach detecting lip movements.
Neither does Guday.
George-Svahn teaches:
5. The device of claim 1, wherein obtaining a first plurality of values comprises: 
detecting a lip movement associated with the user; [George-Svahn, “[0033] The digital assistant 200 in FIG. 2 may also comprise a physical user 110 viewing device, in turn comprising a digital image sensor 204 arranged to depict an area 204a in front of the device 200. This image sensor 204 may be used for detecting lip movement of the user 210. However, it is realized that such lip movement may alternatively be detected by a sensor provided as a part of said 3D glasses.”]
determining whether the lip movement corresponds to the first speech input; and [George-Svahn, Figure 4, 417: “Map lip movement to sound.”]
obtaining a respective value of the first plurality values based on the determination. [George-Svahn, “[0042] This speech detection step 409-421 further comprises a lip movement detection step 416, in which the digital assistant 100, 200 detects a lip movement of said detected current speaker. Typically, this lip movement will be detected via image data captured by the above described sensors 104, 204, but may in practise be detected in any suitable way allowing the digital assistant 100, 200 to detect lip movement of the current speaker to reliably be able to, via digital data analysis, determine lip movement patterns or discreet lip movements being characteristic of different uttered speech sounds.”  See also the Abstract of this reference.]
Piernot/Guday and George-Svahn pertain to digital assistants that respond to spoken commands, and both include capturing video images of the speaker to provide context to the spoken command.  It would have been obvious to modify the system of Piernot/Guday with the more refined system of George-Svahn, which correlates discrete lip movements to sounds, in order to provide a more detailed level of context to the NLU system.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claims 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Piernot and Guday in view of Sharifi (U.S. 20220180868) (“Sharifi2022”).
Regarding Claim 9, Piernot as shown with respect to Claim 1 teaches calculating a confidence score associated with a particular intent based on the combination of speech and context values ([0038]-[0039]) but does not appear to keep recalculating the confidence for the same spoken command by adding more context values.
Guday does not mention confidence calculations.
Sharifi2022 teaches:
9. The device of claim 1, the one or more programs further comprising instructions for: 
in accordance with a determination that the first confidence level exceeds a first threshold confidence level: obtaining a second plurality of values; [Sharifi2022 in Figure 3 at 304a teaches that a first portion of the received audio is processed by the speech recognizer to generate intermediate results with confidence values that satisfy an “interpretation confidence threshold.”  Sharifi2022 also uses context / “second plurality of values” for its interpretation.  The context in Sharifi2022 comes from the previously spoken and interpreted speech and for new speech, context is lacking.  “[0032] … For instance, the interpreter 220 performs semantic interpretation (e.g., grammar interpretation) on a sequence of intermediate speech recognition results 212 to understand a portion of the utterance 20 and its context to identify any candidate sub-actions 26 that may be associated with a final action 24 to be specified once the query 22 is revealed when the user 10 finished speaking the utterance 20. Here, because the interpreter 220 is interpreting a sequence of intermediate speech recognition results 212 that corresponds to only a portion of the query 22, the interpreter 220 is able to derive the context of a sub-action 26 from the sequence of intermediate speech recognition results 212 corresponding to a portion of the utterance 20….”  “[0037] In some configurations, the interpreter 220 uses an interpretation model that generates a confidence level for a given interpretation 222. …  Here, when the interpreter 220 generates multiple possible interpretations 222 for a given sequence of intermediate speech recognition results 212 and with confidence levels satisfying an interpretation confidence threshold, the executor 230 may process respective sub-actions 26 characterized by the possible interpretations 222 in parallel. 
With the sub-actions 26 processing in parallel, the interface 200 may graphically display each parallel track on the display 116 and enable the user 10 to select a particular track, or even modify his or her utterance 20 to change the behavior of the interpreter 220 and/or executor 230.”]
obtaining, based on the first plurality of values and the second plurality of values, a second confidence level corresponding to the second speech input; [Sharifi2022 in Figure 3 at 304a and [0037] teaches that for each of the series of intermediate interpretations a confidence value is calculated that must satisfy an “interpretation confidence threshold.”  The “context” in Sharifi2022 is cumulative because the interpretation, and hence the confidence value, relies on all of the previous speech and keeps refining the context which teaches “based on the first plurality of values and the second plurality of values.”]
in accordance with a determination that the second confidence level exceeds a second threshold confidence level: continuing to receive the second speech input; and [Sharifi2022 in Figures 2A to 2H shows the series of steps taken by the assistant device as more and more speech comes in and the command is refined, and teaches “continuing to receive the second speech” as long as the intermediates are determined with sufficient confidence.] 
in accordance with a determination that the second confidence level does not exceed a second threshold confidence level: ceasing to receive the second speech input. [Sharifi2022 teaches that when the confidence level falls below the threshold, the user must make a selection and merely continuing to talk will not work as it did before.  “13…. determining, by the data processing hardware, a confidence score of second sub-action identified by performing the partial query interpretation on the second sequence of intermediate ASR results; and when the confidence score of the second sub-action fails to satisfy a confidence threshold, prompting, by the data processing hardware, the user to confirm whether the second sub-action is correctly identified.”]
Piernot/Guday and Sharifi2022 pertain to digital assistants that respond to spoken commands and it would have been obvious to modify the system of Piernot/Guday with the system of Sharifi2022 that expressly discusses continuous incoming speech that is being recognized as it comes in and that the ASR process uses the previous speech as context in aid of recognition and that stops if the confidence in the intent determined falls below a threshold.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
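For illustration only, the two-threshold flow recited in Claim 9 may be sketched as follows.  The sketch assumes that each "value" is a number between 0 and 1 and that a confidence level is the arithmetic mean of the values considered; the function name `decide`, the returned labels, and the use of a mean are assumptions and are not taken from the Claims or from any cited reference.

```python
from statistics import mean
from typing import Sequence


def decide(first_values: Sequence[float],
           second_values: Sequence[float],
           first_threshold: float,
           second_threshold: float) -> str:
    """Return which branch of the two-threshold flow is taken."""
    if not first_values:
        raise ValueError("first_values must not be empty")

    # First confidence level, obtained from the first plurality of values.
    first_confidence = mean(first_values)
    if first_confidence <= first_threshold:
        # First threshold not exceeded: the second stage is never reached.
        return "no_action"

    # Second confidence level, obtained from both pluralities of values.
    second_confidence = mean(list(first_values) + list(second_values))
    if second_confidence > second_threshold:
        return "continue_receiving"
    return "cease_receiving"
```

With `decide([0.9, 0.8], [0.7], 0.5, 0.6)` the second confidence level is 0.8 and receiving continues; with `decide([0.9, 0.8], [0.0, 0.0], 0.5, 0.6)` it is 0.425 and receiving ceases.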

Regarding Claim 10, Piernot teaches:
10. The device of claim 9, wherein obtaining a second plurality of values comprises: [this Claim limits the analysis of the second speech to a “predetermined duration.”]
identifying a user intent associated with a predetermined duration of the second speech input; [Piernot uses the duration of speech input as one indicator of whether the speech is a command directed at the PDA:  “[0106] In other examples, the speech recognition data from the ASR engine can further include an indication of the length (e.g., number of words, duration of speech, or the like) of the spoken user input. Generally, in some examples, a shorter length of the spoken user input can be indicative that the user was more likely to have intended for the current spoken user input to be directed at the virtual assistant, while a longer length of the spoken user input can be indicative that the user was less likely to have intended for the current spoken user input to be directed at the virtual assistant. However, in some examples, a longer length of the spoken user input can be indicative that the user was more likely to have intended for the current spoken user input to be directed at the virtual assistant, while a shorter length of the spoken user input can be indicative that the user was less likely to have intended for the current spoken user input to be directed at the virtual assistant.”]
determining, based on the user intent, whether the second speech input is directed to a digital assistant; and [Piernot uses the duration of speech input as one indicator of whether the speech is a command directed at the PDA:  “[0106]…  Generally, in some examples, a shorter length of the spoken user input can be indicative that the user was more likely to have intended for the current spoken user input to be directed at the virtual assistant, while a longer length of the spoken user input can be indicative that the user was less likely to have intended for the current spoken user input to be directed at the virtual assistant. ….”]
obtaining a respective value of the second plurality values based on the determination whether the second speech input is directed to a digital assistant. [Piernot modifies the confidence associated with the determined intent according to the length/duration of the speech input as one of the context values:  “[0108] In one example probabilistic system, the length of the spoken user input can be used to calculate a positive, negative, or neutral contribution to a final likelihood or confidence score, where the value of the contribution can have a linear or non-linear relationship with the value of the length of the spoken user input. For example, a length less than a threshold length can contribute a positive value to the final likelihood or confidence score, where the magnitude of the positive value can be greater for shorter lengths. Similarly, a length greater than or equal to the threshold distance can contribute a zero or negative value to the final likelihood or confidence score, where the magnitude of the negative value can be greater for longer lengths.”]
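For illustration only, the length-based contribution described in Piernot at [0108] may be sketched as follows.  Piernot states that the relationship may be linear or non-linear; the sketch assumes a linear one, and the function name `length_contribution` and the `scale` parameter are hypothetical.

```python
def length_contribution(length: float, threshold: float,
                        scale: float = 1.0) -> float:
    """Signed contribution of utterance length to a confidence score.

    Lengths below the threshold contribute a positive value that grows as
    the utterance gets shorter; a length equal to the threshold contributes
    zero; longer lengths contribute a negative value whose magnitude grows
    with the length.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return scale * (threshold - length) / threshold
```

For example, with a threshold of 10 words, a 5-word input contributes 0.5 and a 20-word input contributes -1.0.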

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Piernot and Guday and Sharifi2022 in view of Sharifi (U.S. 8,843,369).
Regarding Claim 11, this Claim compares the speaker profiles of the first and second speech and presumably moves on to analyzing context if the first and second speech are from the same speaker.
Piernot does not refer to speaker profiles.
Neither does Guday.
Neither does Sharifi2022.
Sharifi teaches:
11. The device of claim 9, wherein obtaining a second plurality of values comprises: 
retrieving a first speaker profile associated with the first speech input; [Sharifi, Figure 3, “generate a voice profile for the particular user … 320.”  Figure 1, “voice profile change detector 112.”]
obtaining a second speaker profile associated with the second speech input; [Sharifi, Figure 3, “receive audio data corresponding to an utterance spoken by a particular user 310” and “generate a voice profile for the particular user … 320.”  This voice profile is generated for each of a number of speakers.]
comparing the first speaker profile to the second speaker profile; and [Sharifi, Figure 3, “determine in the audio data beginning point or an ending point of the utterance based at least in part on the voice profile for the particular user 330.”  Figure 2, “Voice profile change detector 230” including a “comparer 232” and “voice profile specific speech activity detector 235” including a “comparer 237.”  “The comparer 232 compares the voice profile corresponding to the first audio frame to the voice profile corresponding to the second audio frame to determine if the speech of the two audio frames correspond to the same speaker.”  Col. 7, 30-35.  “The voice profile specific speech activity detector 235 can use the stored voice profile to compare subsequent voice profiles using the comparer 237.”  Col. 7, 60-63.]
obtaining a respective value of the second plurality values based on the comparison. [Sharifi, Figure 1, the bottom portion shows that depending on which speaker 127, 130 is detected to have spoken a command 133 or a follow up phrase 136, 139, the reminder is set at 124.  The “respective value” is the date of October 23rd which is obtained based on the comparison of the input voice with the voice profiles.]

Piernot/Guday/Sharifi2022 and Sharifi pertain to digital assistants that respond to spoken commands, and it would have been obvious to modify the system of the combination to include the multiple-speaker scenario of Sharifi in order to be able to determine the context of speech for the continuous speech system of Sharifi2022.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
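For illustration only, a comparison of two speaker profiles of the kind discussed for Claim 11 may be sketched as follows.  The passages of Sharifi quoted above do not specify how a voice profile is represented or compared; the sketch assumes each profile is a fixed-length numeric vector and uses cosine similarity with a hypothetical threshold.  The names `cosine_similarity` and `same_speaker` are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # An all-zero profile carries no information; treat as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def same_speaker(first_profile: Sequence[float],
                 second_profile: Sequence[float],
                 threshold: float = 0.8) -> bool:
    """Return True when the two profiles are similar enough to be treated
    as belonging to the same speaker (threshold is an assumed value)."""
    return cosine_similarity(first_profile, second_profile) >= threshold
```

The Boolean result could then serve as the "respective value" obtained "based on the comparison."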

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Piernot and Guday and Sharifi2022 in view of Rastoghi (U.S. 20200320988).
Regarding Claim 12, Piernot, Guday, and Sharifi2022 do not teach “lattice embedding” as a phase of speech recognition.
Rastoghi teaches:
12. The device of claim 9, wherein obtaining a second plurality of values comprises: 
obtaining, based on a speech recognition output, a lattice embedding; [Rastoghi performs speech recognition to obtain user intent for commands directed at an automated assistant and generates lattices which are then used as embeddings.  Figure 5, “generate utterance representations and slot features 558.”  “[0061] … For example, an n-best list and/or lattices generated by the voice to text module 114 may be applied to the features model 152 as a representation of tokens of the natural language input. A lattice is a graph that compactly represents multiple possible hypotheses for an utterance. Accordingly, the lattice represents possible tokens of the natural language input.”  “[0070] … The slot values engine 124 can apply embeddings of the slot descriptors, and tokens of an utterance (or embeddings thereof), to slot model 156 to determine which tokens correspond to which slots….”] 
determining a user intent based on the lattice embedding; and [Rastoghi. Figure 2A, “Agent/Domain Engine 122” determines the “Agent/Domain 173” which represents intent from the “tokens 172” which can be embeddings from the ASR lattice. “Candidates Module 132”, “[0067] … As used herein, a domain refers to an ontological categorization of a user's intent for a dialog, and describes the user's intent with less granularity than a dialog state….”]
obtaining a respective value of the second plurality values based on the user intent. [Rastoghi, Figure 2B, “responsive content 180” or “further system utterance 178” both generated by the “agent 140A.”  Rastoghi determines dialog state in a dialog between a user and an automated assistant and performs the action that corresponds to the determined dialog state.]
Piernot/Guday/Sharifi2022 and Rastoghi pertain to digital assistants that respond to spoken commands, and it would have been obvious to modify the system of the combination to include the aspects pertaining to the use of neural networks for speech recognition, which is a more recent and powerful algorithm.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Piernot, Guday, and Sharifi2022 in view of Friske (U.S. 20210365863).
Regarding Claim 13, Piernot teaches:
13. The device of claim 9, wherein obtaining a second plurality of values comprises: [this Claim updates the confidence measure by obtaining more of the context values after obtaining a “result” from the first round of NLU.]
identifying a result candidate based on the second speech input; [Piernot, Figure 4, at 308 a “result” is obtained as to whether or not the virtual assistant should respond to the user based on the contextual information.  In Figure 4, the step 308 is based on the “second speech input” at 304 where another spoken user input (first input) had been identified at 406.]
in response to identifying the result candidate: obtaining an updated first plurality of values and an updated second plurality of values; and 
obtaining, based on the updated first plurality of values and the updated second plurality of values, an updated second confidence level corresponding to the second speech input. [Piernot, as applied to Claim, teaches calculating the confidence level associated with a determination of intent at 308 of Figures 3 or 4.]
Piernot in Figure 4 shows the loop of going back after generating response at 312 but in Figure 4, Piernot waits for another spoken user input before it goes on to collect updated context values.
Guday does not teach use of context expressly.
Sharifi2022 teaches updating the context as new portions of speech are incoming.
Friske expressly teaches:
in response to identifying the result candidate: obtaining an updated first plurality of values and an updated second plurality of values; and [Friske teaches a continuous update of context by a “behavior tracking system” 304/404 in Figures 3 and 4 and the “behavior tracking component 608” and “model context tracking component 616” in Figure 6.  Figure 8, 804 shows that the virtual assistant is trying to determine the “action to perform.”  “… During (or after) the interaction, IMC can update behavior attributes, context, and/or aggregate propensity metric associated with the entity based on actions performed during the interaction.”  Abstract. This would include the “result candidate” of the Claim.]
Piernot/Guday/Sharifi2022 and Friske pertain to digital assistants that respond to spoken commands, and it would have been obvious to modify the system of the combination to include the express update of Friske, which is also present in Sharifi2022.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claims 14-18 are rejected under 35 U.S.C. 103 as being unpatentable over Piernot and Guday in view of Jitkoff (U.S. 20210256980).
Regarding Claim 14, Piernot does not teach interacting with icons on the screen.
Guday teaches that the interface changes according to the application that best fits the user inquiry and context.  Figure 5 shows the various modalities uploaded and shown on the GUI based on context data.  “… Various modalities can include adjustments to the graphical icons displayed on the user's device, such as the type, shape, color, size, orientation, and position of the icons. The digital assistant may track context data such as the user's location, upcoming schedule in the user's calendar, user interactions with the digital assistant, and the like to determine the best modality for the user. …”  Abstract.  Guday therefore teaches Claim 14 but for the use of confidence values and thresholds.
Jitkoff teaches:
14. The device of claim 1, wherein providing a first output corresponding to a digital assistant in a first state comprises displaying a digital assistant object in a first state, [Jitkoff, Figure 1A, 118 showing the display of a navigation program / first state and reading out the driving directions at the same time.  “[0037] At step D, output for the disambiguated command ("Go To [Geographic Location]") can be provided on the mobile computing device 102 (116). For example, a map 118 depicting driving directions from the current geographic location of the mobile computing device 102 to New York, N.Y., can be provided on the mobile computing device 102.”  “[0038] … It may be dangerous to only present the directions visually as the map 118 on the mobile computing device 102, as it will likely take a user's focus off the road. However, the driving directions could be provided audibly by the mobile computing device 102 in addition to (or instead of) the map 118.”]
the one or more programs further comprising instructions for: 
in accordance with a determination that the first confidence level does not exceed the first threshold confidence level: [Jitkoff, Figure 1A, “Identify User Input as Ambiguous 106” means that the confidence level in the intent determination is low.  Figure 3, “306: identify the user input as ambiguous.”  “12. The computer-implemented method of claim 1, wherein the ambiguous user input comprises voice input; the method further comprising causing speech recognition of the voice input to be performed, wherein the voice input is interpreted through the speech recognition to correspond to each of the plurality of commands with at least a threshold level of certainty.”  “15 … wherein the voice input, as received by the mobile computing device, is of sufficiently poor quality that the voice input is interpreted as corresponding with at least the threshold level of certainty to two or more commands with different pronunciations.”]
maintaining display of the digital assistant object in the first state; [Jitkoff, this corresponds to when no change is warranted.  (see also Guday or Sharifi2022).]
obtaining a second plurality of values associated with a predetermined duration of the second speech input; and [Jitkoff, Figure 3, “316: Determine a current context” adds more information to current information of the device.  “10. The computer-implemented method of claim 1, further comprising detecting one or more ambient sounds at a time when the ambiguous user input was received or within a threshold amount of time of receiving the ambiguous user input; wherein the current context associated with the mobile computing device is determined based on, at least, the detected one or more ambient sounds.”]
obtaining, based on the first plurality of values and the second plurality of values, a second confidence level corresponding to the second speech input. [Jitkoff, Figure 3, “318: Disambiguate the user input based on the current context” when the current context is obtained from information at 308, 312, 314, and “within a threshold amount of time of receiving” the first user input.]
Piernot/Guday and Jitkoff pertain to digital assistants that respond to spoken commands, and it would have been obvious to modify the system of the combination to include the disambiguation of Jitkoff.  This combination falls under simple substitution of one known element for another to obtain predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 15, display is not discussed in Piernot.
Guday teaches the change of display.  See Figure 5 of Guday, e.g.
Jitkoff teaches:
15. The device of claim 14, the one or more programs further comprising instructions for: 
in accordance with a determination that the second confidence level exceeds a second threshold confidence level: 
displaying the digital assistant object in a second state; and [Jitkoff, Figure 3, “cause output associated with the selected command to be provided  326.”  As shown in Figure 1A and described at [0038], the output includes a displayed item and a voice output.]
while displaying the digital assistant object in the second state, continuing to receive the second speech input. [Jitkoff, Figure 3, “receive second user input 328”, and in general the user does not need to stop.]
Rationale for combination as provided for Claim 14.  Guday teaches the set of Claims pertaining to change of GUI except Guday does not include the confidence condition.

Regarding Claim 16, “affordance” means icon or object shown on the screen.  Piernot does not teach interacting with icons on the screen.
Guday teaches that the icons and GUI shown is associated with user input and user context such as user’s location or the user being at the airport.  See Guday, Abstract.  Guday does not discuss confidence values.
Jitkoff teaches:
16. The device of claim 14, the one or more programs further comprising instructions for:
displaying an affordance, wherein the affordance is associated with contextual information; and [Jitkoff, Figure 1A and Figure 1B both show that the disambiguated command is also displayed on the screen (174, 186) with graphics/affordance.  The disambiguation is based on context and therefore the “affordance is associated with contextual information.”]
in accordance with a determination that the second speech input is associated with the contextual information, increasing the second confidence level. [Jitkoff, Figure 3, at 328 receives a second user input which can be a confirmation and thus increases the confidence.  “[0077] In some implementations, second user input can be received (step 328) and a determination can be made as to whether the ambiguous user input was correctly disambiguated based on the received second input (step 330)….. However, if the second input further interacts with the output provided based on the disambiguation (e.g., zoom in on a portion of the driving directions provided for the command "Go To New York, New York"), the mobile computing device 202 may determine that the ambiguous user input was correctly disambiguated. Based on the determination of whether the ambiguous user input was correctly disambiguated, user behavior data can be updated (step 332)….”]
Rationale for combination as provided for Claim 14.  Guday teaches the set of Claims pertaining to change of GUI except Guday does not include the confidence condition.

Regarding Claim 17, Piernot teaches:
17. The device of claim 16, 
wherein the contextual information includes a first semantic representation, [Piernot teaches that the meaning (semantic representation) of conversation history provides context:  “[0063] In other examples, a semantic similarity analysis can be performed on the current spoken user input and some or all of the conversation history data….”]
the one or more programs further comprising instructions for: 
obtaining a second semantic representation associated with the second speech input; and [Piernot finds the semantic distance between what was said before and what is being said now which requires determining the “semantic representation” of the current speech:  “[0063] … In these examples, a semantic distance between the current spoken user input and one or more of the previously received spoken user inputs or responses generated and provided to the user by the user device can be determined and used to determine the likelihood or confidence score that the spoken user input was intend for the virtual assistant at block 308. ….”]
in accordance with a determination that the first semantic representation corresponds to the second semantic representation, increasing the second confidence level. [Piernot teaches that a smaller semantic distance corresponds to more confidence in what is being said:  “[0063] … In these examples, a small semantic distance between the current spoken user input and one or more of the previously received spoken user inputs (e.g., the immediately preceding spoken user input) and/or one or more of the responses generated and provided to the user by the user device can be indicative that the user was more likely to have intended for the current spoken user input to be directed at the virtual assistant, while a large semantic distance between the current spoken user input and one or more of the previously received spoken user inputs (e.g., the immediately preceding spoken user input) and/or one or more of the responses generated and provided to the user by the user device can be indicative that the user was less likely to have intended for the current spoken user input to be directed at the virtual assistant.”] 
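For illustration only, the relationship between semantic distance and confidence described in Piernot at [0063] may be sketched as follows.  Piernot does not specify a distance measure in the quoted passage; the sketch substitutes a Jaccard distance over word sets as a crude stand-in for a true semantic distance.  The names `jaccard_distance` and `adjust_confidence`, the `near` cutoff, and the step size are assumptions.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Word-overlap distance in [0, 1]; 0 means identical word sets.

    Used here only as a simple stand-in for a semantic distance.
    """
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def adjust_confidence(confidence: float, current_input: str,
                      previous_input: str, near: float = 0.5,
                      step: float = 0.1) -> float:
    """Raise the confidence for a small distance, lower it for a large one,
    keeping the result within [0, 1]."""
    distance = jaccard_distance(current_input, previous_input)
    if distance < near:
        return min(1.0, confidence + step)
    return max(0.0, confidence - step)
```

For example, "play jazz music" following "play some jazz music" has a distance of 0.25 and raises the confidence, whereas "what time is it" following "play jazz music" has a distance of 1.0 and lowers it.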

Regarding Claim 18, Piernot teaches:
18. The device of claim 16, 
wherein the contextual information includes at least one predefined word, [Piernot teaches looking for a keyword or key phrase as a trigger:  “[0044] Using process 300, a virtual assistant implemented by a user device can selectively ignore or respond to spoken user inputs in a way that allows a user to speak to the virtual assistant in natural language without having to manually enter a start-point identifier, such as by pressing a physical or virtual button before speaking to the virtual assistant or by uttering a specific trigger phrase (e.g., a predetermined word or sequence of words, such as “Hey Siri”) before speaking to the virtual assistant in natural language. In some examples, process 300 can be used to process all spoken user inputs received by user device 102.”]
the one or more programs further comprising instructions for: 
identifying at least one word included in the second speech input; and [Piernot describes embodiments where context is used to do away with the need for a trigger phrase but of course a trigger phrase would be the ultimate context that indicates an incoming command:  “[0047] In other examples, user device 102 can require that a start-point identifier be manually entered by the user prior to process 300 being invoked. For example, a user can be required to utter a trigger phrase or press a physical or virtual button before first speaking to the virtual assistant….”  “[0048] At block 402, a start-point identifier can be received. The start-point identifier can include a trigger phrase spoken by the user ….”]
in accordance with a determination that the at least one predefined word corresponds to the at least one identified word, increasing the second confidence level. [In Piernot, a wake-up trigger phrase such as “Hey Siri” is the ultimate context that increases the confidence of an ensuing command to 100%.  However, Piernot also teaches that information from the user’s contact list could be used as context and, as provided in the rejection of Claim 1, context is used in Piernot to modify the “confidence” level.  See: “[0112] In some examples, the contextual information can include user data from memory 250 or another storage device located within or remote from user device 102. The user data can include any type of information associated with the user, such as a contact list, calendar, preferences, personal information, financial information, family information, or the like. In some examples, the user data can be compared with other types of contextual information at block 308 to assist in the determination of whether or not the spoken user input was intend for the virtual assistant. For example, the time that the spoken user input was received can be compared with the user's calendar to determine if the user was at an event in which the user was more or less likely to be conversing with the virtual assistant of the user device, the speech recognition data from the ASR engine can be compared with contacts in the user's contact list to determine if the a name from the user's contact list was mentioned in the spoken user input, the speech recognition data from the ASR engine can be compared with user preferences to determine if the spoken user input corresponds to a previously defined phrase that should or should not be ignored by the virtual assistant, or the like.”]
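For illustration only, the comparison described in [0112] of Piernot between recognized words and stored user data (e.g., names in a contact list) can be sketched as follows. The function name, the 0.3 boost, and the simple punctuation stripping are assumptions made for this sketch and do not come from Piernot or from the Application.

```python
def adjust_for_predefined_words(confidence, utterance, predefined_words, boost=0.3):
    # Identify the words in the speech input (assumed: whitespace tokens,
    # lowercased, with trailing punctuation stripped).
    spoken = {w.strip(".,!?").lower() for w in utterance.split()}
    targets = {w.lower() for w in predefined_words}
    # If any identified word matches a predefined word, raise the
    # confidence, capped at 1.0 (hypothetical boost value).
    if spoken & targets:
        confidence = min(1.0, confidence + boost)
    return confidence


contacts = ["Alice", "Bob"]
matched = adjust_for_predefined_words(0.4, "Call Alice now", contacts)
unmatched = adjust_for_predefined_words(0.4, "nice weather today", contacts)
capped = adjust_for_predefined_words(0.9, "call Bob", contacts)
```

Here the input that mentions a contact name receives a higher confidence, while the input with no predefined word is left unchanged.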
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Sharifi (U.S. 20210406260)
Choi (U.S. 20200142505): 
[0053] For example, the electronic device may perform a function according to the user's specific action (e.g., approach to the electronic device). Upon recognizing the user's specific action, the electronic device may determine that the user has the intent to dialog with the electronic device and switch the inactivated state (e.g., sleep state) into the activated state (e.g., wake-up, listening, and speaking state).
[0054] As another example, the electronic device may perform the function (e.g., play music) according to the user's specific continuous actions. Upon recognizing the user's second action (e.g., voice command “Play music”) subsequent to the user's first action (e.g., approach), the electronic device may perform a function (e.g., play music) in response thereto.
[0055] In the intelligent interaction as shown in FIG. 1, the user's action may be an intentional action for performing the function. The same function may be performed according to the user's specific intentional action. For example, there are possible functions such as the function of waking up a smartphone in response to a wake-up word (e.g., “Hi Bixby,” “OK Google,” or “Hey, Siri”), the function of turning on the screen of the smartphone when the user lifts an arm wearing a smartwatch, and the function of unlocking a smartphone when the user gazes at the smartphone.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659