DETAILED ACTION
This action is in response to the initial filing of Application No. 17/112,227 on 08/21/2021.
Claims 1 – 20 are still pending in this application, with claims 1, 5 and 13 being independent.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 – 4 are rejected under 35 U.S.C. 103 as being unpatentable over Freed et al. (US 9,368,105) (“Freed”) in view of Bansal et al. (US 2020/0333875) (“Bansal”).
For claim 1, Freed discloses a computer-implemented method (Abstract) comprising: receiving, from a user device operating in a first mode (e.g. slumber mode), first input audio data representing a first utterance initiated by a wakeword (Fig.4, 402 and Fig.6, 624; column 7 lines 41 – 53); processing the first utterance to determine a command to operate the user device in a second mode (e.g. default operating mode) (column 7 lines 53 – 64); sending, to the user device, a command to operate in the second mode (column 7 lines 64 – 66); and beginning operation in the second mode (column 8 lines 1 – 13). Yet, Freed fails to teach the following: using at least one microphone, generating second input audio data representing a second utterance, the second utterance spoken by a user; using at least one camera, generating image data representing the user, the image data indicating that the user is looking at the user device while speaking the second utterance; using the image data to determine that the second utterance is directed at the user device; and in response to using the image data to determine that the second utterance is directed at the user device, sending the second input audio data for speech processing to be performed.
However, Bansal discloses a method for detecting an instruction from a user (Abstract) comprising the following: using at least one microphone (Fig.4, 402 and Fig.6, 608; [0092] [0093] [0107] [0109]), generating input audio data representing an utterance, the utterance spoken by a user (During the ongoing conversation between the user and the virtual assistant, the user device captures an audio input … the audio input may include audio information such as words or sentences spoken by the user and captured by the user device, Fig.19A, 1901 and 1905 and Fig.19B, 1909; [0093] [0186] [0187]); using at least one camera (Fig.4, 402 and Fig.6, 608; [0092] [0093] [0107] [0109]), generating image data representing the user (During the ongoing conversation between the user and the virtual assistant, the user device captures … video input, The video input includes … a face of the user and features extracted from the gesture or face of the user, Fig.19A, 1901 and 1905 and Fig.19B, 1909 and 1911; [0093] [0186] [0187]), the image data indicating that the user is looking at the user device while speaking the utterance (For instance, when a user is pointing towards or looking at the user device 402 while speaking, it is detected that the user intends to interrupt the on-going conversation with the virtual assistant, [0093] [0095 – 0097] [0112] [0117 – 0122] [0186] [0187]); using the image data to determine that the utterance is directed at the user device ([0117 – 0122] [0137] [0138] [0142 – 0145] [0186]); and in response to using the image data to determine that the utterance is directed at the user device, detecting that the utterance is an instruction ([0003] [0004] [0010] [0013]).
Additionally, Freed discloses performing speech processing of audio input both locally at a user device and remotely at a server (column 3 lines 17 – 32, column 4 lines 1 – 22, column 5 line 55 – column 6 line 36, column 7 lines 40 – 55; column 9 lines 43 – 52).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve Freed’s invention in the same way that Bansal’s invention has been improved to achieve the following predictable results for the purpose of providing an efficient virtual assistant system that intelligently detects interruptions and commands provided by a user without misinterpreting the user’s speech (Bansal, [0003 – 0007]): in the second (default) operating mode, further generating second input audio data representing a second utterance using at least one microphone, the second utterance spoken by a user; in the second (default) operating mode, further generating image data representing the user using at least one camera, the image data indicating that the user is looking at the user device while speaking the second utterance; using the image data to determine that the second utterance is directed at the user device; and in response to using the image data to determine that the second utterance is directed at the user device, detecting that the utterance is an instruction by sending the second input audio data for speech processing to be performed, wherein the speech processing functionality (Bansal, Fig.7, 706 and 708; [0095] [0112 – 0114]) is located at a remote server.
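For illustration only, the combined operation mapped above can be sketched as follows. The sketch is not taken from Freed or Bansal. The names Mode, Device, handle_wakeword_utterance and handle_utterance, the command string, and the boolean user_is_looking (standing in for the result of analyzing the image data) are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List


class Mode(Enum):
    FIRST = auto()   # e.g. slumber mode: an utterance must begin with a wakeword
    SECOND = auto()  # e.g. default operating mode: image data gates speech processing


@dataclass
class Device:
    mode: Mode = Mode.FIRST
    # Hypothetical record of audio forwarded for speech processing.
    sent_for_speech_processing: List[bytes] = field(default_factory=list)

    def handle_wakeword_utterance(self, command: str) -> None:
        # Assumed stand-in for receiving a mode-change command determined
        # from the first (wakeword-initiated) utterance.
        if command == "operate_in_second_mode":
            self.mode = Mode.SECOND

    def handle_utterance(self, audio: bytes, user_is_looking: bool) -> bool:
        """Return True if the audio was sent for speech processing."""
        if self.mode is not Mode.SECOND:
            return False  # still in the first mode
        if not user_is_looking:
            return False  # utterance treated as not directed at the device
        self.sent_for_speech_processing.append(audio)
        return True
```

Under these assumptions, an utterance received in the second mode is forwarded only when the image-derived flag indicates the user is looking at the device; otherwise the audio is not forwarded.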

For claim 2, Bansal further discloses wherein using the image data to determine that the second utterance is directed at the user device comprises: processing the image data using a first feature extractor to determine first feature data corresponding to the image data (Bansal, the attention detection model uses two-level feature extraction, [0120] [0121]); using the first feature data and at least a first classifier (Bansal, confidence score calculator module) to determine output data (Bansal, The confidence score calculator module compares the calculated confidence score with a predetermined threshold score. When the calculated confidence score exceed the predetermined threshold confidence score, the confidence score calculator module determines that the at least one of the audio input and the video input is an interrupt by the user in the ongoing conversation between the user and the virtual assistant, [0137] [0138] [0142 – 0145]); and determining the output data indicates that the second utterance is directed at the user device (Bansal, [0145]).
For claim 3, Bansal further discloses: determining first weight data corresponding to the image data (Bansal, [0142] [0143]); processing the first feature data using the first weight data to determine first weighted feature data (Bansal, [0137] [0138] [0142] [0143]); processing the second input audio data using a second feature extractor to determine second feature data corresponding to the second utterance (Bansal, [0115] [0124 – 0131]); determining second weight data corresponding to the second input audio data (Bansal, [0142] [0143]); processing the second feature data using the second weight data to determine second weighted feature data (Bansal, [0137] [0138] [0142] [0143]); and processing the first weighted feature data and the second weighted feature data using the first classifier to determine the output data (Bansal, [0144] [0145]).
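As a minimal sketch of the weighting and classification steps recited in claim 3, and not a reproduction of Bansal’s confidence score calculator module, the following assumes that feature data are lists of numbers, that each weight is a single scalar, and that the classifier is a simple mean-score threshold. The function names weighted and directed_at_device, and the default threshold value, are hypothetical.

```python
from typing import List, Sequence


def weighted(features: Sequence[float], weight: float) -> List[float]:
    # Apply one scalar weight to every feature of a modality (assumed scheme).
    return [weight * f for f in features]


def directed_at_device(
    image_features: Sequence[float],
    audio_features: Sequence[float],
    image_weight: float,
    audio_weight: float,
    threshold: float = 0.5,
) -> bool:
    # Combine the weighted image features and weighted audio features.
    fused = weighted(image_features, image_weight) + weighted(audio_features, audio_weight)
    if not fused:
        return False  # no feature data, so no basis for a positive determination
    # Stand-in classifier: compare the mean fused score with a threshold.
    score = sum(fused) / len(fused)
    return score > threshold
```

In this sketch the output data is a single boolean; a higher image weight makes the image-derived features dominate the determination, which is one plausible reading of how per-modality weights could be used.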
For claim 4, Freed and Bansal further disclose: using at least one microphone (Bansal, Fig.4, 402 and Fig.6, 608; [0092] [0093] [0107] [0109]), generating third input audio data representing a third utterance, the second utterance spoken by a user (Freed, column 8 lines 1 – 13) (Bansal, During the ongoing conversation between the user and the virtual assistant, the user device captures an audio input … the audio input may include audio information such as words or sentences spoken by the user and captured by the user device, Fig.19B, 1909; [0093] [0186] [0187]); using at least one camera (Bansal, Fig.4, 402 and Fig.6, 608; [0092] [0093] [0107] [0109]), generating second image data representing the user (Bansal, During the ongoing conversation between the user and the virtual assistant, the user device captures … video input, The video input includes … a face of the user and features extracted from the gesture or face of the user, Fig.19A, 1901 and 1905, Fig.19B, 1909 and 1911; [0093] [0186] [0187]), the second image data indicating that the user is looking at the user device while speaking the second utterance (Bansal, During the ongoing conversation between the user and the virtual assistant, the user device captures … video input, The video input includes … a face of the user and features extracted from the gesture or face of the user, Fig.19A, 1901 and 1905 and Fig.19B, 1909 and 1911; [0093] [0186] [0187]); using the second image data to determine that the third utterance is not directed at the user device (Bansal, [0117 – 0122] [0137] [0138] [0142 – 0145] [0187]); and in response to using the second image data to determine that the third utterance is not directed at the user device, refraining from performing speech processing using the third input audio data (Bansal, audio input is not detected as an instruction to determine and execute a task, [0145] [0187]).

Claims 5 – 20 are rejected under 35 U.S.C. 103 as being unpatentable over Bansal et al. (US 2020/0333875) (“Bansal”) in view of Freed et al. (US 9,368,105) (“Freed”).
For claim 5, Bansal discloses a computer-implemented method (Abstract), comprising: receiving first data corresponding to at least one image representing a user (During the ongoing conversation between the user and the virtual assistant, the user device captures … video input, The video input includes … a face of the user and features extracted from the gesture or face of the user, Fig.19A, 1901 and 1905 and Fig.19B, 1909 and 1911; [0093] [0186] [0187]); receiving audio data corresponding to an utterance spoken by the user (During the ongoing conversation between the user and the virtual assistant, the user device captures an audio input … the audio input may include audio information such as words or sentences spoken by the user and captured by the user device, Fig.19A, 1901 and 1905 and Fig.19B, 1909; [0093] [0186] [0187]); processing the first data to determine that the utterance is directed at a device (For instance, when a user is pointing towards or looking at the user device 402 while speaking, it is detected that the user intends to interrupt the on-going conversation with the virtual assistant, [0093] [0095 – 0097] [0112] [0117 – 0122] [0137] [0138] [0142 – 0145] [0186]); and in response to processing the first data to determine that the utterance is directed at the device, detecting that the utterance is an instruction ([0003] [0004] [0010] [0013]). Yet, Bansal fails to teach that the detecting that the utterance is an instruction comprises performing speech processing using the audio data.
However, Freed discloses a speech processing system and method (Abstract), wherein detecting that an utterance is an instruction comprises performing speech processing of the utterance (column 3 lines 17 – 32, column 4 lines 1 – 22, column 5 line 55 – column 6 line 36, column 7 lines 40 – 55; column 9 lines 43 – 52).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify Bansal’s teachings with Freed’s teachings for the purpose of improving the user experience in interacting with natural language control devices (Freed, column 1 lines 7 – 35) so that the detecting that the utterance is an instruction further comprises performing speech processing using the audio data.

For claim 6, Bansal further discloses processing the first data to determine the user is looking at the device (Bansal, [0093] [0095 – 0097] [0112] [0117 – 0122]).

For claim 7, Bansal further discloses: receiving image data representing the at least one image (Bansal, [0117]); and processing the image data using a first component to determine first feature data corresponding to the image, wherein the first data includes feature data (Bansal, the attention detection model uses two-level feature extraction, [0120] [0121]), wherein processing the first data to determine that the utterance is directed at a device comprises processing the feature data using at least one classifier (Bansal, The confidence score calculator module compares the calculated confidence score with a predetermined threshold score. When the calculated confidence score exceed the predetermined threshold confidence score, the confidence score calculator module determines that the at least one of the audio input and the video input is an interrupt by the user in the ongoing conversation between the user and the virtual assistant, [0137] [0138] [0142 – 0145]).

For claim 8, Bansal and Freed further disclose, prior to processing the first data to determine that the utterance is directed at a device: receiving a command to alter operation of a wakeword detection component (Bansal, During the ongoing conversation between the user and the virtual assistant, the user device captures an audio input … the audio input may include audio information such as words or sentences spoken by the user and captured by the user device, Fig.19A, 1901 and 1905 and Fig.19B, 1909; [0093] [0186] [0187]) (Freed, Fig.4, 402 and Fig.6, 624; column 7 lines 41 – 64).

For claim 9, Bansal and Freed further disclose wherein causing speech processing to be performed using the audio data comprises sending the audio data to at least one remote device for the speech processing (Bansal, an instruction and related task is determined, [0003] [0004] [0010] [0013]) (Freed, speech processing to determine an instruction and related task occurs locally or remotely, column 3 lines 17 – 32, column 4 lines 1 – 22, column 5 line 55 – column 6 line 36, column 7 lines 40 – 55; column 9 lines 43 – 52).

For claim 10, Bansal further discloses: processing the audio data to determine feature data corresponding to the utterance (Bansal, [0115] [0124 – 0131]), wherein processing the first data to determine that the utterance is directed at a device comprises processing the first data and the feature data using at least one classifier (Bansal, [0137] [0138] [0142 – 0145]).

For claim 11, Bansal further discloses: determining a first weight corresponding to the first data (Bansal, [0142] [0143]); determining a second weight corresponding to the audio data (Bansal, [0142] [0143]); and further processing the first weight, the second weight, and the audio data to determine the utterance is directed at the device (Bansal, [0137] [0138] [0142 – 0145]).

For claim 12, Bansal further discloses: receiving second data corresponding to at least one second image representing the user (Bansal, During the ongoing conversation between the user and the virtual assistant, the user device captures … video input, The video input includes … a face of the user and features extracted from the gesture or face of the user, Fig.19A, 1901 and 1905, Fig.19B, 1909 and 1911; [0093] [0186] [0187]); receiving second audio data corresponding to a second utterance spoken by the user (Bansal, During the ongoing conversation between the user and the virtual assistant, the user device captures an audio input … the audio input may include audio information such as words or sentences spoken by the user and captured by the user device, Fig.19B, 1909; [0093] [0186] [0187]); processing the second data to determine that the second utterance was not directed at the device (Bansal, [0117 – 0122] [0137] [0138] [0142 – 0145] [0187]); and in response to processing the second data to determine that the second utterance was not directed at the device, refraining from at least some processing performed when an utterance is directed at the device (Bansal, audio input is not detected as an instruction to determine and execute a task, [0145] [0187]).

For claim 13, Bansal discloses a computing system (Abstract; Fig.5), comprising: at least one processor (the modules, interfaces and query generator may be implemented as at least one hardware processor, [0102]), wherein the computing system: receives first data corresponding to at least one image representing a user (During the ongoing conversation between the user and the virtual assistant, the user device captures … video input, The video input includes … a face of the user and features extracted from the gesture or face of the user, Fig.19A, 1901 and 1905 and Fig.19B, 1909 and 1911; [0093] [0186] [0187]); receives audio data corresponding to an utterance spoken by the user (During the ongoing conversation between the user and the virtual assistant, the user device captures an audio input … the audio input may include audio information such as words or sentences spoken by the user and captured by the user device, Fig.19A, 1901 and 1905 and Fig.19B, 1909; [0093] [0186] [0187]); processes the first data to determine that the utterance is directed at a device (For instance, when a user is pointing towards or looking at the user device 402 while speaking, it is detected that the user intends to interrupt the on-going conversation with the virtual assistant, [0093] [0095 – 0097] [0112] [0117 – 0122] [0137] [0138] [0142 – 0145] [0186]); and in response to processing the first data to determine that the utterance is directed at the device, detects that the utterance is an instruction ([0003] [0004] [0010] [0013]). Yet, Bansal fails to teach: at least one memory comprising instructions that, when executed by the at least one processor, cause the computing system to perform the method; and the detecting that the utterance is an instruction comprises performing speech processing using the audio data.
However, Freed discloses a speech processing system and method (Abstract), wherein modules such as instructions, datastores and so forth may be stored within a memory and configured to execute on the processor to perform a method (column 3 lines 33 – 67); and detecting that an utterance is an instruction further comprising performing speech processing of the utterance (column 3 lines 17 – 32, column 4 lines 1 – 22, column 5 line 55 – column 6 line 36, column 7 lines 40 – 55; column 9 lines 43 – 52).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify Bansal’s teachings with Freed’s teachings for the purpose of improving the user experience in interacting with natural language control devices (Freed, column 1 lines 7 – 35) so that the system further comprises at least one memory comprising instructions that, when executed by the at least one processor, cause the computing system to perform the method; and the detecting that the utterance is an instruction further comprises performing speech processing using the audio data.

For claim 14, Bansal and Freed further disclose wherein the at least one memory further comprises instructions that, when executed by the at least one processor (Freed, column 3 lines 33 – 67), further cause the computing system to: process the first data to determine the user is looking at the device (Bansal, [0093] [0095 – 0097] [0112] [0117 – 0122]).

For claim 15, Bansal and Freed further disclose wherein the at least one memory further comprises instructions that, when executed by the at least one processor (Freed, column 3 lines 33 – 67), further cause the computing system to: receive image data representing the at least one image (Bansal, [0117]); and process the image data using a first component to determine first feature data corresponding to the image, wherein the first data includes feature data (Bansal, the attention detection model uses two-level feature extraction, [0120] [0121]), wherein the instructions that cause the computing system to process the first data to determine that the utterance is directed at a device comprise instructions that, when executed by the at least one processor, further cause the computing system to process the feature data using at least one classifier (Bansal, The confidence score calculator module compares the calculated confidence score with a predetermined threshold score. When the calculated confidence score exceed the predetermined threshold confidence score, the confidence score calculator module determines that the at least one of the audio input and the video input is an interrupt by the user in the ongoing conversation between the user and the virtual assistant, [0137] [0138] [0142 – 0145]).

For claim 16, Bansal and Freed further disclose wherein the at least one memory further comprises instructions that, when executed by the at least one processor (Freed, column 3 lines 33 – 67), further cause the computing system to, prior to processing the first data to determine that the utterance is directed at a device: receive a command to alter operation of a wakeword detection component (Bansal, During the ongoing conversation between the user and the virtual assistant, the user device captures an audio input … the audio input may include audio information such as words or sentences spoken by the user and captured by the user device, Fig.19A, 1901 and 1905 and Fig.19B, 1909; [0093] [0186] [0187]) (Freed, Fig.4, 402 and Fig.6, 624; column 7 lines 41 – 64).

For claim 17, Bansal and Freed further disclose wherein the instructions that cause the computing system to cause speech processing to be performed using the audio data comprise instructions that, when executed by the at least one processor, further cause the computing system to send the audio data to at least one remote device for the speech processing (Bansal, an instruction and related task is determined, [0003] [0004] [0010] [0013]) (Freed, speech processing to determine an instruction and related task occurs locally or remotely, column 3 lines 17 – 32, column 4 lines 1 – 22, column 5 line 55 – column 6 line 36, column 7 lines 40 – 55; column 9 lines 43 – 52).

For claim 18, Bansal and Freed further disclose wherein the at least one memory further comprises instructions that, when executed by the at least one processor (Freed, column 3 lines 33 – 67), further cause the computing system to: process the audio data to determine feature data corresponding to the utterance (Bansal, [0115] [0124 – 0131]), wherein the instructions that cause the computing system to process the first data to determine that the utterance is directed at a device comprise instructions that, when executed by the at least one processor, cause the computing system to process the first data and the feature data using at least one classifier (Bansal, [0137] [0138] [0142 – 0145]).

For claim 19, Bansal and Freed further disclose, wherein the at least one memory further comprises instructions that, when executed by the at least one processor (Freed, column 3 lines 33 – 67), further cause the computing system to: determine a first weight corresponding to the first data (Bansal, [0142] [0143]); determine a second weight corresponding to the audio data (Bansal, [0142] [0143]); and further process the first weight, the second weight, and the audio data to determine the utterance is directed at the device (Bansal, [0137] [0138] [0142- 0145]).

For claim 20, Bansal and Freed further disclose, wherein the at least one memory further comprises instructions that, when executed by the at least one processor (Freed, column 3 lines 33 – 67), further cause the computing system to: receive second data corresponding to at least one second image representing the user (Bansal, During the ongoing conversation between the user and the virtual assistant, the user device captures … video input, The video input includes … a face of the user and features extracted from the gesture or face of the user, Fig.19A, 1901 and 1905, Fig.19B, 1909 and 1911; [0093] [0186] [0187]); receive second audio data corresponding to a second utterance spoken by the user (Bansal, During the ongoing conversation between the user and the virtual assistant, the user device captures an audio input …  the audio input may include audio information such as words or sentences spoken by the user and captured by the user device, Fig.19B, 1909; [0093] [0186] [0187]); process the second data to determine that the second utterance was not directed at the device (Bansal, [0117 – 0122] [0137] [0138] [0142 – 0145] [0187]); and in response to processing the second data to determine that the second utterance was not directed at the device, refraining from at least some processing performed when an utterance is directed at the device (Bansal, audio input is not detected as an instruction to determine and execute a task, [0145] [0187]).  

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SONIA L GAY whose telephone number is (571)270-1951. The examiner can normally be reached Monday-Friday 9-5 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SONIA L GAY/            Primary Examiner, Art Unit 2657