DETAILED ACTION
This Office Action is in response to Applicant’s arguments filed in the reply on 6/15/2022. Claims 1, 8, and 15 were amended. Claims 2 and 16 were cancelled. As such, claims 1, 3-15, and 17-21 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
Claim 15, line 8 is objected to because of the following informalities:
".  
Appropriate correction is required.

Response to Arguments

Applicant’s arguments filed in the Amendment filed 6/15/2022 (herein “Amendment”) with respect to the 35 USC §102 rejection against claims 1, 8, and 15 have been fully considered, but they are not persuasive. 
On page 8 of the Amendment, Applicant argues that “Claim 1 has been amended to recite that the first audio clip is stored on the memory of a client device,” and that claim 1 also recites the “determined” and “rendered for presentation” steps as being performed “by a processor of the client device,” thus distinguishing over Zhang which has “two devices sending audio back and forth to one another.”
However, claim 1 actually recites "a memory configured to store a plurality of instructions .... determine an initial response to the request, the initial response corresponding to a first audio clip, … render the first audio clip for presentation prior to receiving the primary response, the primary response corresponding to a second audio clip." Thus, contrary to Applicant's contentions, claim 1 does not require that the memory be part of the client device; indeed, claim 1 does not even recite "a memory of the client device." Therefore, Zhang's terminal and server system teaches the storage of audio clips as claimed.
Similarly, claim 15 also does not require the memory to be "of the client device" and therefore, Zhang also teaches the limitations of claim 15 for which it is relied upon.
It is noted that, in contrast to claims 1 and 15, claim 8 does require that the memory be of a client device. For claim 8, Applicant's remarks and amendments were persuasive, necessitating the new grounds of rejection in reliance upon Eide et al. (US20080167874A1) set forth herein.
Therefore, while all of Applicant’s arguments and amendments filed in the Amendment have been fully considered, they are not persuasive with respect to claims 1 and 15. Please see below for more detail, including updated citations and obviousness rationale.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 7, 15, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Labsky (US20140058732A1) in view of Zhang (US20210256057A1).

Labsky and Zhang were applied in the previous Office Action.
Note: The functional elements of claim 1 are similar to those of claim 15; therefore, claims 1 and 15 are mapped together.
Regarding claims 1 and 15, Labsky teaches a system and method comprising: a processor of a client device; a memory configured to store a plurality of instructions, which, when executed, cause the processor to: (Labsky, Par. 0012:” One such embodiment comprises a computer program product that has a computer-storage medium [e.g., a non-transitory, tangible, computer-readable media, disparately located or commonly located storage media, computer storage media or medium, etc.] including computer program logic encoded thereon that, when performed in a computerized device having a processor and corresponding memory, programs the processor to perform [or causes the processor to perform] the operations disclosed herein. Such arrangements are typically provided as software, firmware, microcode, code data [e.g., data structures], etc., arranged or encoded on a computer readable storage medium such as an optical medium [e.g., CD-ROM], floppy disk, hard disk, one or more ROM or RAM or PROM chips, an Application Specific Integrated Circuit [ASIC], a field-programmable gate array [FPGA], and so on.”, and Par. 0016:” Also, it is to be understood that each of the systems, methods, apparatuses, etc. herein can be embodied strictly as a software program, as a hybrid of software and hardware, or as hardware alone such as within a processor, or within an operating system or within a software application, or via a non-software application such a person performing all or part of the operations.”).
encode an audio input signal received from a microphone, the audio input signal comprising a request; (Labsky, Par. 0056:” In step 300, the speech recognition response manager receives a spoken utterance at a client electronic device, such as a person speaking to the client device via microphone.”, and Par. 0004:” For example, a speaker can utter a command to execute a specific task, or utter a query to retrieve specific results. Spoken input can follow a rigid set of phrases that perform specific tasks, or spoken input can be natural language, which is interpreted by a natural language unit of a speech recognition system.”).
transmit the encoded audio input signal to a cloud service that is configured to generate a primary response; (Labsky, Par. 0011:” The client device transmits [ implied encoding] at least a portion of the spoken utterance over a communication network to a remote automated speech recognizer that analyzes spoken utterances and returns remote speech recognition results [primary response], such as by a network-accessible server.”).
determine an initial response to the request, the initial response corresponding to a first audio clip; (Labsky, Par. 0013:” …. prior to receiving a remote speech recognition result from the remote automated speech recognizer, initiating a response [initial response] via a user interface of the client electronic device, the response corresponding to the spoken utterance, at least an initial portion of the response is based on a local speech recognition result from the local automated speech recognizer …”).
render the first audio clip for presentation prior to receiving the primary response, the primary response corresponding to a second audio clip; and (Labsky, Par. 0027:” …With techniques herein, however, the client device 112 can initiate a response to spoken utterance prior to having any specific results [primary response]. For example, client device can analyze the spoken utterance 107 and identify that the user is searching for something. In response, the client device can initiate a response via a user interface [render], such as with a text-to-speech system. In this non-limiting example, the local recognizer initiates producing or speaking word 151, ‘Searching the Internet For.’ These introductory or filler words are then modified by adding words 152, ‘Apple Pie Recipe,’ which are presented after words 151. With such a technique, a response to user input is initiated via a user interface prior to having complete results, and then the UI response is modified [in this example the UI response is added-to] to convey results corresponding to the spoken query, such as search results.”, and Par. 0011:”For example, in embodiments in which the user interface includes a text-to-speech system, the client device can begin speaking words as if the client device were already in possession of results, and then add [append] words to the response after receiving results from a remote server. An initial response then begins immediately instead of waiting for results from all recognizers. This reduces a perceived delay by the user because even with the initial response comprising filler or introductory words, the commencement of a response signals to the user that results have been retrieved and the client device is in the process of conveying the results.”).
append the second audio clip to follow the first audio clip, the second audio clip being presented after the presentation of the first audio clip. (Labsky, Par. 0011:” For example, in embodiments in which the user interface includes a text-to-speech system, the client device can begin speaking words as if the client device were already in possession of results, and then add [append] words to the response after receiving results from a remote server. An initial response then begins immediately instead of waiting for results from all recognizers. This reduces a perceived delay by the user because even with the initial response comprising filler or introductory words, the commencement of a response signals to the user that results have been retrieved and the client device is in the process of conveying the results.”, and Par. 0082:”In step 340, the speech recognition response manager modifies the response after the response has been initiated and prior to completing delivery of the response via the user interface such that modifications to the response [appending process] are delivered via the user interface [rendering] as a portion of the response. The modifications are based on the remote speech recognition result.”).
Labsky fails to explicitly disclose, but Zhang teaches, wherein the first audio clip is a predetermined audio clip that is stored in the memory prior to the audio input signal being received from the microphone (Zhang, claim 8:” A method for playing audio, applied to a server, the method comprising: receiving a response request sent by a terminal; obtaining, based on the response request, a plurality of response audio data; sending, in response to determining that at least one piece of response audio data of the plurality of response audio data is synthesized into a response audio clip, the response audio clip to the terminal until finishing sending the plurality of response audio data.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Labsky in view of Zhang such that the first audio clip is a predetermined audio clip that is stored in the memory prior to the audio input signal being received from the microphone, in order to improve accuracy of the information conveyed to the user during the voice playback of the terminal, as evidenced by Zhang (see Par. 0079).

Regarding claim 7, Labsky teaches the system of claim 1, wherein the plurality of instructions, which, when executed, further cause the processor to generate the second audio clip from the primary response. (Labsky, Par. 0011:” For example, in embodiments in which the user interface includes a text-to-speech system, the client device can begin speaking words as if the client device were already in possession of results, and then add [append] words to the response after receiving results [second audio] from a remote server. An initial [primary] response then begins immediately instead of waiting for results from all recognizers. This reduces a perceived delay by the user because even with the initial response comprising filler or introductory words, the commencement of a response signals to the user that results have been retrieved and the client device is in the process of conveying the results.”, and Par. 0082:”In step 340, the speech recognition response manager modifies the response after the response has been initiated and prior to completing delivery of the response via the user interface such that modifications to the response [appending process] are delivered via the user interface [rendering] as a portion of the response. The modifications are based on the remote speech recognition result.”).

Regarding claim 17 Labsky teaches the client device of claim 15, further comprising: categorizing, by the local device, the audio input signal into a directional result, the directional result indicating whether the remote service is able to respond to the request, wherein the initial response is generated according to the directional result. (Labsky, Par. 0011:” The speech recognition response manager or client device receives a spoken utterance at a client electronic device. The client electronic device includes a local automated speech recognizer. The speech recognition response manager analyzes the spoken utterance using the local automated speech recognizer. The client device transmits at least a portion of the spoken utterance over a communication network to a remote automated speech recognizer that analyzes spoken utterances and returns remote speech recognition results, such as by a network-accessible server. Prior to receiving a remote speech recognition result from the remote automated speech recognizer, the speech recognition response manager initiates a response [initial response] via a user interface of the client electronic device. The response corresponds to the spoken utterance [directional response]. At least an initial portion of the response is based on a local speech recognition result [directional response] from the local automated speech recognizer. The speech recognition response manager can then modify the response after the response has been initiated and prior to completing delivery of the response via the user interface such that modifications to the response are delivered via the user interface as a portion of the response. Such modifications are based on the remote speech recognition result. 
For example, in embodiments in which the user interface includes a text-to-speech system, the client device can begin speaking words as if the client device were already in possession of results, and then add words to the response after receiving results from a remote server. An initial response then begins immediately instead of waiting for results from all recognizers.”). Note: A directional result is attained when the continuum result from the server is provided, as taught by Labsky, as if the entire result were spoken from the same source.

Regarding claim 20, Labsky teaches the client device of claim 15, further comprising generating, by the local device, the second audio clip from the primary response. (Labsky, Par. 0011:” For example, in embodiments in which the user interface includes a text-to-speech system, the client device can begin speaking words as if the client device were already in possession of results, and then add [append] words to the response after receiving results [second audio] from a remote server. An initial [primary] response then begins immediately instead of waiting for results from all recognizers. This reduces a perceived delay by the user because even with the initial response comprising filler or introductory words, the commencement of a response signals to the user that results have been retrieved and the client device is in the process of conveying the results.”, and Par. 0082:”In step 340, the speech recognition response manager modifies the response after the response has been initiated and prior to completing delivery of the response via the user interface such that modifications to the response [appending process] are delivered via the user interface [rendering] as a portion of the response. The modifications are based on the remote speech recognition result.”).

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Labsky and Zhang, and further in view of Schiller (US20110035222A1).

Schiller was applied in the previous Office Action.
Regarding claim 3, Labsky and Zhang fail to explicitly disclose, but Schiller teaches, the system of claim 1, wherein the first audio clip is randomly selected from a library of predetermined audio clips stored in the memory. (Schiller, Par. 0018:” The electronic device can select which of several audio clips to play back using any suitable approach. In some embodiments, the user can direct an audio clip for playback. In some embodiments, the electronic device can instead randomly select an audio clip, or cycle through the available audio clips each time an audio clip for the text item is provided. In some embodiments, the electronic device can instead or in addition select an audio clip based on an attribute of a media item being played back. For example, the electronic device can select an audio clip based on an attribute [e.g., metadata] of the played back media, media playlist, past or future media, or any other suitable media item. In some embodiments, the electronic device can select an audio clip based on an attribute of the environment of the electronic device playing back the media.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Labsky and Zhang in view of Schiller such that the first audio clip is randomly selected from a library of predetermined audio clips stored in the memory, in order to enhance a user's ability to interact with such devices, as evidenced by Schiller (see Par. 0003).

Claims 4, 5, 19, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Labsky and Zhang, and further in view of Aher (US20200167384A1).

Aher was applied in the previous Office Action.
Regarding claim 4, Labsky and Zhang fail to explicitly disclose, but Aher teaches, the system of claim 1, wherein the plurality of instructions, which, when executed, further cause the processor to determine the initial response to the request by applying a Deep Neural Network (DNN) algorithm to the audio input signal to generate the initial response. (Aher, Par. 0032:” The control circuitry 304 may include audio processing circuitry and/or audio generation circuitry, other digital encoding or decoding circuitry, or any other suitable audio circuits or combinations of such circuits. Encoding circuitry [e.g., for converting received audio input or digital signals to audio signals for analysis or storage] may also be provided. The audio circuitry may be used by the media device 300 to receive, process, and generate audio input [e.g., the search query 104] or output [e.g., the audio response 108].”, and Par. 0043:” At block 418, the control circuitry 304 generates audio output using the voice profile of the personality [identified at block 414]. The audio output generated by the control circuitry 304 is an audio response 108 that includes the answer to the search query 104. In some embodiments, the audio response 108 further includes the phrase, tune, or jingle identified at block 416 [initial response]. For example, the control circuitry 304 may execute one or more speech synthesis algorithms, including diphone synthesis, domain-specific synthesis, formant synthesis, articulatory synthesis, hidden Markov models-based synthesis, and sinewave synthesis, and/or may employ deep learning neural networks to generate the audio output using the voice profile of the personality.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Labsky and Zhang in view of Aher to further cause the processor to determine the initial response to the request by applying a Deep Neural Network (DNN) algorithm to the audio input signal to generate the initial response, in order to improve the user experience of interactive searching tools by having audio responses that are more interactive and contextually relevant to the search queries provided, as evidenced by Aher (see Par. 0002).

Regarding claim 5, Labsky further teaches wherein the DNN algorithm is configured to categorize the audio input signal into a directional result, the directional result indicating whether the cloud service is able to respond to the request, wherein the initial response is determined according to the directional result. (Labsky, Par. 0011:” The speech recognition response manager or client device receives a spoken utterance at a client electronic device. The client electronic device includes a local automated speech recognizer. The speech recognition response manager analyzes the spoken utterance using the local automated speech recognizer. The client device transmits at least a portion of the spoken utterance over a communication network to a remote automated speech recognizer that analyzes spoken utterances and returns remote speech recognition results, such as by a network-accessible server. Prior to receiving a remote speech recognition result from the remote automated speech recognizer, the speech recognition response manager initiates a response [initial response] via a user interface of the client electronic device. The response corresponds to the spoken utterance [directional response]. At least an initial portion of the response is based on a local speech recognition result [directional response] from the local automated speech recognizer. The speech recognition response manager can then modify the response after the response has been initiated and prior to completing delivery of the response via the user interface such that modifications to the response are delivered via the user interface as a portion of the response. Such modifications are based on the remote speech recognition result. 
For example, in embodiments in which the user interface includes a text-to-speech system, the client device can begin speaking words as if the client device were already in possession of results, and then add words to the response after receiving results from a remote server. An initial response then begins immediately instead of waiting for results from all recognizers.”). Note: A directional result is attained when the continuum result from the server is provided, as taught by Labsky, as if the entire result were spoken from the same source.

Regarding claim 19, Labsky and Zhang fail to explicitly disclose, but Aher teaches, the client device of claim 15, wherein the local device comprises a locally executed artificial intelligence module configured to generate the initial response. (Aher, Par. 0032:” The control circuitry 304 may include audio processing circuitry and/or audio generation circuitry, other digital encoding or decoding circuitry, or any other suitable audio circuits or combinations of such circuits. Encoding circuitry [e.g., for converting received audio input or digital signals to audio signals for analysis or storage] may also be provided. The audio circuitry may be used by the media device 300 to receive, process, and generate audio input [e.g., the search query 104] or output [e.g., the audio response 108].”, and Par. 0043:” At block 418, the control circuitry 304 generates audio output using the voice profile of the personality [identified at block 414]. The audio output generated by the control circuitry 304 is an audio response 108 that includes the answer to the search query 104. In some embodiments, the audio response 108 further includes the phrase, tune, or jingle identified at block 416 [initial response]. For example, the control circuitry 304 may execute one or more speech synthesis algorithms, including diphone synthesis, domain-specific synthesis, formant synthesis, articulatory synthesis, hidden Markov models-based synthesis, and sinewave synthesis, and/or may employ deep learning neural networks [AI] to generate the audio output using the voice profile of the personality.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Labsky and Zhang in view of Aher such that the local device comprises a locally executed artificial intelligence module configured to generate the initial response, in order to improve the user experience of interactive searching tools by having audio responses that are more interactive and contextually relevant to the search queries provided, as evidenced by Aher (see Par. 0002).

Regarding claim 21, Labsky and Zhang fail to explicitly disclose, but Aher teaches, the client device of claim 15, further comprising generating the initial response to the request by applying a Deep Neural Network (DNN) algorithm to the audio input signal to generate the initial response. (Aher, Par. 0032:” The control circuitry 304 may include audio processing circuitry and/or audio generation circuitry, other digital encoding or decoding circuitry, or any other suitable audio circuits or combinations of such circuits. Encoding circuitry [e.g., for converting received audio input or digital signals to audio signals for analysis or storage] may also be provided. The audio circuitry may be used by the media device 300 to receive, process, and generate audio input [e.g., the search query 104] or output [e.g., the audio response 108].”, and Par. 0043:” At block 418, the control circuitry 304 generates audio output using the voice profile of the personality [identified at block 414]. The audio output generated by the control circuitry 304 is an audio response 108 that includes the answer to the search query 104. In some embodiments, the audio response 108 further includes the phrase, tune, or jingle identified at block 416 [initial response]. For example, the control circuitry 304 may execute one or more speech synthesis algorithms, including diphone synthesis, domain-specific synthesis, formant synthesis, articulatory synthesis, hidden Markov models-based synthesis, and sinewave synthesis, and/or may employ deep learning neural networks to generate the audio output using the voice profile of the personality.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Labsky and Zhang in view of Aher to further comprise generating the initial response to the request by applying a Deep Neural Network (DNN) algorithm to the audio input signal to generate the initial response, in order to improve the user experience of interactive searching tools by having audio responses that are more interactive and contextually relevant to the search queries provided, as evidenced by Aher (see Par. 0002).

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Labsky, Zhang, and Aher as set forth above regarding claim 4, from which claim 6 depends, and further in view of Venkataraman (US20170161320A1).

Venkataraman was applied in the previous Office Action.
Regarding claim 6, Labsky and Zhang fail to explicitly disclose, but Aher teaches, wherein the DNN algorithm is configured (Aher, Par. 0043:” … For example, the control circuitry 304 may execute one or more speech synthesis algorithms, including diphone synthesis, domain-specific synthesis, formant synthesis, articulatory synthesis, hidden Markov models-based synthesis, and sinewave synthesis, and/or may employ deep learning neural networks to generate the audio output using the voice profile of the personality.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Labsky and Zhang in view of Aher such that the DNN algorithm is configured, in order to improve the user experience of interactive searching tools by having audio responses that are more interactive and contextually relevant to the search queries provided, as evidenced by Aher (see Par. 0002).
Regarding claim 6, Labsky, Zhang, and Aher fail to explicitly disclose, but Venkataraman teaches, the system of claim 4, [[wherein the DNN algorithm is configured]] to identify a topic associated with the audio input signal, wherein the plurality of instructions, which, when executed, further cause the processor to identify the initial response based on the identified topic. (Venkataraman, Par. 0114:” In some embodiments control circuitry 904 generates a response to a natural language query as described by process 1100 of FIG. 11. At step 1102, the control circuitry receives, from a user input interface, a natural language query.”, and Par. 0116:” At step 1104, control circuitry 904 determines which query template of a plurality of query templates corresponds to the natural language query. As referred to herein, the term ‘query template’ refers to a generalized template for a specific type of query. For example, a basic query template may be ‘Show me <. . . >,’ where the dots represent what the user wants to be shown. A more complete query template may be ‘Show me <. . . >,’ directed by <. . . >. Thus, a natural language query ‘Show me movies [topic] directed by James Cameron’ will fit this query template. It should be noted that these two query templates are used as examples and other more complex query templates may be used by the system described.”, and Par. 0125:” At step 1106, control circuitry 904 retrieves one or more search results corresponding to the natural language query. The control circuitry may determine a search query based on the query template. To continue with example above, ‘show me’ may be excluded from the search because it is part of the template. So, the search string may include terms such as ‘movie,’ ‘directed,’ and ‘James Cameron.’ When the search results are retrieved, the control circuitry may extract movie titles from the results or this can be done by another system before the results reach the control circuitry.”, and Par. 0126:” At step 1108, control circuitry 904 selects, based on a selection criteria, one or more attributes of a plurality of attributes associated with a user. For example, process 1300 of FIG. 13 illustrates one possible method of selecting the one or more attributes. …. Attributes associated with users may be stored in a database on a server and cached locally to a user equipment device as required.”, and Par. 0134:” At step 1110, control circuitry 904 identifies, based on the one or more attributes, a response template of a plurality of response templates previously assigned to the query template. Various ways may be used to make the identification. For example, if the control circuitry determines that time of day in the user's location is selected attribute, the control circuitry may use process 1400 of FIG. 14 to identify an appropriate response template.”, and Par. 0142:” At step 1112, control circuitry 904 generates a response to the natural language query based on the identified response template and the retrieved one or more search results. For example, as described above, if the natural language query is: ‘Who directed Titanic,’ and the control circuitry selects time of day at the user's location as an attribute associated with the user, then the control circuitry may select a response template that is of the shortest length, based on that time of the day. As a result, the control circuitry may generate a response ‘James Cameron’ without any other words.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Labsky, Zhang and Aher in view of Venkataraman to [[wherein the DNN algorithm is configured to]] identify a topic associated with the audio input signal, wherein the plurality of instructions, which, when executed, further cause the processor to identify the initial response based on the identified topic, in order to generate an intelligent response to a natural language query based on one or more attributes associated with a device that receives the query, as evidenced by Venkataraman (see Par. 0152).

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Labsky (US20140058732A1) in view of Eide et al. (US20080167874A1) (herein "Eide").

Labsky was applied in the previous Office Action.
Regarding claim 8, Labsky teaches a client device comprising: a processor of a client device; a memory configured to store a plurality of instructions, which, when executed, cause the processor to: (Labsky, Par. 0012:” One such embodiment comprises a computer program product that has a computer-storage medium [e.g., a non-transitory, tangible, computer-readable media, disparately located or commonly located storage media, computer storage media or medium, etc.] including computer program logic encoded thereon that, when performed in a computerized device having a processor and corresponding memory, programs the processor to perform [or causes the processor to perform] the operations disclosed herein. Such arrangements are typically provided as software, firmware, microcode, code data [e.g., data structures], etc., arranged or encoded on a computer readable storage medium such as an optical medium [e.g., CD-ROM], floppy disk, hard disk, one or more ROM or RAM or PROM chips, an Application Specific Integrated Circuit [ASIC], a field-programmable gate array [FPGA], and so on.”, and Par. 0016:” Also, it is to be understood that each of the systems, methods, apparatuses, etc. herein can be embodied strictly as a software program, as a hybrid of software and hardware, or as hardware alone such as within a processor, or within an operating system or within a software application, or via a non-software application such a person performing all or part of the operations.”).
encode an audio input signal received from a microphone, the audio input signal comprising a request; (Labsky, Par. 0056:” In step 300, the speech recognition response manager receives a spoken utterance at a client electronic device, such as a person speaking to the client device via microphone.”, and Par. 0004:” For example, a speaker can utter a command to execute a specific task, or utter a query to retrieve specific results. Spoken input can follow a rigid set of phrases that perform specific tasks, or spoken input can be natural language, which is interpreted by a natural language unit of a speech recognition system.”).
transmit the encoded audio input signal to a cloud service that is configured to generate a primary response; (Labsky, Par. 0011:” The client device transmits [ implied encoding] at least a portion of the spoken utterance over a communication network to a remote automated speech recognizer that analyzes spoken utterances and returns remote speech recognition results [primary response], such as by a network-accessible server.”).
determine an initial response to the request, the initial response corresponding to a first audio clip; (Labsky, Par. 0013:” …. prior to receiving a remote speech recognition result from the remote automated speech recognizer, initiating a response [initial response] via a user interface of the client electronic device, the response corresponding to the spoken utterance, at least an initial portion of the response is based on a local speech recognition result from the local automated speech recognizer …”).
render the first audio clip for presentation prior to receiving the primary response, the primary response corresponding to a second audio clip; and (Labsky, Par. 0027:” …With techniques herein, however, the client device 112 can initiate a response to spoken utterance prior to having any specific results [primary response]. For example, client device can analyze the spoken utterance 107 and identify that the user is searching for something. In response, the client device can initiate a response via a user interface [render], such as with a text-to-speech system. In this non-limiting example, the local recognizer initiates producing or speaking word 151, ‘Searching the Internet For.’ These introductory or filler words are then modified by adding words 152, ‘Apple Pie Recipe,’ which are presented after words 151. With such a technique, a response to user input is initiated via a user interface prior to having complete results, and then the UI response is modified [in this example the UI response is added-to] to convey results corresponding to the spoken query, such as search results.”, and Par. 0011:”For example, in embodiments in which the user interface includes a text-to-speech system, the client device can begin speaking words as if the client device were already in possession of results, and then add [append] words to the response after receiving results from a remote server. An initial response then begins immediately instead of waiting for results from all recognizers. This reduces a perceived delay by the user because even with the initial response comprising filler or introductory words, the commencement of a response signals to the user that results have been retrieved and the client device is in the process of conveying the results.”).
append the second audio clip to follow the first audio clip, the second audio clip being presented after the presentation of the first audio clip. (Labsky, Par. 0011:” For example, in embodiments in which the user interface includes a text-to-speech system, the client device can begin speaking words as if the client device were already in possession of results, and then add [append] words to the response after receiving results from a remote server. An initial response then begins immediately instead of waiting for results from all recognizers. This reduces a perceived delay by the user because even with the initial response comprising filler or introductory words, the commencement of a response signals to the user that results have been retrieved and the client device is in the process of conveying the results.”, and Par. 0082:”In step 340, the speech recognition response manager modifies the response after the response has been initiated and prior to completing delivery of the response via the user interface such that modifications to the response [appending process] are delivered via the user interface [rendering] as a portion of the response. The modifications are based on the remote speech recognition result.”).
Labsky fails to explicitly disclose, however, Eide teaches wherein the first audio clip is a predetermined audio clip that is stored in the memory prior to the audio input signal being received from the microphone (Eide, Par. 0022: “As described above, as ASR engine 204, NLU unit 206 and NLG block 110 are each processing, a latency results that is equal to the sum of the processing latencies of ASR engine 204, NLU unit 206 and NLG block 210. To mask the resulting latency, ASR engine 204 first signals a filler generator 216 when caller 202 has finished speaking. Filler generator 216 selects a paralinguistic event or canned/fixed phrase from database 218. A speech synthesis system 212 of a TTS system 214 may immediately output or play the paralinguistic event or canned phrase from database 218, or filler generator 216 may delay the output by a few milliseconds before sending the paralinguistic event or canned phrase to speech synthesis system 212. Filler generator 216 may repeat selecting additional paralinguistic events or canned phrases from database 218 to be output by speech synthesis system 212 until NLG block 210 completes the formation of a response. Once NLG block 210 completes the formation of a response to caller 202, filler generator 216 stops selecting paralinguistic events and canned phrases to be output, and speech synthesis system 212 plays or outputs the response formed by NLG block 210 to caller 202.”, and Par. 0023:” The paralinguistic events or canned phrases may be prerecorded into database 218. The paralinguistic events may be selected randomly and may consist of coughs, breaths, and filled pauses such as, “uh,” “um,” and “hmmm.” Similarly, fixed phrases such as “well . . .” or “let's see . . . ” may also be prerecorded into database 200.”, and Par. 0025:”As the ASR engine, NLU unit, and NLG are processing, a latency results that is equal to a sum of the processing latencies of the ASR engine, NLU unit and NLG. 
In block 314, latency is determined by testing whether a response is ready after receiving a communication from a user in block 302. If a response is not ready, a filler generator selects a paralinguistic event or canned phrase from a database in block 316. In block 318, the random paralinguistic event or fixed phrase is conveyed to the user through a speech synthesis system. The methodology then returns to block 314 to determine whether the natural language generator has created the response. If it is determined in block 314 that the response from block 312 is ready, the response is conveyed to the user through the speech synthesis system in communication with the NLG, in block 320, terminating the methodology.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Labsky in view of Eide such that the first audio clip is a predetermined audio clip that is stored in the memory prior to the audio input signal being received from the microphone, in order to provide techniques for masking latency in an automatic dialog system, as evidenced by Eide (see Par. 0007).
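
For illustration only, the latency-masking loop quoted from Eide above (blocks 314 through 320) can be sketched as follows. This is a minimal sketch under stated assumptions and is not part of the record or of Eide's disclosure: the names `mask_latency`, `FILLERS`, `response_ready`, and `get_response` are hypothetical, and a Python list stands in for the speech synthesis output.

```python
import random
from typing import Callable, List

# Hypothetical stand-in for the database of prerecorded paralinguistic
# events and canned phrases described in the quoted passage.
FILLERS = ["uh", "um", "hmmm", "well...", "let's see..."]


def mask_latency(
    response_ready: Callable[[], bool],
    get_response: Callable[[], str],
    rng: random.Random,
) -> List[str]:
    """Return, in order, the items that would be played to the caller."""
    played: List[str] = []
    # Block 314: test whether the generated response is ready.
    while not response_ready():
        # Blocks 316 and 318: select a filler at random and convey it.
        played.append(rng.choice(FILLERS))
    # Block 320: convey the actual response, which ends the loop.
    played.append(get_response())
    return played


# Usage: the response becomes ready on the third readiness check,
# so two fillers are played before the response itself.
checks = iter([False, False, True])
out = mask_latency(lambda: next(checks), lambda: "James Cameron", random.Random(0))
```

In this sketch the fillers are drawn from a list that exists before any request is made, which is the feature the rejection relies on Eide for.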


Claims 9 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Labsky and Eide, and further in view of Zhang (US20210256057A1).

Zhang was applied in the previous Office Action.
Regarding claim 9, Labsky and Eide fail to explicitly disclose, but Zhang teaches, wherein the first audio clip is a predetermined audio clip that is stored in the memory prior to the input signal being received at the client device (Zhang, claim 8: “A method for playing audio, applied to a server, the method comprising: receiving a response request sent by a terminal; obtaining, based on the response request, a plurality of response audio data; sending, in response to determining that at least one piece of response audio data of the plurality of response audio data is synthesized into a response audio clip, the response audio clip to the terminal until finishing sending the plurality of response audio data.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Labsky and Eide in view of Zhang such that the first audio clip is a predetermined audio clip that is stored in the memory prior to the input signal being received at the client device, in order to improve accuracy of the information conveyed to the user during the voice playback of the terminal, as evidenced by Zhang (see Par. 0079).

Regarding claim 10, Labsky and Eide fail to explicitly disclose, but Zhang teaches, wherein the first audio clip is selected from a library of predetermined audio clips stored in the memory (Zhang, claim 8: “A method for playing audio, applied to a server, the method comprising: receiving a response request sent by a terminal; obtaining, based on the response request, a plurality of response audio data; sending, in response to determining that at least one piece of response audio data of the plurality of response audio data is synthesized into a response audio clip, the response audio clip to the terminal until finishing sending the plurality of response audio data.”). Note: the plurality of response audio data implies storage in the memory.
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Labsky and Eide in view of Zhang such that the first audio clip is selected from a library of predetermined audio clips stored in the memory, in order to improve accuracy of the information conveyed to the user during the voice playback of the terminal, as evidenced by Zhang (see Par. 0079).


Claims 11, 12, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Labsky and Eide, and further in view of Aher (US20200167384A1).

Aher was applied in the previous Office Action.
Regarding claim 11, Labsky, and Eide fail to explicitly disclose, however, Aher teaches the system of claim 1, wherein the processor is further configured to determine the initial response to the request by applying an artificial intelligence algorithm to the input signal to generate the initial response. (Aher, Par. 0032:” The control circuitry 304 may include audio processing circuitry and/or audio generation circuitry, other digital encoding or decoding circuitry, or any other suitable audio circuits or combinations of such circuits. Encoding circuitry [e.g., for converting received audio input or digital signals to audio signals for analysis or storage] may also be provided. The audio circuitry may be used by the media device 300 to receive, process, and generate audio input [e.g., the search query 104] or output [e.g., the audio response 108].”, and Par. 0043:” At block 418, the control circuitry 304 generates audio output using the voice profile of the personality [identified at block 414]. The audio output generated by the control circuitry 304 is an audio response 108 that includes the answer to the search query 104. In some embodiments, the audio response 108 further includes the phrase, tune, or jingle identified at block 416 [initial response]. For example, the control circuitry 304 may execute one or more speech synthesis algorithms, including diphone synthesis, domain-specific synthesis, formant synthesis, articulatory synthesis, hidden Markov models-based synthesis, and sinewave synthesis, and/or may employ deep learning neural networks [AI] to generate the audio output using the voice profile of the personality. “).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Labsky and Eide in view of Aher such that the processor is further configured to determine the initial response to the request by applying an artificial intelligence algorithm to the input signal to generate the initial response, in order to improve the user experience of interactive searching tools by having audio responses that are more interactive and contextually relevant to the search queries provided, as evidenced by Aher (see Par. 0002).

Regarding claim 12, Labsky further teaches wherein the artificial intelligence algorithm is configured to categorize the input signal into a binary result, the binary result indicating whether the cloud service is able to respond to the request, wherein the initial response is determined according to the binary result. (Labsky, Par. 0011:” The speech recognition response manager or client device receives a spoken utterance at a client electronic device. The client electronic device includes a local automated speech recognizer. The speech recognition response manager analyzes the spoken utterance using the local automated speech recognizer. The client device transmits at least a portion of the spoken utterance over a communication network to a remote automated speech recognizer that analyzes spoken utterances and returns remote speech recognition results, such as by a network-accessible server. Prior to receiving a remote speech recognition result from the remote automated speech recognizer, the speech recognition response manager initiates a response [initial response] via a user interface of the client electronic device. The response corresponds to the spoken utterance [directional response]. At least an initial portion of the response is based on a local speech recognition result [directional response] from the local automated speech recognizer. The speech recognition response manager can then modify the response after the response has been initiated and prior to completing delivery of the response via the user interface such that modifications to the response are delivered via the user interface as a portion of the response. Such modifications are based on the remote speech recognition result. 
For example, in embodiments in which the user interface includes a text-to-speech system, the client device can begin speaking words as if the client device were already in possession of results, and then add words to the response after receiving results from a remote server. An initial response then begins immediately instead of waiting for results from all recognizers.”, and Par. 0085:” In step 360, the speech recognition response manager identifies that the initiated response conveyed via a text-to-speech system is incorrect [binary zero] based on remote speech recognition results, and in response corrects [binary one] the initiated response using an audible excuse transition, such as word-based [“pardon me”] and/or otherwise [throat clearing sound].”) Note: directional result is attained when the continuum result from the server is provided as taught by Labsky as if the entire result is spoken from the same source.

Regarding claim 14, Labsky further teaches the client device of claim 11, wherein the second audio clip is received from the cloud service as part of the primary response. (Labsky, Par. 0011:” For example, in embodiments in which the user interface includes a text-to-speech system, the client device can begin speaking words as if the client device were already in possession of results, and then add [append] words to the response after receiving results from a remote server [cloud]. An initial response then begins immediately instead of waiting for results from all recognizers. This reduces a perceived delay by the user because even with the initial response comprising filler or introductory words, the commencement of a response signals to the user that results have been retrieved and the client device is in the process of conveying the results.”).
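
For illustration only, the ordering relied upon from Labsky (an initial response begun from the local result before the remote result arrives, with the remote result then appended) can be sketched as below. This is a minimal sketch under stated assumptions, not Labsky's implementation: `respond`, `fetch_remote_result`, and `speak` are hypothetical names, and a blocking function call stands in for the round trip to the remote recognizer.

```python
from typing import Callable, List


def respond(
    local_intro: str,
    fetch_remote_result: Callable[[], str],
    speak: Callable[[str], None],
) -> None:
    # Initiate the response from the local result before any remote
    # result has been received.
    speak(local_intro)
    # Blocking call standing in for the remote recognizer or server.
    remote = fetch_remote_result()
    # Append the remote result so that it is presented after the
    # introductory words.
    speak(remote)


# Usage: record what would be spoken, in order.
spoken: List[str] = []
respond("Searching the Internet for", lambda: "apple pie recipe", spoken.append)
```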

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Labsky, Eide, and Aher, as set forth above regarding claim 11 from which claim 13 depends, and further in view of Venkataraman (US20170161320A1).

Venkataraman was applied in the previous Office Action.
Regarding claim 13, Labsky and Eide fail to explicitly disclose, but Aher teaches, wherein the artificial intelligence algorithm is configured (Aher, Par. 0043: “… For example, the control circuitry 304 may execute one or more speech synthesis algorithms, including diphone synthesis, domain-specific synthesis, formant synthesis, articulatory synthesis, hidden Markov models-based synthesis, and sinewave synthesis, and/or may employ deep learning neural networks [artificial intelligence] to generate the audio output using the voice profile of the personality.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Labsky and Eide in view of Aher such that the artificial intelligence algorithm is configured as claimed, in order to improve the user experience of interactive searching tools by having audio responses that are more interactive and contextually relevant to the search queries provided, as evidenced by Aher (see Par. 0002).
Regarding claim 13, Labsky, Eide and Aher fail to explicitly disclose, however, Venkataraman teaches the client device of claim 11, [[wherein the artificial intelligence algorithm is configured]] to identify a topic associated with the input signal, wherein the plurality of instructions, which, when executed, further cause the processor to identify the initial response based on the identified topic. (Venkataraman, Par. 0114:” In some embodiments control circuitry 904 generates a response to a natural language query as described by process 1100 of FIG. 11. At step 1102, the control circuitry receives, from a user input interface, a natural language query. “, and Par. 0116:” At step 1104, control circuitry 904 determines which query template of a plurality of query templates corresponds to the natural language query. As referred to herein, the term ‘query template’ refers to a generalized template for a specific type of query. For example, a basic query template may be ‘Show me <. . . >,’ where the dots represent what the user wants to be shown. A more complete query template may be ‘Show me <. . . >,’ directed by <. . . >. Thus, a natural language query ‘Show me movies [topic] directed by James Cameron will fit this query template. It should be noted that these two query templates are used as examples and other more complex query templates may be used by the system described.”, and Par. 0125:” At step 1106, control circuitry 904 retrieves one or more search results corresponding to the natural language query. The control circuitry may determine a search query based on the query template. To continue with example above, ‘show me’ may be excluded from the search because it is part of the template. 
So, the search string may include terms such as ‘movie,’ ‘directed,’ and ‘James Cameron.” When the search results are retrieved, the control circuitry may extract movie titles from the results or this can be done by another system before the results reach the control circuitry.”, and par. 0126:” At step 1108, control circuitry 904 selects, based on a selection criteria, one or more attributes of a plurality of attributes associated with a user. For example, process 1300 of FIG. 13 illustrates one possible method of selecting the one or more attributes. …. Attributes associated with users may be stored in a database on a server and cached locally to a user equipment device as required.”, and Par. 0134:” At step 1110, control circuitry 904 identifies, based on the one or more attributes, a response template of a plurality of response templates previously assigned to the query template. Various ways may be used to make the identification. For example, if the control circuitry determines that time of day in the user's location is selected attribute, the control circuitry may use process 1400 of FIG. 14 to identify an appropriate response template.’, and Par. 0142:’ At step 1112, control circuitry 904 generates a response to the natural language query based on the identified response template and the retrieved one or more search results. For example, as described above, if the natural language query is: ‘Who directed Titanic,’ and the control circuitry selects time of day at the user's location as an attribute associated with the user, then the control circuitry may select a response template that is of the shortest length, based on that time of the day. As a result, the control circuitry may generate a response ‘James Cameron’ without any other words.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Labsky, Eide and Aher in view of Venkataraman to [[wherein the artificial intelligence algorithm is configured to]] identify a topic associated with the input signal, wherein the plurality of instructions, which, when executed, further cause the processor to identify the initial response based on the identified topic, in order to generate an intelligent response to a natural language query based on one or more attributes associated with a device that receives the query, as evidenced by Venkataraman (see Par. 0152).

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Labsky and Zhang, and further in view of Venkataraman.

Regarding claim 18, Labsky, and Zhang fail to explicitly disclose, however, Venkataraman teaches the client device of claim 15, further comprising:  identifying, by the local device, a topic associated with the audio input signal; and identify, by the local device, the initial response based on the identified topic. (Venkataraman, Par. 0114:” In some embodiments control circuitry 904 generates a response to a natural language query as described by process 1100 of FIG. 11. At step 1102, the control circuitry receives, from a user input interface, a natural language query. “, and Par. 0116:” At step 1104, control circuitry 904 determines which query template of a plurality of query templates corresponds to the natural language query. As referred to herein, the term ‘query template’ refers to a generalized template for a specific type of query. For example, a basic query template may be ‘Show me <. . . >,’ where the dots represent what the user wants to be shown. A more complete query template may be ‘Show me <. . . >,’ directed by <. . . >. Thus, a natural language query ‘Show me movies [topic] directed by James Cameron will fit this query template. It should be noted that these two query templates are used as examples and other more complex query templates may be used by the system described.”, and Par. 0125:” At step 1106, control circuitry 904 retrieves one or more search results corresponding to the natural language query. The control circuitry may determine a search query based on the query template. To continue with example above, ‘show me’ may be excluded from the search because it is part of the template. So, the search string may include terms such as ‘movie,’ ‘directed,’ and ‘James Cameron.” When the search results are retrieved, the control circuitry may extract movie titles from the results or this can be done by another system before the results reach the control circuitry.”, and par. 
0126:” At step 1108, control circuitry 904 selects, based on a selection criteria, one or more attributes of a plurality of attributes associated with a user. For example, process 1300 of FIG. 13 illustrates one possible method of selecting the one or more attributes. …. Attributes associated with users may be stored in a database on a server and cached locally to a user equipment device as required.”, and Par. 0134:” At step 1110, control circuitry 904 identifies, based on the one or more attributes, a response template of a plurality of response templates previously assigned to the query template. Various ways may be used to make the identification. For example, if the control circuitry determines that time of day in the user's location is selected attribute, the control circuitry may use process 1400 of FIG. 14 to identify an appropriate response template.’, and Par. 0142:’ At step 1112, control circuitry 904 generates a response to the natural language query based on the identified response template and the retrieved one or more search results. For example, as described above, if the natural language query is: ‘Who directed Titanic,’ and the control circuitry selects time of day at the user's location as an attribute associated with the user, then the control circuitry may select a response template that is of the shortest length, based on that time of the day. As a result, the control circuitry may generate a response ‘James Cameron’ without any other words.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Labsky and Zhang in view of Venkataraman to include identifying, by the local device, a topic associated with the audio input signal, and identifying, by the local device, the initial response based on the identified topic, in order to generate an intelligent response to a natural language query based on one or more attributes associated with a device that receives the query, as evidenced by Venkataraman (see Par. 0152).
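
For illustration only, the query-template matching quoted from Venkataraman (step 1104, with the bracketed "topic" slot) can be sketched as below. This is a minimal sketch under stated assumptions: regular expressions are a swapped-in technique chosen here for brevity, not something the quoted passages specify, and `QUERY_TEMPLATES` and `match_template` are hypothetical names.

```python
import re
from typing import Dict, Optional

# Hypothetical templates patterned on the quoted "Show me <...>" and
# "Show me <...> directed by <...>" examples. The more specific
# template is listed first so that it is tried first.
QUERY_TEMPLATES = {
    "show_directed_by": re.compile(
        r"^show me (?P<topic>.+?) directed by (?P<director>.+)$", re.I
    ),
    "show": re.compile(r"^show me (?P<topic>.+)$", re.I),
}


def match_template(query: str) -> Optional[Dict[str, str]]:
    """Return the first matching template name with its filled slots."""
    for name, pattern in QUERY_TEMPLATES.items():
        m = pattern.match(query.strip())
        if m:
            return {"template": name, **m.groupdict()}
    return None


# Usage: the example query from the quoted passage.
result = match_template("Show me movies directed by James Cameron")
```

Here the `topic` slot is filled with "movies", corresponding to the bracketed [topic] annotation in the quoted passage.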



Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. Goel (US-11194973B1) teaches at Col. 2, lines 24-33: “Certain systems may be configured to perform actions responsive to user inputs. For example, for the user input of “Alexa, play Adele music,” a system may output music sung by an artist named Adele. For further example, for the user input of “Alexa, what is the weather,” a system may output synthesized speech representing weather information for a geographic location of the user. In a further example, for the user input of “Alexa, send a message to John,” a system may capture spoken message content and cause same to be output via a device registered to “John.””
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DARIOUSH AGAHI whose telephone number is (408)918-7689. The examiner can normally be reached Monday - Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/DARIOUSH AGAHI/Examiner, Art Unit 2656

/MICHELLE M KOETH/Primary Examiner, Art Unit 2656