DETAILED ACTION
This action is responsive to the Amendment filed on 02/21/2022. Claims 21, 23-31, 33-40 are pending in the case. Claims 21 and 31 are the independent claims.
This office action is FINAL.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Applicant’s Response
In Applicant’s response dated 02/21/2022 (hereinafter Response), Applicant amended Claims 21 and 31; cancelled Claims 22 and 32; and argued against all objections and rejections previously set forth in the Office Action dated 12/09/2021.
Applicant’s amendments to claims 21, 23-31, 33-40 to further clarify the metes and bounds of the invention are acknowledged.
It is noted that the subject matter of independent claims 21 and 31 is different from the subject matter of (now canceled) claims 22 and 32, as can be seen in the comparison below of pending claim 21 and previous claim 22 (additions are underlined, deletions are struck-through):

Pending claim 21:
A computer-implemented method, comprising:
causing a graphical user interface (GUI) to be displayed on a device;
receiving a user selection corresponding to an application displayed using the GUI;
receiving input audio data corresponding to a user utterance;
performing speech processing using the input audio data to determine natural language understanding (NLU) data, the NLU data including intent data corresponding to the user utterance;
determining, based at least in part on the intent data, a first portion of output data;
determining the intent data corresponds to a first invocation of the application;
sending the intent data to the application;
receiving a second portion of the output data from the application;
performing text-to-speech processing on the first portion of the output data and the second portion of the output data to determine output audio data; and
causing the output audio data to be sent to the device.

Previous claim 22 (in independent form):
A computer-implemented method, comprising:
causing a graphical user interface (GUI) to be displayed on a device;
receiving a user selection corresponding to an application displayed using the GUI;
receiving input audio data corresponding to a user utterance;
performing speech processing using the input audio data to determine natural language understanding (NLU) data, the NLU data including intent data corresponding to the user utterance;
determining, based at least in part on the intent data, a first portion of output data;
performing text-to-speech processing on the first portion of the output data to determine output audio data; and
causing output audio data to be sent to the device;
determining the intent data corresponds to a first invocation of the application;
sending the intent data to the application; and
receiving a second portion of the output data from the application.


Response to Amendment/Arguments
Applicant’s prior art arguments with respect to the pending claims have been fully considered but are moot in view of the new grounds of rejection presented below, which are required in response to the Applicant’s amendments.
	
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 21, 23-31, 33-40 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. 
Regarding claim 21, the claim weaves limitations from two distinct embodiments described in the instant application, where the combination as recited is not described.
The first embodiment is with respect to the preview of a skill (application), prior to its enablement (see e.g. [0011-0012, 0019, 0021, 0029]) and includes the limitations not struck through below:
causing a graphical user interface (GUI) to be displayed on a device;
receiving a user selection corresponding to an application displayed using the GUI;



{determining the user’s intent is to preview the application based on the user selection input}
determining, based at least in part on the intent data, a first portion of output data;

sending the intent data to the application;
receiving a second portion of the output data from the application;
performing text-to-speech processing on the first portion of the output data and the second portion of the output data to determine output audio data; and
causing the output audio data to be sent to the device.
The second embodiment is with respect to the invocation of an enabled skill (see e.g. [0009, 0019, 0034]) and includes the limitations not struck through below:


receiving input audio data corresponding to a user utterance;
performing speech processing using the input audio data to determine natural language understanding (NLU) data, the NLU data including intent data corresponding to the user utterance;

determining the intent data corresponds to a first invocation of {an} application;
sending the intent data to the application;
receiving a second portion of the output data from the application;
performing text-to-speech processing on 
causing the output audio data to be sent to the device.
The claim mixes these two embodiments by reciting:
determining the intent data {of the user utterance} corresponds to a first invocation of the application {selected by the user in the graphical user interface};
however, this element is not described in the instant application in this manner.
The disclosure explains how a skill which is not enabled may be previewed by receiving a touch input (e.g. long touch) provided to the user interface; if the user likes the skill, the user can then enable it so that the device will respond to the invocation utterance in the future. The disclosure never clearly describes the user requesting a preview of a skill (or application) using voice (audio) input. At best, the device may also be capable of receiving voice (audio) or other types of input (e.g. [0043-0044], [0057-0058]); however, only touch input is ever described as causing the system to determine the first and second output data portions (i.e. the example invocation text and the sample result for the skill) such that they are provided in a (single) output audio data.
Regarding claim 31, the claim is directed to the system for performing the method of claim 21 and is therefore rejected under similar rationale.
Regarding dependent claims 23-30 and 33-40, dependent claims necessarily inherit the deficiencies of their respective parent claim.
For purposes of rejection in view of cited art, only the plain meanings of the claim elements have been considered, as it is improper to import any limitations from the disclosure as to how the elements are intended to be interpreted by an end user (e.g. requesting a preview of a non-enabled skill vs activating an enabled skill).
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 21, 23-25, 27-28, 30-31, 33-35, 37-38, 40 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by MARTEL et al. (Pub. No.: US 2017/0185375 A1, newly cited).
Regarding claim 21, MARTEL may be relied upon to teach the computer-implemented method (relying on the various data flows between components in FIG 7B and broadly-recited process steps in FIGs 8A-8B), comprising:
causing a graphical user interface (GUI) to be displayed on a device ([0235] FIGS. 9A-D illustrate exemplary user interfaces of an electronic device for proactive assistance based on dialog communication between devices according to various examples);
receiving a user selection corresponding to an application displayed using the GUI ([0275] In some examples, the one or more tasks performed at block 816 {of FIG 8B, cited again below} can be specific to an application of the electronic device… in response to detecting a user input associated with the information on the user interface of the application, a corresponding action can be executed using the application);
receiving input audio data corresponding to a user utterance (FIG 8A (808) detect user input; 8A (810) is audio stream quality sufficient; 7B [0205] obtain speech input including responses to follow-up questions);
performing speech processing using the input audio data to determine natural language understanding (NLU) data, the NLU data including intent data corresponding to the user utterance (FIG 8A (812) generate text representation of speech; FIG 8B (814) text representation corresponds to one of plurality of types of information? (816) perform one or more tasks; FIG 7B [0205] forward speech to STT processing [0206] which can extract representative features from the speech input …produce intermediate recognitions results (e.g., phonemes, phonemic strings, and sub-words), and ultimately, text recognition results (e.g., words, word strings, or sequence of tokens)… Once STT processing module 730 produces recognition results containing a text string ( e.g., words, or sequence of words, or sequence of tokens), the recognition result can be passed to natural language processing module 732 for intent deduction [0212] Natural language processing module 732 ("natural language processor") of the digital assistant can take the sequence of words or tokens ("token sequence") generated by STT processing module 730, and attempt to associate the token sequence with one or more "actionable intents" recognized by the digital assistant);
determining, based at least in part on the intent data, a first portion of output data (FIG 8B (816) perform one or more tasks; FIG 7B [0225] once natural language processing module 732 identifies an actionable intent (or domain) based on the user request, natural language processing module 732 can generate a structured query to represent the identified actionable intent…[0226] natural language processing module 732 can pass the generated structured query…to task flow processing module 736 to complete the structured query…[0227] invokes dialogue processing, presents dialogue output to the user via audio and/or visual output);
determining the intent data corresponds to a first invocation of the application (FIG 8B (816) perform one or more tasks; [0275] In some examples, the one or more tasks performed at block 816 {of FIG 8B, cited again below} can be specific to an application of the electronic device; FIG 7B [0228] Once task flow processing module 736 has completed the structured query for an actionable intent, task flow processing module 736 can proceed to perform the ultimate task associated with the actionable intent…[0229] task flow processing module 736 can employ the assistance of service processing module 738 ("service processing module") to complete a task requested in the user input or to provide an informational answer requested in the user input);
sending the intent data to the application (FIG 8B (816) perform one or more tasks; FIG 7B [0229] make a phone call, set a calendar entry, invoke a map search, invoke or interact with other user applications installed on the user device, and invoke or interact with third-party services (e.g., a restaurant reservation portal, a social networking website, a banking portal, etc.) …request made via API);
receiving a second portion of the output data from the application (FIG 8B (816) perform one or more tasks; FIG 7B [0231] …and finally generate a response (i.e. an output to the user)…);
performing text-to-speech processing on the first portion of the output data and the second portion of the output data to determine output audio data (FIG 7B [0232] Speech synthesis module 740 can be configured to synthesize speech outputs for presentation to the user. Speech synthesis module 740 synthesizes speech outputs based on text provided by the digital assistant. For example, the generated dialogue response can be in the form of a text string …[0233] instead of (or in addition to) using speech synthesis module 740, speech synthesis can be performed on a remote device (e.g., the server system 108)); and
causing the output audio data to be sent to the device (FIG 7B [0233]… and the synthesized speech can be sent to the user device for output to the user.).
Regarding claim 31, MARTEL similarly teaches the system (system 100 in FIG 1) comprising at least one processor; and at least one memory comprising instructions that, when executed by the at least one processor (structural components of server 108, client 104 shown in FIG 6B), cause the system to perform the operations of the method of claim 21; claim 31 is thus rejected under similar rationale.
Claims 22, 32 – canceled.
Regarding dependent claim 23 (33), incorporating the rejection of claim 21 (31), MARTEL further teaches wherein the GUI is configured to display data corresponding to a plurality of applications (see at least FIGs 9A (event, phone call information); 9C (different suggested actions); FIG 9D (icons 914 represent different applications or services which could be invoked when searching)).
Regarding dependent claim 24 (34), incorporating the rejection of claim 21 (31), MARTEL further teaches receiving, from the device, a first identifier corresponding to the application, wherein a representation corresponding to the first identifier was selected from a displayed set of options on the GUI ([0226] task flow processing module 736 receives the structured query from natural language processing module 732 and performs the task based on task flow models 754; [0228] task flow processing module 736 can execute the steps and instructions in the task flow model according to the specific parameters contained in the structured query; interpreting “identifier corresponding to the application” as the mapping within the task flow between a structured query such as: {restaurant reservation, restaurant=ABC Cafe, date=3/12/2012, time=7 pm, party size=5}, and the operations performed such as steps of: (1) logging onto a server of the ABC Cafe or a restaurant reservation system such as OPENTABLE®, (2) entering the date, time, and party size information in a form on the website, (3) submitting the form, and (4) making a calendar entry for the reservation in the user's calendar).
Regarding dependent claim 25 (35), incorporating the rejection of claim 21 (31), MARTEL further teaches wherein the user selection corresponds to a touch input ([0196] I/O interface 706 can couple input/output devices 716 of digital assistant system 700, such as displays, keyboards, touch screens, and microphones, to user interface module 722…[0243] user input can include a user selection of an affordance on the electronic device…displayed on the touchscreen; contrast with [0245] user input can be a voice command).
Regarding dependent claim 27 (37), incorporating the rejection of claim 21 (31), MARTEL further teaches causing a microphone associated with the device to determine the input audio data based at least in part on audio corresponding to the user utterance ([0240] At block 808, a user input can be detected. The user input can correspond to an action indicating that the user intends to perform a task on the electronic device related to the dialog…detected by microphone 213; see also [0196, 0201, 0205]).
Regarding dependent claim 28 (38), incorporating the rejection of claim 27 (37), MARTEL further teaches wherein the causing the microphone to determine the audio is performed at least in part in response to the user selection (see e.g. [0240] block 808, a user input can be detected. The user input can correspond to an action indicating that the user intends to perform a task on the electronic device related to the dialog [0243] user selection of a displayed affordance [0245] voice command received from microphone [0252] at least a portion of the stream of audio data can be less than the entire stream of audio data. In particular, the portion of the stream of audio data can include a predetermined duration of the stream of audio data prior to detecting the user input and a portion of the stream of audio data received after detecting the user input (or trigger condition) at block 808…. [0253] It can be advantageous for the user to trigger speech-to-text processing for the at least a portion of the stream of audio data.)
Regarding dependent claim 30 (40), incorporating the rejection of claim 21 (31), MARTEL further teaches wherein performing speech processing using the input audio data to determine the intent data is based at least in part on the application ([0247] the user input can define a portion of the text. For example, the user input can include highlighting or selecting the portion of the text via a user interface of the electronic device. The portion of text can correspond to the portion to be analyzed at block 814 to determine whether the portion contains information corresponding to one of a plurality of types of information; where as noted above, (814) is concerned with identifying what the user ultimately intends for an operation to be performed; [0263] block 814 can include performing natural language processing on the text representation (e.g., using natural language processing 732) to determine a domain corresponding to the text representation).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 26 and 36 are rejected under 35 USC 103 as unpatentable over MARTEL in view of GONG, Li (Pub. No.: US 2003/0167167 A1, previously cited).
Regarding dependent claim 26, incorporating the rejection of claim 21, MARTEL does not appear to expressly disclose determining, based at least in part on the application, data corresponding to a voice type, wherein the performing the text-to-speech processing is based at least in part on the data corresponding to the voice type. At best, MARTEL broadly describes the speech synthesis module 740 at [0232-0233], but this description is silent with respect to “voice type” (interpreted as, for example, using inflection or generating a regional accent).
GONG is similarly directed to (abstract) an intelligent personal assistant (agent) that assists a user in operating a computing device and using application programs on the computing device. As can be seen in FIG 3, information about the user (305) as well as application information (310) are used to adapt (using adaptation engine 330) to generate verbal responses (340) which are modified by an affect generator (360). FIG 5 is a specific method for assisting the device user and generating appropriate visual (facial expression of assistant) and verbal (vocal expression of assistant) feedback. GONG states:
[0046] The verbal generator 340 then sends the textual verbal content to an I/O device for the computer device, typically a display device, or a text-to-speech generation program that converts the text to speech and sends the speech to a speech synthesizer.
[0047] The affect generator 360 receives information from the adaptation engine 330 and produces the affective expression for the intelligent social agent 350. The affect generator 360 produces facial expressions and vocal expressions for the intelligent social agent 350 based on an indication from the dynamic adaptor module 336 as to what emotion the intelligent social agent 350 should express. 
[0069] The processor then generates the appropriate affect for the verbal expression of the intelligent social agent (step 555). This may be accomplished by modifying the speech style from the baseline style of speech for the intelligent social agent. Speech style may include speech rate, pitch average, pitch range, intensity, voice quality, pitch changes, and level of articulation.

Applicant may also wish to note the intended use of affect generator 360 in the overall system, for example as explained with respect to FIG 9:
[0095] an architecture 900 of an intelligent personal assistant helping a user to operate applications in a computing device. The intelligent personal assistant 910 may assist the user 915 across various application programs or functions. As described with respect to FIGS. 3 and 7, intelligent personal assistant 910 interacts with the user 915 and the application programs 920 in a computing device, including basic functions relating to the device itself and applications running on the device such as enterprise applications.

Thus GONG may clearly be relied upon to teach determining, based at least in part on the application (the context and content information for the request), data corresponding to a voice type (the affect for vocal expression), wherein the performing the text-to-speech processing is based at least in part on the data corresponding to the voice type (modifying the speech style from the baseline style of speech for the intelligent social agent. Speech style may include speech rate, pitch average, pitch range, intensity, voice quality, pitch changes, and level of articulation; all used by speech generation program that converts the text to speech and sends the speech to a speech synthesizer).
Accordingly, it would have been obvious to one having ordinary skill in graphical user interfaces before the effective filing date of the claimed invention, having the teachings of MARTEL and GONG before them, to have combined MARTEL (teaching text-to-speech conversion with respect to an automated assistant) and GONG (teaching a specific text-to-speech mechanism which adapts the vocal expression based on context information including the user and the application with respect to an automated assistant) and arrived at the claimed invention with expected and predictable results, motivated by GONG [0055] process 500 may help an intelligent social agent to act appropriately based on the user and the application context, for example [0041] more relaxed, [0042] reflect the user is happy or energetic, [0043] be apologetic when the user is frustrated with the device or the assistant itself.
Claims 29 and 39 are rejected under 35 USC 103 as unpatentable over MARTEL in view of GRUBER et al. (Pub. No.: US 2013/0275164 A1, previously cited).
Regarding dependent claim 29 (39), incorporating the rejection of claim 21 (31), MARTEL does not appear to expressly disclose determining a second user selection; and in response to the second user selection, ceasing certain processing with regard to the input audio data. It is noted that MARTEL incorporates by reference at [0234] the parent application of GRUBER (U.S. Utility application Ser. No. 12/987,982) in order to include additional details on digital assistants.
GRUBER is similarly directed to the operations of a voice assistant system (see e.g. FIGs 3, 4 [0080] computing device 60 suitable for implementing at least a portion of the intelligent automated assistant features/functionalities… end-user, network server or server system; note also FIG 5 showing multiple clients communicating with multiple servers and service providers). 
GRUBER teaches determining a second user selection (the breadth of interpretation includes receiving user input to close (or switch away from) the current conversation dialog because the user is finished (at least for the current session), see FIG 33, [0687] (790) User is done?, or selecting a different application that does not require voice interaction. Note that GRUBER provides suggestions which the user can select (see, for example, FIG 35, which shows an example set of restaurant results that the user can interact with (generating a map of the results by clicking “Map All”, calling a particular restaurant with “Call”)); these at least suggest that the user is finished with the conversation with the digital assistant); and in response to the second user selection (if the user has indicated they are finished with their conversation with the digital assistant), ceasing (the digital assistant will stop) certain {any additional} processing {of the conversation loop} with regard to the input audio data (no more conversational input required).
Accordingly, it would have been obvious to one having ordinary skill in graphical user interfaces before the effective filing date of the claimed invention, having the teachings of MARTEL and GRUBER before them, to have used the interaction model in GRUBER (for stopping a first dialog in order to initiate a second dialog) with the digital assistant of MARTEL with a reasonable expectation of success, the combination motivated by the teaching in MARTEL [0234] where additional details about the operation of a digital assistant may be found in the GRUBER family of patent applications.
It is noted that any citation to specific pages, columns, lines, or figures in the prior art references and any interpretation of the references should not be considered to be limiting in any way. “The use of patents as references is not limited to what the patentees describe as their own inventions or to the problems with which they are concerned. They are part of the literature of the art, relevant for all they contain.” In re Heck, 699 F.2d 1331, 1332-33, 216 USPQ 1038, 1039 (Fed. Cir. 1983) (quoting In re Lemelson, 397 F.2d 1006, 1009, 158 USPQ 275, 277 (CCPA 1968)). Further, a reference may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art, including nonpreferred embodiments. Merck & Co. v. Biocraft Laboratories, 874 F.2d 804, 10 USPQ2d 1843 (Fed. Cir.), cert. denied, 493 U.S. 975 (1989). See also Upsher-Smith Labs. v. Pamlab, LLC, 412 F.3d 1319, 1323, 75 USPQ2d 1213, 1215 (Fed. Cir. 2005); Celeritas Technologies Ltd. v. Rockwell International Corp., 150 F.3d 1354, 1361, 47 USPQ2d 1516, 1522-23 (Fed. Cir. 1998).


CONCLUSION
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
US 8924219 B1 (FIG 2 shows capturing second audio is responsive to determining which application was intended from first captured audio)

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMY M LEVY whose telephone number is 571-270-3771.  The examiner can normally be reached on Mon-Fri 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KIEU VU, can be reached at 571-272-4057. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/Amy M Levy/
Primary Examiner, Art Unit 2173