Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Detailed Action
2.	This Final Office Action is responsive to Applicant’s amendments and arguments filed 02/22/21.  Claims 1-20 remain pending, of which claims 1, 17, and 19 are independent.

Claim Rejections - 35 USC § 103
3.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  

4.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office Action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


5.	The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

6.	Claims 1-4, 7, 9-12, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent Application Publication No. 2019/0121612 (“Ashoori”) in view of U.S. Patent Application Publication No. 2018/0217810 (“Agrawal”).
Regarding claim 1, ASHOORI teaches a method comprising: 
displaying, on a display device of a client device, a user interface from an application that is active on the client device (a “handheld computer” and the like, or an “information handling system”, are examples of a “client device” featuring a “display device” as recited (see [0036], [0041]); the aforementioned computer/system is configured to capture/gather audio, which is parsed to determine audio components/segments, which are then persistently associated with “user interface traces” captured within an application window “prior to or following the audio capture or gathering” (see [0044]-[0045]) [in other words, an application window is equivalent to the recited “user interface” that is subject to “displaying … from an application”]) and in response to the user interface being displayed, storing, in memory of the client device, audio data generated from a transducer on the client device (per [0044], the taught computer/system is configured to capture/gather audio, which is parsed to determine audio components/segments, and thereby teaches or at least implies a storing of audio data as recited, for example to facilitate the parsing as discussed (with a memory element per [0075]-[0076] feasibly used for the storing), and the capture/gathering of audio is accomplished using a microphone per at least [0042] and [0053] (which Applicants’ specification lists as an example or type of “transducer” as recited)); 
in response to the user interface being displayed, activating a machine learning scheme pre-associated with the user interface, the machine learning scheme comprising a machine learning model that is trained to detect a set of one or more keywords in audio data, the machine learning scheme being one of a plurality of machine learning schemes stored … , each of the plurality of machine learning schemes being trained with different sets of one or more keywords and in response to the user interface being displayed, detecting, using the machine learning scheme, a portion of the audio data as one of the keywords used to train the machine learning scheme ([0031] teaching “multiple reasoning algorithms …” which are applied to perform analysis of captured audio to determine/recognize keywords, e.g. per [0033]-[0034], the analysis is a result of searching the knowledge base / “corpus” per [0034] and [0029]-[0030], and particularly the corpus is a knowledge base that is grown/populated per [0042] and [0044]-[0045] (where the iterative/incremental growth/populating of a corpus in this manner is a training / machine learning aspect as recited)).

As discussed above, Ashoori’s capture and analysis of audio, e.g. verbal input from a user, may eventually result in the execution of a “user interface trace” (Ashoori’s [0029]), which is the device/system performing, as a result of verbal input processing and keyword recognition, a UI action.  That said, Ashoori does not teach that such a UI action as taught results in display of pre-associated UI content, e.g. per the further limitation of in response to detecting the portion of the audio data as one of the keywords, displaying user interface content pre-associated with the one of the keywords.  Rather, the Examiner relies upon AGRAWAL to teach what Ashoori may otherwise lack, see e.g. Agrawal’s voice-driven framework per FIG. 2 and [0015] and [0017]-[0027] where numerous examples would result in the presentation of a UI element responsive to the command (e.g., one of ordinary skill in the art would understand that a “reply” command would result in the display of a UI feature that allows a user to perform a reply function to a messaging/communication element, and for example similarly a “share” command would result in the display of a UI feature that allows the user to perform the share function to one of the listed social media platforms/services per [0015], and for example the “more” command is described to result in the invocation of additional but presently unseen menus, and so forth (and where the presented UI element provided responsive to the verbal invocation is a “UI content” as recited, and could feasibly be pre-associated with the verbal prompt/directive for example in a framework such as Ashoori’s)).
Further, Agrawal’s framework, as cited just above, is defined in part by a “resource file” for the application, maintained on the device, which is used to generate the recognized “grammar list” (Agrawal’s [0016]-[0017], with clarifying examples provided per [0018]-[0027]).  Hence, even though Ashoori’s similar processing is taught to be performed in its server component per FIG. 1, Agrawal’s comparable system provides an example where at least parts of the similar processing elements are maintained on the client device.  In that sense, the Examiner believes Ashoori as modified by Agrawal likewise teaches the further limitation of the machine learning scheme being one of a plurality of machine learning schemes stored on the client device.
Both Ashoori and Agrawal relate to verbal input processing to facilitate request/command directives for/by a user in comparable user/client device frameworks.  Hence, the aforementioned prior art references are similarly directed and therefore analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate some of the concrete benefits and implementation details of Agrawal’s commands/directives into Ashoori’s general/comparable framework, with a reasonable expectation of success, such that the result of Ashoori’s taught framework of verbal input processing and keyword recognition is the performance of tasks/actions (as Ashoori contemplates), and particularly tasks/actions that might have an element resulting in the display of “UI content” (e.g., some onscreen/graphic element provided for the user’s benefit) as Agrawal concretely teaches, with the result being a useful and convenient extension of Ashoori’s verbal/voice UI to perform tasks/actions/functions routine in the state of the art (e.g., “reply”, “share”, and the like).

Regarding claim 2, Ashoori in view of Agrawal teach the method of claim 1, as discussed above.  The aforementioned references further teach wherein the user interface is from a plurality of user interfaces of the application, and each of the plurality of machine learning schemes is pre-associated with one or more of the plurality of user interfaces of the application (per Ashoori’s [0044], the taught computer/system is configured to capture/gather audio, which is parsed to determine audio components/segments (i.e., matching the determined/parsed audio with pre-parsed and maintained text translated from previous audio in the corpus), and is operative in recognition of all applications and particularly an active/activated application per [0029], e.g. such that it follows that the corpus and numerous “reasoning algorithms” ([0031]) can be used to service voice-recognized invocation for a variety of applications; and further per Agrawal’s [0014]-[0015], each application and hence each UI would have its own defined set of “menu items” and hence corresponding “grammar list” per [0016]-[0017], and it follows that Ashoori’s corpus and reasoning algorithms and/or Agrawal’s grammar lists can be understood to encompass all of the applications operative in the device/system and hence the various screens or UI that may be operative for any one such application).  The motivation for combining the references is as discussed above in relation to claim 1.

Regarding claim 3, Ashoori in view of Agrawal teach the method of claim 2, as discussed above.  The aforementioned references further teach displaying, on the display device of the client device, an additional user interface from the plurality of user interfaces and in response to displaying the additional user interface, activating an additional machine learning scheme from the plurality of machine learning schemes on the client device, the additional machine learning scheme being trained to detect an additional keyword that is not in the set of one or more keywords used to train the machine learning scheme and detecting, using the additional machine learning scheme, an additional portion of the audio data as the additional keyword and in response to detecting the additional portion of the audio data as the additional keyword, displaying additional user interface content pre-associated with the additional keyword (Ashoori’s [0031] teaching “multiple reasoning algorithms …” which are applied to perform analysis of captured audio to determine/recognize keywords, e.g. per [0033]-[0034], the analysis is a result of searching the knowledge base / “corpus” per [0034] and [0029]-[0030], and particularly the corpus is a knowledge base that is grown/populated per [0042] and [0044]-[0045] (where the iterative/incremental growth/populating of a corpus in this manner is a training / machine learning aspect as recited) and would be specifically executed per Ashoori’s framework by listening, recognizing, and storing new keywords for observed actions that can later be invoked verbally as a UI trace as taught therein).  The motivation for combining the references is as discussed above in relation to claim 1.

Regarding claim 4, Ashoori in view of Agrawal teach the method of claim 1, as discussed above.  The aforementioned references further teach wherein the different sets of one or more keywords includes a first set of keywords and a second set of keywords, the first set of keywords including at least one keyword that is not in the second set of keywords (Ashoori’s [0031] teaching “multiple reasoning algorithms …” which are applied to perform analysis of captured audio to determine/recognize keywords, e.g. per [0033]-[0034], and it logically follows that different reasoning algorithms would be pertinent to different client usage and hence different commands and hence different keywords corresponding to those commands; alternatively/additionally, per Agrawal’s [0014]-[0015], numerous specific examples of applications are provided, and hence it logically follows that the framework is one that contemplates delivery of its experience/benefits for diverse applications, each with its own defined “grammar list” per [0016]-[0017] and hence its own set of keywords).  The motivation for combining the references is as discussed above in relation to claim 1.

Regarding claim 7, Ashoori in view of Agrawal teach the method of claim 1, as discussed above.  The aforementioned references further teach wherein one or more of the plurality of machine learning schemes is configured to identify portions of audio data as keywords using audio template matching data (Ashoori’s [0029] and [0031] teaching a framework that performs, as a result of verbal input processing and keyword recognition, a UI action, and additionally/alternatively Agrawal’s [0029] discussing the analysis of the microphone-obtained audio sample to identify a user’s voice command on the basis of a matching comparison with the grammar list that is operable for the current menu screen of a currently-active application (where either of the references can be said to perform voice/audio recognition of registered/acknowledged speech that essentially amounts to “audio template matching” as recited)).  The motivation for combining the references is as discussed above in relation to claim 1.

Regarding claim 9, Ashoori in view of Agrawal teach the method of claim 1, as discussed above.  The aforementioned references further teach wherein displaying user interface content includes displaying a selection of one or more user interface elements of the user interface being displayed on the client device (Agrawal’s [0015] discussing, for example, a “share” command/action by the user, such that “the selected control” is shared by the user to another platform such as a social media platform, and the Examiner construes this teaching to necessarily involve in a graphical/visual sense: (i) the user’s 

Regarding claim 10, Ashoori in view of Agrawal teach the method of claim 1, as discussed above.  The aforementioned references further teach wherein displaying user interface content includes generating an image using the client device, and displaying the image on the display device as the user interface content (Agrawal: some examples of the commands invoked by the voice-driven framework per FIG. 2 are discussed in [0015] and again per [0017]-[0027], and numerous examples would result in the presentation of a UI element responsive to the command.  For example, one of ordinary skill in the art would understand that a “reply” command would result in the display of a UI feature that allows a user to perform a reply function to a messaging/communication element; similarly, a “share” command would result in the display of a UI feature that allows the user to perform the share function to one of the listed social media platforms/services per [0015]; and the “more” command is described to result in the invocation of additional but presently unseen menus, and so forth.  It logically follows that to accomplish these display aspects the device would necessarily generate a corresponding image or image data to render to the device’s screen, e.g. so that the user would be able to see/view the UI to perform a messaging/communication reply, the UI to share the selected control to a social media platform, the additional menus that were previously hidden, and so forth).  The motivation for combining the references is as discussed above in relation to claim 1.

Regarding claim 11, Ashoori in view of Agrawal teach the method of claim 1, as discussed above.  The aforementioned references further teach wherein each of the set of one or more of the keywords used for training has pre-associated user interface content that is displayable upon a given keyword being detected by one of the machine learning schemes (per Ashoori’s [0044], the taught computer/system is configured to capture/gather audio, which is parsed to determine audio components/segments (i.e., matching the determined/parsed audio with pre-parsed and maintained text translated from previous audio in the corpus), and is operative in recognition of all applications and particularly an active/activated application per [0029], e.g. such that it follows that the corpus and numerous “reasoning algorithms” ([0031]) can be used to service voice-recognized invocation for a variety of applications; and further per Agrawal’s [0014]-[0015], each application and hence each UI would have its own defined set of “menu items” and hence corresponding “grammar list” per [0016]-[0017], and it follows that Ashoori’s corpus and reasoning algorithms and/or Agrawal’s grammar lists can be understood to encompass all of the applications operative in the device/system and hence the various screens or UI that may be operative for any one such application).  The motivation for combining the references is as discussed above in relation to claim 1.

Regarding claim 12, Ashoori in view of Agrawal teach the method of claim 1, as discussed above.  The aforementioned references further teach publishing, to a network site, the user interface content displayed in response to detecting the portion of the audio data using the machine learning scheme (Agrawal’s [0015] discussing, for example, a “share” command/action by the user, such that “the selected control” is shared by the user to another platform such as a social media platform (e.g., “a network site” that is being published to, as recited)).  The motivation for combining the references is as discussed above in relation to claim 1.

Regarding claim 17, the claim includes the same or similar limitations as discussed above in relation to claim 1, and is therefore rejected under the same rationale.  The additional recitation of one or more processors is further taught, e.g. Ashoori’s “processor” per [0079] and/or Agrawal’s “processor” per [0010].

Regarding claim 18, the claim includes the same or similar limitations as discussed above in relation to claim 2, and is therefore rejected under the same rationale.

Regarding claim 19, the claim includes the same or similar limitations as discussed above in relation to claim 1, and is therefore rejected under the same rationale.  The additional recitation of a machine-readable storage device is further taught, e.g. Ashoori’s “computer readable storage medium” per [0075] and [0080].

Regarding claim 20, the claim includes the same or similar limitations as discussed above in relation to claim 2, and is therefore rejected under the same rationale.


7.	Claims 5-6 are rejected under 35 U.S.C. 103 as being unpatentable over Ashoori in view of Agrawal and further in view of U.S. Patent Application Publication No. 2013/0121796 (“Deisher”).
Regarding claim 5, Ashoori in view of Agrawal teach the method of claim 1, as discussed above.  While the aforementioned references further teach one or more … the plurality of machine learning schemes (Ashoori’s corpus and reasoning algorithms and/or Agrawal’s grammar lists, e.g. as discussed above in relation to claims 1-2), they do not teach the entirety of the further limitation wherein one or more of the plurality of machine learning schemes is a neural network configured to process audio data.  Rather, the Examiner relies upon DEISHER to teach that which Ashoori and Agrawal may otherwise lack, see e.g. Deisher’s [0031] discussing a “speech recognition processing” framework that leverages “neural networks” specifically, and where the use context similarly involves device-captured audio by way of a microphone per [0037].
Ashoori and Agrawal and Deisher relate to speech/audio-driven frameworks to facilitate software/device usability, particularly by leveraging cloud/networked resources.  Hence, the aforementioned references are similarly directed and therefore analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement Ashoori’s and Agrawal’s combined framework, and particularly the audio-processing aspect, using neural networks per Deisher with a reasonable expectation of success, e.g. to take advantage of well-known benefits of offloading intensive computational loads from the client device to a cloud/network (see, e.g., Deisher’s [0003]) and thereby improve performance of the framework, generally.

Regarding claim 6, Ashoori in view of Agrawal and further in view of Deisher teach the method of claim 5, as discussed above.  The aforementioned references teach the further limitation wherein the neural network is a recurrent neural network (Deisher’s [0003] (“recurrent layer”) and [0238]).  The motivation for combining the references is as discussed above in relation to claim 5.


8.	Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Ashoori in view of Agrawal and further in view of U.S. Patent Application Publication No. 2019/0196698 (“Cohen”).
Regarding claim 8, Ashoori in view of Agrawal teach the method of claim 1, as discussed above.  The aforementioned references teach “displaying user interface content”, e.g. as discussed above in relation to claim 1, but not in a manner that includes displaying an image effect on one or more images that are captured using an image capture sensor on the client device.  Rather, the Examiner relies upon COHEN to teach what Ashoori and Agrawal do not, see e.g. Cohen’s [0002] discussing voice/speech-driven image editing, e.g. on camera-equipped devices such as smartphones per [0001] (the same class of device that Agrawal contemplates per Agrawal’s [0011], and that Ashoori contemplates per [0036] and [0041]).
Ashoori and Agrawal and Cohen relate to voice/speech-driven device management and processing, and are therefore similarly directed and hence analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Cohen’s voice/speech-driven image processing aspect with Ashoori’s and Agrawal’s combined framework that provides a similar interaction/engagement aspect for a wide array of functions/tasks, with a reasonable expectation of success, e.g. to provide yet another known and desired function to Ashoori’s/Agrawal’s framework that would allow its users to perform image processing/editing in the speech/voice-driven manner that Agrawal already contemplates, and thereby extending Ashoori’s/Agrawal’s benefits to a wider array of functionality that a user might desire.

9.	Claims 13-14 are rejected under 35 U.S.C. 103 as being unpatentable over Ashoori in view of Agrawal and further in view of U.S. Patent Application Publication No. 2018/0039478 (“Sung”).
Regarding claim 13, Ashoori in view of Agrawal teach the method of claim 1, as discussed above.  The aforementioned references, e.g. Agrawal, teach a device framework that permits a user to engage with the device in a speech/voice-driven manner to perform the various device functions, but do not teach the further limitation of displaying a visual instruction on the display device that prompts a user of the client device to speak.  Rather, the Examiner relies upon SUNG to teach that which Ashoori and Agrawal may otherwise lack, see e.g. Sung’s [0030] discussing a comparable speech/voice-driven framework, and particularly one where the user is graphically/textually prompted to provide the speech/voice input/cues.
Ashoori and Agrawal and Sung relate to voice/speech-driven device management and processing, and are therefore similarly directed and hence analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Sung’s explicit teaching of a graphical/visual prompt with Ashoori’s and Agrawal’s combined framework with a reasonable expectation of success, e.g. to provide guidance and program-flow structure to Ashoori’s/Agrawal’s framework so that users are provided with the necessary notice as to what is expected of them to further engage.

Regarding claim 14, Ashoori in view of Agrawal and further in view of Sung teach the method of claim 13, as discussed above.  The aforementioned references teach the further limitation wherein the visual instruction is pre-associated with the one of the keywords (Sung’s [0030] discussing a comparable speech/voice-driven framework, and particularly one where the user is graphically/textually prompted to provide the speech/voice input/cues, and it would have been obvious to prompt a user for an expected input in certain situations and hence the association between a prompt and an input is a reasonable implication).  The motivation for combining the references is as discussed above in relation to claim 13.


10.	Claims 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Ashoori in view of Agrawal and further in view of U.S. Patent No. 9,986,394 (“Taylor”).
Regarding claim 15, Ashoori in view of Agrawal teach the method of claim 1, as discussed above.  The aforementioned references further teach in response to the user interface no longer being displayed on the display device, deactivating the machine learning scheme … on the client device (Agrawal’s [0031] discussing that a determined grammar list is flushed and repopulated in view of a context switch to a different application/UI for example).  That said, Ashoori and Agrawal do not teach the further limitation of terminating storing of audio data on the client device in response to the UI no longer being displayed.  Rather, the Examiner relies upon TAYLOR to teach that which the aforementioned references may otherwise lack, see e.g. Taylor’s column 18 lines 14-18 discussing the deletion of acquired audio data after a determined duration of time.
Ashoori and Agrawal and Taylor relate to voice/speech-driven device management and processing, and are therefore similarly directed and hence analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Taylor’s timely deletion/purge of acquired audio data with Ashoori’s and Agrawal’s combined framework with a reasonable expectation of success, e.g. to free memory/storage resources for information that is no longer needed.

Regarding claim 16, Ashoori in view of Agrawal teach the method of claim 1, as discussed above.  The aforementioned references do not teach the further limitation wherein the audio data is buffered such that only a most recent predetermined period of time is stored in the memory of the client device.  Rather, the Examiner relies upon TAYLOR to teach that which Ashoori and Agrawal may otherwise lack, see e.g. Taylor’s column 18 lines 14-18 discussing the deletion of acquired audio data after a determined duration of time.
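Ashoori and Agrawal and Taylor relate to voice/speech-driven device management and processing, and are therefore similarly directed and hence analogous.  It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate Taylor’s timely deletion/purge of acquired audio data with Ashoori’s and Agrawal’s combined framework with a reasonable expectation of success, e.g. to free memory/storage resources for information that is no longer needed.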


Response to Arguments
11.	Applicants’ arguments with respect to the pending claims have been carefully considered but are respectfully moot in view of the newly-formulated grounds of rejection presented herein.  In particular, the Examiner believes Ashoori’s framework for iteratively growing a corpus of keywords associated with UI traces provides a bedrock that can be used to parse vocal/verbal input for identifiable keywords that serve as the user’s directives to leverage a verbal/voice UI and thereby invoke device/system functionality.


Conclusion
12.	The prior art made of record and not relied upon is considered pertinent to Applicants’ disclosure:
US 2019/0121611 (“Ashoori”)
US 2017/0161382 (“Ouimet”)

Applicants’ amendment necessitated the new ground(s) of rejection presented in this Office Action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicants are reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHOURJO DASGUPTA whose telephone number is (571)272-7207.  The examiner can normally be reached on M-F 8am-5pm CST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sherief Badawi, can be reached on 571-272-9782.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/SHOURJO DASGUPTA/Primary Examiner, Art Unit 2174