DETAILED ACTION
This Office Action is in response to the correspondence filed by the applicant on 5/7/2021.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Receipt is acknowledged of certified copies of papers submitted under 35 U.S.C. 119(a)-(d), which papers have been placed of record in the file.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 5/7/2021 has been entered.

Response to Arguments
Applicant’s arguments, pages 7-13, with respect to the rejection of claims under 35 U.S.C. 103 have been fully considered and, upon further consideration, are moot in view of the new ground(s) of rejection made under AIA 35 U.S.C. 103 as being unpatentable over ZHANG (US 2017/0110128 A1), and further in view of KLEIN2 (US 2018/0113672 A1) for Claims 1-4, 8, and 16-19; and as being unpatentable over KLEIN2 (US 2018/0113672 A1), and further in view of THANGARATHNAM (US 2019/0179606 A1) for Claims 9-11, and 15.  Please see the rejections below for more details.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 8, and 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over ZHANG (US 2017/0110128 A1), and further in view of KLEIN2 (US 2018/0113672 A1).

REGARDING CLAIM 1, ZHANG discloses a voice control method, comprising: 
automatically triggering a voice receiver for receiving voice data in response to recognizing user input performed on a current interaction interface (ZHANG Par 39 – “In at least one embodiment of the disclosure, a user may use a remote control to trigger the terminal into voice control mode, or use a push button on the terminal to trigger the terminal into voice control mode.”) [of a touch display screen, wherein the user input is a touch gesture performed in any area of the current interaction interface];
directly receiving voice data (ZHANG Fig. 6 – “After a terminal has been triggered into voice control mode, receiving input voice data and obtaining corresponding speech text according to the voice data 100”; Par 81 – “For example, in a case that the speech text corresponding to the voice data input by the user is “watch American Dreams in China”, the terminal compares the character string “watch American Dreams in China” as a whole with character strings in the interface word list corresponding to the operating interface in FIG. 3, ….”;) on the current interaction interface without switching to a specific voice input interface (ZHANG Fig. 3; Par 43 – “For example, FIG. 3 illustrates the current operating interface, and the interface word list corresponding to the current operating system includes … ”; Par 39 – “In at least one embodiment of the disclosure, a user may use a remote control to trigger the terminal into voice control mode, or use a push button on the terminal to trigger the terminal into voice control mode. As an example, a user pushes voice control button on a remote control to trigger a smart TV into voice control mode and a voice input module starts to monitor voice data input by the user in real time. As an example, the voice input module has a voice recording function, which is realized by a microphone on the smart TV or a remote control corresponding to the smart TV.”); 
determining an action keyword based on the voice data (ZHANG Par 40 – “For example, when the user inputs voice data such as “play American Dreams in China”, “watch Bride Wars”, “watch”, “Let's Get Married”, and “Yawen Zhu”, the terminal could use the voice input module to receive above voice data input by the user … ”; Par 81 – “For example, in a case that the speech text corresponding to the voice data input by the user is “watch American Dreams in China” …  The terminal then semantically comprehends the speech text “watch American Dreams in China” input by the user, and the semantical comprehension's result is that the user wants to play the movie “American Dreams in China”.”); 
determining an object keyword (ZHANG Par 40 – “For example, when the user inputs voice data such as “play American Dreams in China”, “watch Bride Wars”, “watch”, “Let's Get Married”, and “Yawen Zhu”, the terminal could use the voice input module to receive above voice data input by the user …”; Par 81 – “For example, in a case that the speech text corresponding to the voice data input by the user is “watch American Dreams in China” …  The terminal then semantically comprehends the speech text “watch American Dreams in China” input by the user, and the semantical comprehension's result is that the user wants to play the movie “American Dreams in China”.”) [based on a location where the user input is performed on the interaction interface]; 
generating a control instruction based on the action keyword and the object keyword (ZHANG Par 81 – “The terminal then generate a corresponding control command according to the semantical comprehension's result which is: play the movie “American Dreams in China”. Then the terminal performs the control command which is to play the movie “American Dreams in China” and the display interface of the terminal displays the play interface of the movie “American Dreams in China”.”), 
wherein the control instruction is used for controlling an object indicated by the object keyword (ZHANG Par 81 – “The terminal then semantically comprehends the speech text “watch American Dreams in China” input by the user, and the semantical comprehension's result is that the user wants to play the movie “American Dreams in China”. The terminal then generate a corresponding control command according to the semantical comprehension's result which is: play the movie “American Dreams in China”. Then the terminal performs the control command which is to play the movie “American Dreams in China” and the display interface of the terminal displays the play interface of the movie “American Dreams in China”.”).
ZHANG is silent to the [square-bracketed] limitations.


KLEIN2 discloses the [square-bracketed] limitations. KLEIN2 discloses a voice control method, comprising: 
automatically triggering a voice receiver for receiving voice data (KLEIN2 Fig. 4 – “Intelligent Personal Assistant 116”; Par 24 – “The IPA 116 includes a speech recognition module 140. Voice commands 142 are spoken into a microphone of the computing device 100. The speech recognition module 140 (or a remote service that it communicates with) uses known speech recognition algorithms and statistical models (e.g., Gaussian Mixture Models and Hidden Markov Models) to convert the voice commands 142 into text.”; Par 56 – “The pressure-triggered input modality described above can complement other aspects of the IPA. For example, when a touch input is determined to satisfy the pressure condition that triggers an interaction with the IPA, the IPA may enter an active-listening mode for speech commands, obviating the need for an interaction with the computing device specifically for the purpose of putting the IPA in a listening mode. In other words, if a touch input's pressure diverts the touch input (or an object selected thereby) to the IPA, the IPA/device can also respond to the touch input or object by beginning to capture audio from the microphone and interpreting any detected voice input command. An interaction such as “share this with my spouse” in combination with a pressure-filtered touch input allows seamless interaction, through the IPA, with an object represented on the display of the relevant computing device.”) in response to recognizing user input performed on a current interaction interface of a touch display screen (KLEIN2 Fig. 5 – “touch input sensed with pressure P”; Par 31 – “FIG. 5 shows a process for using pressure of a touch input 200 to resolve an exophoric phrase 201 of a voice command 202. At step 204 the touch input 200 is sensed with pressure P. The touch input 200 is coincident with a graphic object 206 of the application 190.”; Par 33 – “The step 210 of determining whether the pressure condition is satisfied can be implemented in numerous ways. When the touch input 200 is sensed, pressure can be sensed and associated with the touch input 200 by including a pressure value with the one or more points of the touch input 200.”; Par 28 – “Examples of objects are files, Uniform Resource Identifiers, messages, emails, elements in structured documents (e.g., elements of markup code), contacts, applications, user interface elements (e.g. views, containers, controls, windows) etc. Most objects are exposed to the user by respective graphic objects. The term “graphic object”, as used herein, refers to any discrete graphic element displayed on the display 102 to represent an object.”), wherein the user input is a touch gesture performed in any area of the current interaction interface (KLEIN2 Fig. 2 – “Sensing Surface 122”; Fig. 4; Par 22 – “FIG. 2 shows additional details of the computing device 100. When a physical pointer 120 such as a finger or stylus contacts a sensing surface 122, the sensing surface 122 generates location signals that indicate the locations of the corresponding points of the sensing surface 122 contacted by the physical pointer 120. The sensing surface 122 also generates pressure signals that indicate measures of force applied to the sensing surface 122 by the physical pointer 120.”; Par 29 – “The pressure filter 114 evaluates the pressure properties against a pressure condition 184. If the pressure condition 184 is satisfied, then the corresponding touch input is provided to the IPA 116 but not the application layer 180.”);
directly receiving voice data on the current interaction interface without switching to a specific voice input interface (KLEIN2 Figs. 8-10; Par 54 – “Moreover, the use of a pressure filter or condition to differentiate touch inputs that are intended for the IPA avoids input association and interpretation conflicts with respect to the software managing the underlying user interface comprised of the graphic objects. The underlying ordinary user interface such as a graphical user shell, applications, etc. can continue to function as expected without modification to accommodate the IPA.”); 
determining an action keyword based on the voice data (KLEIN2 Figs. 8-10; Par 50 – “FIG. 8 shows how certain features of a pressure-filtered touch input 158 can be used to help resolve exophoras in a voice command. When a voice command includes a direction such as “over there”, the direction of the corresponding filtered touch input can be analyzed to determine where “there” refers to. A voice command 142 to “copy this over here” can resolve both “this” as well as “over there”, which may be a direction of the touch input or an endpoint of the touch input.”); 
determining an object keyword [based on a location where the user input is performed on the interaction interface] (KLEIN2 Figs. 8-10; Par 27 – “The IPA 116 in FIG. 3 includes a touch input handler 156 to incorporate touch inputs 158 into the voice command processing pipeline. As will be described in detail below, touch inputs can be used for exophoric resolution. That is, touch inputs can be used to link exophoric phrases such as “this”, “that”, “them”, “those”, “it”, etc., to graphic objects that correspond to actionable objects stored on the computing device 100 (a “phrase”, as referred to herein, is one or more words within a command). An exophoric phrase is a word or phrase in a command that references something not in the command (or in earlier or later commands). Exophoric phrases may refer to things in the past, present, or future.”); 
generating a control instruction based on the action keyword and the object keyword (KLEIN2 Fig. 5 – “do {action} to that”; Par 32 – “At step 214, the IPA 116 identifies the most probable target graphic. For example, the IPA 116 may select whichever graphic object has the greatest intersection with the touch input 200. Other techniques for step 214 are described below. At step 216, given an identified graphic object 206, the corresponding object 208 is identified. At step 218, the IPA 116 links the exophora 201 to the object 208, thus enabling an action 209 of the voice-inputted command 202 to be carried out for the object 208.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of ZHANG to include receiving a touch input for triggering speech recognition and determining an object based on a location of a user input, as taught by KLEIN2.
One of ordinary skill would have been motivated to include receiving a touch input for triggering speech recognition and determining an object based on a location of a user input, in order to reduce verbose redundant operations (KLEIN2 Par 2).


REGARDING CLAIM 2, ZHANG in view of KLEIN2 discloses the method according to claim 1.
ZHANG further discloses wherein the determining an action keyword based on the voice data (ZHANG Par 40 – “For example, when the user inputs voice data such as “play American Dreams in China”, “watch Bride Wars”, “watch”, “Let's Get Married”, and “Yawen Zhu”, the terminal could use the voice input module to receive above voice data input by the user … ”; Par 81 – “For example, in a case that the speech text corresponding to the voice data input by the user is “watch American Dreams in China” …  The terminal then semantically comprehends the speech text “watch American Dreams in China” input by the user, and the semantical comprehension's result is that the user wants to play the movie “American Dreams in China”.”) comprises: 
converting the voice data into text data (ZHANG Fig. 6 – “After a terminal has been triggered into voice control mode, receiving input voice data and obtaining corresponding speech text according to the voice data 100”); and 
determining the action keyword based on the text data (ZHANG Par 81 – “The terminal then semantically comprehends the speech text “watch American Dreams in China” input by the user, and the semantical comprehension's result is that the user wants to play the movie “American Dreams in China”.”).

REGARDING CLAIM 3, ZHANG in view of KLEIN2 discloses the method according to claim 2.
ZHANG further discloses wherein the determining the action keyword based on the text data (ZHANG Par 81 – “The terminal then semantically comprehends the speech text “watch American Dreams in China” input by the user, and the semantical comprehension's result is that the user wants to play the movie “American Dreams in China”.”) comprises:
matching the text data with preset instruction type text data (ZHANG Par 81 – “For example, in a case that the speech text corresponding to the voice data input by the user is “watch American Dreams in China”, the terminal compares the character string “watch American Dreams in China” as a whole with character strings in the interface word list corresponding to the operating interface in FIG. 3, and finds that there are the character strings “watch” and “American Dreams in China” in the interface word list corresponding to the operating interface in FIG. 3, but no character string “watch American Dreams in China”, then the match fails.”); and 
determining the action keyword based on a matching result (ZHANG Par 81 – “… finds that there are the character strings “watch” and “American Dreams in China” in the interface word list corresponding to the operating interface in FIG. 3, but no character string “watch American Dreams in China”, then the match fails. The terminal then semantically comprehends the speech text “watch American Dreams in China” input by the user, and the semantical comprehension's result is that the user wants to play the movie “American Dreams in China”. The terminal then generate a corresponding control command according to the semantical comprehension's result which is: play the movie “American Dreams in China”. Then the terminal performs the control command which is to play the movie “American Dreams in China” and the display interface of the terminal displays the play interface of the movie “American Dreams in China”.”).

REGARDING CLAIM 4, ZHANG in view of KLEIN2 discloses the method according to claim 2.
ZHANG further discloses wherein the determining the action keyword based on the text data comprises: 
determining the action keyword by performing semantic analysis on the text data (ZHANG Par 81 – “The terminal then semantically comprehends the speech text “watch American Dreams in China” input by the user, and the semantical comprehension's result is that the user wants to play the movie “American Dreams in China”.”).

REGARDING CLAIM 8, ZHANG in view of KLEIN2 discloses the method according to claim 1.
ZHANG further discloses the method/system further comprising: 
executing the control instruction (ZHANG Par 81 – “Then the terminal performs the control command which is to play the movie “American Dreams in China” and the display interface of the terminal displays the play interface of the movie “American Dreams in China”.”).




REGARDING CLAIM 16, ZHANG in view of KLEIN2 discloses a voice control device, comprising: one or more processors; and a memory storing one or more programs, wherein the one or more processors execute the one or more programs (ZHANG Fig. 8) to perform operations of: the steps of Claim 1.  Thus, it is rejected under the same rationale.

Claim 17 is a device similar to the method of Claim 2; thus, it is rejected under the same rationale.

Claim 18 is a device similar to the method of Claim 3; thus, it is rejected under the same rationale.

REGARDING CLAIM 19, ZHANG in view of KLEIN2 discloses the device according to claim 17.  ZHANG further discloses wherein the one or more processors execute the one or more programs (ZHANG Fig. 8) to perform an operation of: performing semantic analysis on the text data to determine the action keyword (ZHANG Par 81 – “The terminal then semantically comprehends the speech text “watch American Dreams in China” input by the user, and the semantical comprehension's result is that the user wants to play the movie “American Dreams in China”.”).







Claims 5 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over ZHANG in view of KLEIN2, and further in view of ZEIGLER (US 2015/0143241 A1).


REGARDING CLAIM 5, ZHANG in view of KLEIN2 discloses the method according to claim 3, wherein the generating a control instruction based on the action keyword and the object keyword (ZHANG Par 81 – “The terminal then generate a corresponding control command according to the semantical comprehension's result which is: play the movie “American Dreams in China”. Then the terminal performs the control command which is to play the movie “American Dreams in China” and the display interface of the terminal displays the play interface of the movie “American Dreams in China”.”) comprises: mapping multiple action words (e.g., “play American Dreams in China” and  “watch American Dreams in China” in Pars 40 and 81) to a control instruction (e.g., playing the movie).  Thus, ZHANG implicitly teaches a second motion/action keyword (e.g., watch or play) to generate the control instruction.

ZEIGLER explicitly teaches the limitations.  ZEIGLER discloses a method/system for controlling a user interface using voice commands comprising: 
matching the text data with action keywords in preset instruction type text data to determine a second motion keyword (ZEIGLER Par 35 – “In step 208, the system determines whether the bit string correlates to a stored, predefined navigation command to navigate to a URL. For example, the system may look for the words “Navigate To . . . ,” or “Go To . . . ” Other words may be used for the navigation command in addition to or instead of that phrase in further embodiments. If the navigation command is not detected, the system may return to step 204 to look for further spoken words and commands.”; In other words, a user can use different words (e.g., “go to” or “navigate to”) for the same action (e.g., navigation command). Thus, “go to” is matched to a predefined navigation command to “navigate”), 
wherein the second motion keyword refers to an action keyword matched in the preset instruction type text data (ZEIGLER Par 35 – “Operation of the present technology will now be described with reference to the block diagram of FIG. 3 and the flowcharts of FIGS. 4-9. Referring to FIG. 4, in step 204, the speech recognition engine 192 determines whether voice is detected from microphone 30, and if so, it is converted into a bit string in step 206. In step 208, the system determines whether the bit string correlates to a stored, predefined navigation command to navigate to a URL. For example, the system may look for the words “Navigate To . . . ,” or “Go To . . . ” Other words may be used for the navigation command in addition to or instead of that phrase in further embodiments.”);  
generating the control instruction based on the second motion keyword and the object keyword (ZEIGLER Par 37 – “If an expression following the navigation command is detected in step 210, the system attempts to resolve the expression into a known URL in step 214.”; Par 44 – “The top URLs engine 196 may additionally or alternatively be used to map a spoken expression to a URL in step 214 of FIG. 4.  … For example, a user seeking the URL: www.nytimes.com may speak the name of that website any of a variety of different ways, including for example:“h t t p colon back slash back slash w w w dot new york times dot com” “h t t p colon slash slash w w w dot n y times dot com” “w w w dot new york times dot com” “w w w dot n y times dot com” “new york times dot com” “new york times” “the new york times” “n y times””; Par 64 – “On the other hand, if a match to the spoken expression is found in step 238 with sufficient confidence, the top URLs engine 196 returns the URL stored in association with the matched expression in step 240. The flow then returns to step 216 (FIG. 4), where the identified URL is shown to the user.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of ZHANG in view of KLEIN2 to include a second action/object keyword for generating a control instruction, as taught by ZEIGLER.
One of ordinary skill would have been motivated to include a second action/object keyword for generating a control instruction, in order to provide the best interpretation of the voice input (ZEIGLER Par 3).


REGARDING CLAIM 20, ZHANG in view of KLEIN2 discloses the device according to claim 18, wherein the one or more processors execute the one or more programs to perform operations of: 
mapping multiple action words (e.g., “play American Dreams in China” and  “watch American Dreams in China” in Pars 40 and 81) to a control instruction (e.g., playing the movie).  Thus, ZHANG implicitly teaches a second motion/action keyword (e.g., watch or play) to generate the control instruction.
ZEIGLER explicitly teaches the limitations.  ZEIGLER discloses a method/system for controlling a user interface using voice commands comprising: 
matching the text data with action keywords in preset instruction type text data (ZEIGLER Par 35 – “In step 208, the system determines whether the bit string correlates to a stored, predefined navigation command to navigate to a URL. For example, the system may look for the words “Navigate To . . . ,” or “Go To . . . ” Other words may be used for the navigation command in addition to or instead of that phrase in further embodiments. If the navigation command is not detected, the system may return to step 204 to look for further spoken words and commands.”) and determining a second motion keyword (ZEIGLER Par 35; In other words, a user can use different words (e.g., “go to” or “navigate to”) for the same action (e.g., navigation command). Thus, “go to” is matched to a predefined navigation command to “navigate”), wherein the second motion keyword refers to an action keyword matched in the preset instruction type text data (ZEIGLER Par 35 – “Operation of the present technology will now be described with reference to the block diagram of FIG. 3 and the flowcharts of FIGS. 4-9. Referring to FIG. 4, in step 204, the speech recognition engine 192 determines whether voice is detected from microphone 30, and if so, it is converted into a bit string in step 206. In step 208, the system determines whether the bit string correlates to a stored, predefined navigation command to navigate to a URL. For example, the system may look for the words “Navigate To . . . ,” or “Go To . . . ” Other words may be used for the navigation command in addition to or instead of that phrase in further embodiments.”); 
generating the control instruction based on the second motion keyword and the object keyword (ZEIGLER Par 37 – “If an expression following the navigation command is detected in step 210, the system attempts to resolve the expression into a known URL in step 214.”; Par 44 – “The top URLs engine 196 may additionally or alternatively be used to map a spoken expression to a URL in step 214 of FIG. 4.  … For example, a user seeking the URL: www.nytimes.com may speak the name of that website any of a variety of different ways, including for example:“h t t p colon back slash back slash w w w dot new york times dot com” “h t t p colon slash slash w w w dot n y times dot com” “w w w dot new york times dot com” “w w w dot n y times dot com” “new york times dot com” “new york times” “the new york times” “n y times””; Par 64 – “On the other hand, if a match to the spoken expression is found in step 238 with sufficient confidence, the top URLs engine 196 returns the URL stored in association with the matched expression in step 240. The flow then returns to step 216 (FIG. 4), where the identified URL is shown to the user.”).

One of ordinary skill would have been motivated to include a second action/object keyword for generating a control instruction, in order to provide the best interpretation of the voice input (ZEIGLER Par 3).



Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over ZHANG in view of KLEIN2, and further in view of HUANG (US 2019/0005946 A1).

REGARDING CLAIM 6, ZHANG in view of KLEIN2 discloses the method according to claim 2, wherein the converting the voice data into text data comprises: 
converting the voice data into initial text data (ZHANG Fig. 6 – “After a terminal has been triggered into voice control mode, receiving input voice data and obtaining corresponding speech text according to the voice data 100”);
ZHANG is silent to the rest of the claim limitations.
HUANG discloses a method/system for speech recognition comprising:
converting the voice data into initial text data (HUANG Fig. 2A; Par 43 – “At block 201, speech recognition is performed on acquired speech data to obtain initial text information.”);
adjusting the initial text data by performing semantic analysis on the initial text data (HUANG Par 45 – “Specifically, the text contained in the initial text information can be segmented according to semantic analysis. For example, if the text contained in the initial text information is “I like the Baidu map”, then according to the semantic analysis, the text can be segmented into “I”, “like”, and “the Baidu map”.”), and taking the adjusted initial text data as the text data (HUANG Par 46 – “At block 203, the at least one word is encoded into a dense vector by an encoder in the NMT model and the dense vector is decoded by a decoder in the NMT model so as to obtain the final text recognition result.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of ZHANG in view of KLEIN2 to include performing semantic analysis to obtain the final recognition result, as taught by HUANG.
One of ordinary skill would have been motivated to include performing semantic analysis to obtain the final recognition result, in order to improve the speech recognition accuracy (HUANG Pars 47 and 50).



Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over ZHANG in view of KLEIN2, and further in view of WILLETT (US 2017/0278511 A1).

REGARDING CLAIM 7, ZHANG in view of KLEIN2 discloses the method according to claim 1, further comprising: 
displaying a voice recording pop-up window (ZHANG Fig. 2);
ZHANG is silent to the rest of the claim limitations.
WILLETT discloses a method/system for speech recognition comprising:
displaying a voice recording pop-up window (WILLETT Fig. 1A);
wherein a displaying form of the voice recording pop-up window when the voice data is received (WILLETT Fig. 1B – “Recording”) is different from a displaying form of the voice recording pop-up window when the voice data is not received (WILLETT Fig. 1C – “Hi John, This is Daniel Willette at Nuance.”; Par 8 – “An example screen shot of the initial prompt interface from one such mobile device ASR application, Dragon Dictation for iPhone, is shown … processes unprompted speech inputs and produces representative text output.  FIG. 1B shows a screen shot of the recording interface for Dragon Dictation for iPhone.  FIG. 1C shows an example screen shot of the results interface produced for the ASR results by Dragon Dictation for iPhone.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of ZHANG in view of KLEIN2 to include different forms of user interface, as taught by WILLETT.
One of ordinary skill would have been motivated to include different forms of user interface, in order to provide recognition results to a user (WILLETT Par 8).





Claims 9-11, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over KLEIN2 (US 2018/0113672 A1), and further in view of THANGARATHNAM (US 2019/0179606 A1).

REGARDING CLAIM 9, KLEIN2 discloses a voice control method, comprising:  
automatically triggering a voice receiver for receiving voice data (KLEIN2 Fig. 4 – “Intelligent Personal Assistant 116”; Par 24 – “The IPA 116 includes a speech recognition module 140. Voice commands 142 are spoken into a microphone of the computing device 100. The speech recognition module 140 (or a remote service that it communicates with) uses known speech recognition algorithms and statistical models (e.g., Gaussian Mixture Models and Hidden Markov Models) to convert the voice commands 142 into text.”; Par 56 – “The pressure-triggered input modality described above can complement other aspects of the IPA. For example, when a touch input is determined to satisfy the pressure condition that triggers an interaction with the IPA, the IPA may enter an active-listening mode for speech commands, obviating the need for an interaction with the computing device specifically for the purpose of putting the IPA in a listening mode. In other words, if a touch input's pressure diverts the touch input (or an object selected thereby) to the IPA, the IPA/device can also respond to the touch input or object by beginning to capture audio from the microphone and interpreting any detected voice input command. An interaction such as “share this with my spouse” in combination with a pressure-filtered touch input allows seamless interaction, through the IPA, with an object represented on the display of the relevant computing device.”) in response to recognizing user input performed on a current interaction interface of a touch display screen (KLEIN2 Fig. 5 – “touch input sensed with pressure P”; Par 31 – “FIG. 5 shows a process for using pressure of a touch input 200 to resolve an exophoric phrase 201 of a voice command 202. At step 204 the touch input 200 is sensed with pressure P. The touch input 200 is coincident with a graphic object 206 of the application 190.”; Par 33 – “The step 210 of determining whether the pressure condition is satisfied can be implemented in numerous ways. When the touch input 200 is sensed, pressure can be sensed and associated with the touch input 200 by including a pressure value with the one or more points of the touch input 200.”; Par 28 – “Examples of objects are files, Uniform Resource Identifiers, messages, emails, elements in structured documents (e.g., elements of markup code), contacts, applications, user interface elements (e.g. views, containers, controls, windows) etc. Most objects are exposed to the user by respective graphic objects. The term “graphic object”, as used herein, refers to any discrete graphic element displayed on the display 102 to represent an object.”), wherein the user input is a touch gesture performed in any area of the current interaction interface (KLEIN2 Fig. 2 – “Sensing Surface 122”; Fig. 4; Par 22 – “FIG. 2 shows additional details of the computing device 100. When a physical pointer 120 such as a finger or stylus contacts a sensing surface 122, the sensing surface 122 generates location signals that indicate the locations of the corresponding points of the sensing surface 122 contacted by the physical pointer 120. The sensing surface 122 also generates pressure signals that indicate measures of force applied to the sensing surface 122 by the physical pointer 120.”; Par 29 – “The pressure filter 114 evaluates the pressure properties against a pressure condition 184. If the pressure condition 184 is satisfied, then the corresponding touch input is provided to the IPA 116 but not the application layer 180.”);
directly receiving voice data on the current interaction interface without switching to a specific voice input interface (KLEIN2 Figs. 8-10; Par 54 – “Moreover, the use of a pressure filter or condition to differentiate touch inputs that are intended for the IPA avoids input association and interpretation conflicts with respect to the software managing the underlying user interface comprised of the graphic objects. The underlying ordinary user interface such as a graphical user shell, applications, etc. can continue to function as expected without modification to accommodate the IPA.”); 
determining an object keyword based on the voice data (KLEIN2 Fig. 6; Par 44 – “At step 240, given a set of candidate objects and respective feature sets/vectors, the IPA computes ranking scores for the objects. Ranking might be performed by a machine learning module that takes into account the feature sets as well as other factors such as current context of the relevant voice command, recent context accumulated by the IPA, elements of the voice command that relate to relevance of different features, and so forth. For example, for a command such as “edit that”, the ranking function might have a bias for document-type objects. A command such as “tell me how to get there” might increase the weight of map-related features in feature sets. If a command includes a pluralistic exophora then the ranking function might increase the scores of objects that are close together or share feature values such as a same object type or inclusion within a same container. A clustering algorithm might be incorporated into the ranking process when a pluralistic exophora is present. At the end of step 240, the object or objects with the highest scores are used in place of the exophora in the relevant voice command.”); 
determining an action keyword based on an operation that has a [highest] applicability to be performed on an object indicated by the object keyword (KLEIN2 Par 52 – “For example, if the object is a first type (e.g., a document), then a set of corresponding actions such as “edit”, “email” and “print” might be determined to be relevant. If the object is a media object, then actions such as “play”, “share” or others might be identified. The same technique of exposing information about an object can be triggered when the recent voice commands lack any exophoras. This can allow the user to use the IPA to carry out non-exophoric commands for one purpose while concurrently using the IPA to discover information about objects or candidate actions to perform on objects. Features of the filtered touch inputs can be used to shape the type of information that the IPA seeks. For example, a short pressure dwell might cause the IPA to show potential actions for the object and a long pressure dwell might cause the IPA to show metadata about the object.”); 
generating a control instruction based on the action keyword and the object keyword (KLEIN2 Fig. 5 – “do {action} to that”; Par 32 – “At step 214, the IPA 116 identifies the most probable target graphic. For example, the IPA 116 may select whichever graphic object has the greatest intersection with the touch input 200. Other techniques for step 214 are described below. At step 216, given an identified graphic object 206, the corresponding object 208 is identified. At step 218, the IPA 116 links the exophora 201 to the object 208, thus enabling an action 209 of the voice-inputted command 202 to be carried out for the object 208.”), 
wherein the control instruction is used to control the object indicated by the object keyword (KLEIN2 Fig. 8 – “send that over there” “copy this over here”; Fig. 9 – “Copy there over here” “send them to X”; Par 32 – “At step 218, the IPA 116 links the exophora 201 to the object 208, thus enabling an action 209 of the voice-inputted command 202 to be carried out for the object 208.”).

KLEIN2 does not explicitly teach the [square-bracketed] limitations.
THANGARATHNAM discloses a method/system for voice enabling applications comprising:
determining an object keyword based on the voice data (THANGARATHNAM Par 99 – “The user may provide then provide a voice command that corresponds to selection of one of the hints 502(a)-(e). For example, the user may provide a voice command of “select number 2,” “select 2,” “2,” “select the second one,” and/or so forth. The remote system may be provided with data indicating that hints are being provided to the user, along with data indicating which hints are associated with which objects.”); 
determining an action keyword based on an operation that has a [highest] applicability to be performed on an object indicated by the object keyword (THANGARATHNAM Par 23 – “Additionally, or alternatively, the system may be configured to rank directive data and/or actions in examples where a determined intent corresponds to more than one action to be performed on a given object. For example, the user utterance may represent an intent that may be determined to correspond to more than one action and/or that may correspond to an action that may be performed with respect to multiple objects. In these examples, the directive data and/or actions may be ranked such that an ambiguous utterance may result in highest-ranked directive data being sent to the user device and/or a highest-ranked action being selected. Ranking of directive data and/or actions may be based at least in part on historical use data, the application associated with the displayed content, location of objects with respect to each other as displayed on the user device, categorization of intents, previous voice commands, and/or context information updating, for example.”; Par 136 – “The process 700 may also include determining that the first priority is greater than the second priority and selecting one of the first action or the second action to be performed on an object based at least in part on the priority. For example, a “select” intent may correspond to opening a hyperlink, causing a video to play, causing additional information to be displayed, or other actions.”).

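It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of KLEIN2 to determine the action keyword based on the operation having the highest applicability to the object, as taught by THANGARATHNAM.
One of ordinary skill would have been motivated to select the highest-ranked action for an object, in order to resolve an ambiguous command (THANGARATHNAM Par 23).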


REGARDING CLAIM 10, KLEIN2 in view of THANGARATHNAM discloses the method according to claim 9.
KLEIN2 further discloses wherein the determining an object keyword based on the voice data (KLEIN2 Fig. 6; Par 44 – “At step 240, given a set of candidate objects and respective feature sets/vectors, the IPA computes ranking scores for the objects. Ranking might be performed by a machine learning module that takes into account the feature sets as well as other factors such as current context of the relevant voice command, recent context accumulated by the IPA, elements of the voice command that relate to relevance of different features, and so forth. For example, for a command such as “edit that”, the ranking function might have a bias for document-type objects. A command such as “tell me how to get there” might increase the weight of map-related features in feature sets. If a command includes a pluralistic exophora then the ranking function might increase the scores of objects that are close together or share feature values such as a same object type or inclusion within a same container. A clustering algorithm might be incorporated into the ranking process when a pluralistic exophora is present. At the end of step 240, the object or objects with the highest scores are used in place of the exophora in the relevant voice command.”) comprises: 
converting the voice data into text data (KLEIN2 Par 24 – “The IPA 116 includes a speech recognition module 140. Voice commands 142 are spoken into a microphone of the computing device 100. The speech recognition module 140 (or a remote service that it communicates with) uses known speech recognition algorithms and statistical models (e.g., Gaussian Mixture Models and Hidden Markov Models) to convert the voice commands 142 into text.”); and
determining the object keyword based on the text data (KLEIN2 Par 25 – “The recognized text of a voice command is passed to a command interpretation module 144. The command interpretation module 144 (or a remote service equivalent), sometimes taking into account current context and recent user activities, classifies the converted text of the command as either a query for information or as a directive to perform an action. To help interpret a command and to construct a formal action for the command, the command interpretation module 144 might draw on various local data sources 146 and remote network resources 148. For example, if a command includes a proper noun, a contacts database might be consulted to obtain information about the corresponding person. Machine learning algorithms 150 may be used to infer the user intent and meaning of a command converted by speech recognition.”).

REGARDING CLAIM 11, KLEIN2 in view of THANGARATHNAM discloses the method according to claim 10.
KLEIN2 further discloses wherein the determining the object keyword based on the text data (KLEIN2 Par 25 – “The recognized text of a voice command is passed to a command interpretation module 144. The command interpretation module 144 (or a remote service equivalent), sometimes taking into account current context and recent user activities, classifies the converted text of the command as either a query for information or as a directive to perform an action. To help interpret a command and to construct a formal action for the command, the command interpretation module 144 might draw on various local data sources 146 and remote network resources 148. For example, if a command includes a proper noun, a contacts database might be consulted to obtain information about the corresponding person. Machine learning algorithms 150 may be used to infer the user intent and meaning of a command converted by speech recognition.”) comprises: 
matching the text data with preset instruction type text data (KLEIN2 Par 25 – “For example, if a command includes a proper noun, a contacts database might be consulted to obtain information about the corresponding person. Machine learning algorithms 150 may be used to infer the user intent and meaning of a command converted by speech recognition.”; Par 57 – “Using some of the techniques described above, touch inputs can be used for performing speech recognition in situations where the user doesn't use a phrase like “this.” For example, the user may verbally reference the touched object by its displayed name, which may not otherwise be in the IPA's speech grammar/vocabulary.”); and 
determining the object keyword based on a matching result (KLEIN2 Par 43 – “At step 238, given a set of identified candidate objects, a feature set or feature vector for each candidate object may be constructed. A feature set for an object might include information about the types of the object, times related to accessing or modifying the object, metadata attributes, attributes derived from content of the object, display location, etc. Metadata attributes might be derived from system metadata managed by the operating system, analyzing content of the object (e.g., identities of persons derived from face/voice recognition), or other information associated with the object.”; Par 44 – “At step 240, given a set of candidate objects and respective feature sets/vectors, the IPA computes ranking scores for the objects. Ranking might be performed by a machine learning module that takes into account the feature sets as well as other factors such as current context of the relevant voice command, recent context accumulated by the IPA, elements of the voice command that relate to relevance of different features, and so forth. For example, for a command such as “edit that”, the ranking function might have a bias for document-type objects”).


REGARDING CLAIM 15, KLEIN2 in view of THANGARATHNAM discloses the method according to claim 9.
KLEIN2 further discloses the method/system further comprising: executing the control instruction (KLEIN2 Fig. 5 – “do {action} to that”; Par 32 – “At step 214, the IPA 116 identifies the most probable target graphic. For example, the IPA 116 may select whichever graphic object has the greatest intersection with the touch input 200. Other techniques for step 214 are described below. At step 216, given an identified graphic object 206, the corresponding object 208 is identified. At step 218, the IPA 116 links the exophora 201 to the object 208, thus enabling an action 209 of the voice-inputted command 202 to be carried out for the object 208.”).




Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over KLEIN2 in view of THANGARATHNAM, and further in view of ZEIGLER (US 2015/0143241 A1).

REGARDING CLAIM 12, KLEIN2 in view of THANGARATHNAM discloses the method according to claim 11, wherein the generating a control instruction based on the action keyword and the object keyword (KLEIN2).

ZEIGLER explicitly teaches the rest of the limitations.  ZEIGLER discloses a method/system for controlling a user interface using voice commands comprising: 
matching in the text data with an object keyword in preset instruction type text data to determine a second object keyword (ZEIGLER Par 44 – “The top URLs engine 196 may additionally or alternatively be used to map a spoken expression to a URL in step 214 of FIG. 4.  … For example, a user seeking the URL: www.nytimes.com may speak the name of that website any of a variety of different ways, including for example: “h t t p colon back slash back …” …”);
wherein the second object keyword refers to an object keyword matched in the preset instruction type text data (ZEIGLER Par 54 – “Further details of the top URLs engine 196 will now be explained with reference to FIGS. 7 and 8. In step 236, the top URLs engine 196 may take the spoken expression, which has been resolved into a bit string, and compare it to binary representations of expressions stored in a closed-set grammar for a predefined set of the top, most commonly accessed websites.”); 
determining a second action keyword according to the second object keyword (ZEIGLER Par 42 – “On the other hand, if a match to the spoken expression is found in step 224 with sufficient confidence, the one-to-one lookup engine 194 returns the URL stored in association with the matched expression in step 228. The flow then returns to step 216 (FIG. 4), where the identified URL is shown to the user.”); 
generating the control instruction based on the second action keyword and the second object keyword (ZEIGLER Par 64 – “On the other hand, if a match to the spoken expression is found in step 238 with sufficient confidence, the top URLs engine 196 returns the URL stored in association with the matched expression in step 240. The flow then returns to step 216 (FIG. 4), where the identified URL is shown to the user.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of KLEIN2 in view of THANGARATHNAM to include a second/third action/object keyword for generating a control instruction, as taught by ZEIGLER.
One of ordinary skill would have been motivated to include a second/third action/object keyword for generating a control instruction, in order to provide the best interpretation of the voice input (ZEIGLER Par 3).


Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over KLEIN2 in view of THANGARATHNAM, and further in view of HUANG (US 2019/0005946 A1).

REGARDING CLAIM 13, KLEIN2 in view of THANGARATHNAM discloses the method according to claim 10, wherein the converting the voice data into text data comprises: 
converting the voice data into initial text data (KLEIN2);
KLEIN2 in view of THANGARATHNAM is silent as to the rest of the claim limitations.

HUANG discloses a method/system for speech recognition comprising:
converting the voice data into initial text data (HUANG Fig. 2A; Par 43 – “At block 201, speech recognition is performed on acquired speech data to obtain initial text information.”);
adjusting the initial text data by performing semantic analysis on the initial text data (HUANG Par 45 – “Specifically, the text contained in the initial text information can be segmented according to semantic analysis. For example, if the text contained in the initial text information is “I like the Baidu map”, then according to the semantic analysis, the text can be segmented into “I”, “like”, and “the Baidu map”.”), and taking the adjusted initial text data as the text data (HUANG Par 46 – “At block 203, the at least one word is encoded into a dense vector by an encoder in the NMT model and the dense vector is decoded by a decoder in the NMT model so as to obtain the final text recognition result.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of KLEIN2 in view of THANGARATHNAM to include performing semantic analysis to obtain the final recognition result, as taught by HUANG.


Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over KLEIN2 in view of THANGARATHNAM, and further in view of WILLETT (US 2017/0278511 A1).

REGARDING CLAIM 14, KLEIN2 in view of THANGARATHNAM discloses the method according to claim 9.
KLEIN2 in view of THANGARATHNAM is silent as to the rest of the claim limitations.

WILLETT discloses a method/system for speech recognition comprising:
displaying a voice recording pop-up window (WILLETT Fig. 1A);
wherein a displaying form of the voice recording pop-up window when the voice data is received (WILLETT Fig. 1B – “Recording”) is different from a displaying form of the voice recording pop-up window when the voice data is not received (WILLETT Fig. 1C – “Hi John, This is Daniel Willette at Nuance.”; Par 8 – “An example screen shot of the initial prompt interface from one such mobile device ASR application, Dragon Dictation for iPhone, is shown in FIG. 1A which processes unprompted speech inputs and produces representative text output.  FIG. 1B shows a screen shot of the recording interface for Dragon Dictation for iPhone.  FIG. 1C shows an example screen shot of the results interface produced for the ASR results by Dragon Dictation for iPhone.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of KLEIN2 in view of THANGARATHNAM to include different forms of user interface, as taught by WILLETT.


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN C. KIM, whose telephone number is (571)272-3327.  The examiner can normally be reached Monday through Friday, 9:00 AM to 5:30 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/JONATHAN C KIM/Primary Examiner, Art Unit 2659