DETAILED ACTION
This communication is in response to the Amendments and Arguments filed on 09/21/2021.
Claims 1-3, 5-12, and 14-20 are pending and have been examined.
All previous objections/rejections not mentioned in this Office Action have been withdrawn by the examiner. 
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments filed 09/23/2021 have been fully considered but they are not persuasive. 
Applicant asserts on pages 8 and 9 that Sung does not teach or imply that video obtained from a video conference is used to determine a collection of identified languages. The Examiner respectfully disagrees with this assertion. Sung teaches that languages may be identified in a video conference (see [0072]). Sung does not teach specifically how the video is used. Wang, however, which is cited in combination, further teaches the extraction of features specifically from an image of a user (see [0065]). The remaining arguments regarding the use of extracted features to determine candidate languages have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Please see the new mappings with regard to these limitations.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 5-8, 10, 14-17, 19, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gonzalez-Dominguez et al. (U.S. PG Pub No. 2016/0035344), as found in the IDS, hereinafter Gonzalez-Dominguez, in view of Wang et al. (U.S. PG Pub No. 2018/0314689), hereinafter Wang, in view of Sung et al. (U.S. PG Pub No. 2013/0238336), hereinafter Sung, and further in view of Kobayashi et al. (U.S. PG Pub No. 2019/0103096), hereinafter Kobayashi.

Regarding claims 1 and 10, Gonzalez-Dominguez teaches
(claim 1) A computing system for language-based service hailing (a language identification system, i.e. computing system [0012:1-5]), comprising:
one or more processors (computers suitable for execution of the program include microprocessors or any other kind of processing unit, i.e. one or more processors [0036:1-4]); and
(claim 1) a memory storing instructions that, when executed by the one or more processors, cause the computing system to (the computer includes one or more memory devices, i.e. memory, that stores instructions that the processing unit can receive and execute, i.e. when executed by the one or more processors [0036]):
 (claim 1) obtain a plurality of speech samples, each speech sample comprising one or more words spoken in a language (training sequences, i.e. plurality of speech samples, that represent an utterance for which the spoken language has been identified, i.e. each speech sample comprising one or more words spoken in a language, can have the process performed on them for the purposes of training [0026], which includes the system receiving a sequence of audio frames that collectively represents an utterance, i.e. obtain [0020:1-3]);
(claim 1) train a neural network model with the speech samples to obtain a trained model for determining languages of speeches (the process is performed on training sequences, i.e. speech samples, to train an LSTM neural network, i.e. train a neural network model…to obtain a trained model [0026], where the LSTM neural network output is language scores that are used to classify the utterance as being spoken in a particular language, i.e. trained model for determining languages of speeches [0017-18:3]);
(claim 10) A method for language-based service hailing (a method [0004]), comprising:
obtain a voice input …(the system receives, i.e. obtain, a sequence of audio frames that collectively represent a spoken utterance, i.e. voice input [0012:1-5]);
identify a language corresponding to the voice input based at least on applying the trained model to the voice input …(the system classifies the utterance as being spoken in one language, i.e. identify a language corresponding to the voice, using an LSTM neural network that has been trained, i.e. applying the trained model, to process each audio frame in the sequence, i.e. to the voice input [0012], [0026]).
(claim 10) wherein the trained neural network model has been trained with a plurality of speech samples to determine languages of speeches (the process is performed on training sequences, i.e. plurality of speech samples, to train an LSTM neural network, i.e. trained neural network model has been trained [0026], where the LSTM neural network output is language scores that are used to classify the utterance as being spoken in a particular language, i.e. trained model to determine languages of speeches [0017-18:3]), and each of the speech samples comprises one or more words spoken in a language (training sequences, i.e. plurality of speech samples, that represent an utterance for which the spoken language has been identified, i.e. each speech sample comprising one or more words spoken in a language, can have the process performed on them for the purposes of training [0026]).
While Gonzalez-Dominguez provides for language identification being used in speech-to-text systems, Gonzalez-Dominguez does not specifically teach the use of images, the identification of a set of candidate languages, or the output of a message in an identified language, and thus does not teach
obtain…an image of a user associated with the voice input;
extract features from the image of the user, wherein the features include one or more of: a facial feature of the user, a posture of the user, an outfit of the user, a surrounding environment, conspicuous text, or an insignia worn by the user;
determine a first set of candidate languages based on the extracted features;
identify a language corresponding to the voice input based at least on applying the trained model to the voice input and on the determined first set of candidate languages;
communicate a message in the identified language.  
	Wang, however, teaches
obtain a voice input and an image of a user associated with the voice input (a smartphone that includes a virtual personal assistant can use various sensory input, such as audio input, i.e. voice input, and visual input, which can be an image of the person using the phone, i.e. obtain at least one of an image…of a user associated with the voice input [0085], or a video of the speaker, i.e. video of a user [0065]);
extract features from the image of the user, wherein the features include one or more of: a facial feature of the user, a posture of the user, an outfit of the user, a surrounding environment, conspicuous text, or an insignia worn by the user (the personal assistant can determine information, i.e. extract features, in the video or still images, i.e. from the image of the user, where the information can be facial expressions or iris biometrics, i.e. facial feature of the user, or gestures, i.e. a posture of the user [0065]);
communicate a message in the identified language (where verbal input is provided in a first language [0003:1-5], which is identified as a specific language by a language identification subsystem, i.e. identified language [0367], and a verbal output is determined and output, i.e. communicate a message, in a third language that is the same as the first language, i.e. in the identified language [0005]).  
Gonzalez-Dominguez and Wang are analogous art because they are from a similar field of endeavor in processing input speech to determine the language of the speech. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the identification of the language of a spoken utterance teachings of Gonzalez-Dominguez with the use of information from visual input and verbal output being in the identified spoken language as taught by Wang. The motivation to do so would have been to achieve a predictable result of enabling the system to choose the appropriate language-specific ASR/NLP components for the virtual assistant to use when interacting with a speaker (Wang [0368]).
While Gonzalez-Dominguez in view of Wang provides use of visual input, such as video, by a personal assistant, Gonzalez-Dominguez in view of Wang does not specifically teach that the video is used in the identification of languages spoken, and thus does not teach
determine a first set of candidate languages based on the extracted features;
identify a language corresponding to the voice input based at least on applying the trained model to the voice input and on the determined first set of candidate languages.
Sung, however, teaches determine a first set of candidate languages based on the --video-- (recognition candidates, i.e. candidate languages, may be determined as a collection, i.e. determine a first set, based on input from a video conference, i.e. based on the at least one…video [0031], [0072]);
identify a language corresponding to the voice input based at least on applying the trained model to the voice input and on the determined first set of candidate languages (a language identifier module processes incoming speech audio with a model, i.e. applying the trained model to the voice input, to identify the language in which the audio is spoken, i.e. identify the language corresponding to the voice input [0040], where the language identifier module can select which of the recognition candidates provided by the language recognizer components to use as the output language, and the recognition candidates resulting from the language recognizer components are given weightings from the video conferencing, i.e. on the determined first set of candidate languages [0042], [0072]).
Gonzalez-Dominguez, Wang, and Sung are analogous art because they are from a similar field of endeavor in processing input speech in different languages. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the use of a visual input by a dialog assistant teachings of Gonzalez-Dominguez, as modified by Wang, with the use of information from a video conference to identify the language of the incoming speech audio as 
 While Gonzalez-Dominguez in view of Wang and Sung provides the determination of language recognition candidates from a video, Gonzalez-Dominguez in view of Wang and Sung does not specifically teach that the determination is based on extracted features, and thus does not teach
determine a first set of candidate languages based on the extracted features.
Kobayashi, however, teaches determine a first set of candidate languages based on the extracted features (the facial feature information of the person from an image, i.e. extracted features, is used to determine a race of the person, and a plurality of available language candidates is identified based on the face information, where each race has associated candidate languages, i.e. determine a first set of candidate languages [0031:1-11],[0032-3]).
Gonzalez-Dominguez, Wang, Sung, and Kobayashi are analogous art because they are from a similar field of endeavor in identifying the language of a user in order to output information in the language of the user. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the determination of language recognition candidates from a video teachings of Gonzalez-Dominguez, as modified by Wang and Sung, with the identification of candidate languages based on facial information from an image of the user as taught by Kobayashi. The motivation to do so would have been to achieve a predictable result 

Regarding claims 5 and 14, Gonzalez-Dominguez in view of Wang, Sung, and Kobayashi teaches claims 1 and 10, and Wang further teaches 
obtain a location of a user associated with the voice input (the system can receive and use a number of different forms of input, i.e. obtain, such as geographic location [0435], and where the location can be used by the assistant during dialog with a user, i.e. user associated with the voice input [0441-2]).
And Sung further teaches
determine a second set of candidate languages based on the location (recognition candidates, i.e. candidate languages, may be partly selected, i.e. determine a second set, based on location information, i.e. based on the location [0031], [0059]); and
to identify the language corresponding to the voice input based at least on applying the trained ((claim 14) neural network) model to the voice input and on the determined first set of candidate languages, the computing system is caused to identify the language corresponding to the voice input based at least on applying the trained ((claim 14) neural network) model to the voice input, on the determined first set of candidate languages, and on the determined second set of candidate languages (a language identifier module of the language identification system, i.e. computing system [0012:1-5], processes incoming speech audio with a model, i.e. applying the trained…model to the voice input, to identify the language in .  
Where Gonzalez-Dominguez teaches that the trained model is a neural network [0026]).
Where the motivation to combine is the same as previously presented.

Regarding claims 6 and 15, Gonzalez-Dominguez in view of Wang, Sung, and Kobayashi teaches claims 1 and 10, and Wang further teaches
(claim 6) the system is coupled to a computing device (the virtual personal assistant is integrated into a device or system, i.e. system, such as a computer or smartphone, i.e. computing device [0083]); and
(claim 15) obtaining, by a system coupled to a computing device, the voice input (the system receives, i.e. obtain, a sequence of audio frames that collectively represent a spoken utterance, i.e. voice input [0012:1-5], and the virtual personal assistant is integrated into a device or system, i.e. system, such as a computer or smartphone, i.e. computing device [0083]); and
the computing device comprises a microphone configured to receive the voice input and transmit the received voice input to the (claim 6) one or more processors/(claim 15) system (the device that includes the virtual personal assistant, i.e. computing device, may have a microphone to capture the audio input, i.e. microphone configured to receive the voice input [0085], and provide it to the virtual personal assistant system, i.e. transmit the received voice input [0086:1-4], the functionality of which is executed by the device processors, i.e. to the one or more processors [0542]).
Where the motivation to combine is the same as previously presented.

Regarding claims 7 and 16, Gonzalez-Dominguez in view of Wang, Sung, and Kobayashi teaches claims 6 and 15, and Wang further teaches
the message comprises at least one of a voice or a text (the output generation component can create responses, i.e. message, which can be either a textual response displayed on a screen, i.e. a text, or a vocalized response, i.e. a voice [0005], [0116]); and
to communicate the message in the identified language, the instructions cause the system to perform at least one of playing the message via the computing device or identifying a person at least knowing the identified language to play the message (when the verbal output is determined, i.e. communicate a message, in a third language that is the same as the first language, i.e. in the identified language [0003:1-5],[0005],[0367], the text-to-speech component can convert the text output into audio output that can be provided through the output device of the user .  
Where the motivation to combine is the same as previously presented.

Regarding claims 8 and 17, Gonzalez-Dominguez in view of Wang, Sung, and Kobayashi teaches claims 6 and 15, and Wang further teaches
the computing device is a mobile phone associated with the user (an automobile may include a dock for a mobile device, i.e. computing device, as either a physical or wireless connection, where the driver is able to make calls by the mobile device, i.e. mobile phone associated with a user [0524]);
the system is a vehicle information platform (when a mobile device is connected to the automobile, i.e. system, the automobile is treated as an extension of the mobile device, where the mobile device may be able to obtain information about the automobile or control some systems of the automobile, i.e. vehicle information platform [0524]); and
the message is associated with a vehicle for servicing the user (the virtual personal assistant may provide information, i.e. message, to the driver, i.e. user, in response to a driver question, where the question is about the status of the vehicle, such as a light that is lit on the dashboard, i.e. associated with a vehicle for servicing the user Fig. 36, [0526]).  
Where the motivation to combine is the same as previously presented.



Regarding claim 19, Gonzalez-Dominguez teaches
A non-transitory computer-readable medium for language-based vehicle hailing, comprising instructions stored therein, wherein the instructions, when executed by one or more processors, cause the one or more processors to perform the steps of (computer readable media, i.e. non-transitory computer-readable medium, suitable for storing computer program instructions, i.e. comprising instructions stored therein [0037], where a processing unit, i.e. one or more processors, executes instructions stored on the memory devices, i.e. instructions…executed by one or more processors [0036]):
obtaining a plurality of speech samples, each speech sample comprising one or more words spoken in a language (training sequences, i.e. plurality of speech samples, that represent an utterance for which the spoken language has been identified, i.e. each speech sample comprising one or more words spoken in a language, can have the process performed on them for the purposes of training [0026], which includes the system receiving a sequence of audio frames that collectively represents an utterance, i.e. obtain [0020:1-3]);
training a neural network model with the speech samples to obtain a trained model for determining languages of speeches (the process is performed on training sequences, i.e. speech samples, to train an LSTM neural network, i.e. train a neural network model…to obtain a trained model [0026], where the LSTM neural network output is language scores that are used to classify the utterance as being spoken in a particular language, i.e. trained model for determining languages of speeches [0017-18:3]);
obtaining a voice input…(the system receives, i.e. obtain, a sequence of audio frames that collectively represent a spoken utterance, i.e. voice input [0012:1-5]);
identifying a language corresponding to the voice based at least on applying the trained model to the voice input…(the system classifies the utterance as being spoken in one language, i.e. identify a language corresponding to the voice, using an LSTM neural network that has been trained, i.e. applying the trained model, to process each audio frame in the sequence, i.e. to the voice input [0012], [0026]).  
 While Gonzalez-Dominguez provides the input of audio data representing an utterance, Gonzalez-Dominguez does not specifically teach the receipt of visual or location data of a user, and thus does not teach
obtaining…at least one of an image or a video of a user associated with the voice input, and a location of the user;
extracting features from the image or the video of the user, wherein the features include one or more of: a facial feature of the user, a posture of the user, an outfit of the user, a surrounding environment, conspicuous text, or an insignia worn by the user;
determining a first set of candidate languages based on at least the extracted features, and a second set of candidate languages based on the location; and
 identifying a language corresponding to the voice input based at least on applying the trained model to the voice input, on the first set of candidate languages, and on the second set of candidate languages.  
Wang, however, teaches
obtaining a voice input, at least one of an image or a video of a user associated with the voice input, and a location of the user (a smartphone that includes a virtual personal assistant can use various sensory input, such as audio input, i.e. voice input, and visual input, which can be an image of the person using the phone, i.e. obtain at least one of an image…of a user associated with the voice input [0085], or a video of the speaker, i.e. video of a user [0065], and the system can receive and use geographic location [0435], where the location can be used by the assistant during dialog with a user, i.e. user associated with the voice input [0441-2]);
extract features from the image of the user, wherein the features include one or more of: a facial feature of the user, a posture of the user, an outfit of the user, a surrounding environment, conspicuous text, or an insignia worn by the user (the personal assistant can determine information, i.e. extract features, in the video or still images, i.e. from the image of the user, where the information can be facial expressions or iris biometrics, i.e. facial feature of the user, or gestures, i.e. a posture of the user [0065]);
Gonzalez-Dominguez and Wang are analogous art because they are from a similar field of endeavor in processing input speech to determine the language of the speech. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the identification of the language of a spoken utterance teachings of Gonzalez-Dominguez with the use of information from visual or location data of a user as taught by Wang. The motivation to do so would have been to achieve a predictable result of enabling the system to use various input to 
While Gonzalez-Dominguez in view of Wang provides use of visual input, such as video, and location information by a personal assistant, Gonzalez-Dominguez in view of Wang does not specifically teach that the video or location are used in the identification of languages spoken, and thus does not teach
determining a first set of candidate languages based on at least the extracted features, and a second set of candidate languages based on the location; and
 identifying a language corresponding to the voice input based at least on applying the trained model to the voice input, on the first set of candidate languages, and on the second set of candidate languages.  
Sung, however, teaches determining a first set of candidate languages based on at least --the video--, and a second set of candidate languages based on the location (recognition candidates, i.e. candidate languages, may be determined as a collection, i.e. determine a first set, based on input from a video conference, i.e. based on the at least one…video [0031], [0072], and where recognition candidates may be further weighted for selection, i.e. determine a second set, based on location information, i.e. based on the location [0031], [0059]); and
 identifying a language corresponding to the voice input based at least on applying the trained model to the voice input, on the first set of candidate languages, and on the second set of candidate languages (a language identifier .  
Gonzalez-Dominguez, Wang, and Sung are analogous art because they are from a similar field of endeavor in processing input speech in different languages. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the use of visual input and geographic location by a dialog assistant teachings of Gonzalez-Dominguez, as modified by Wang, with the use of information from a video conference and a location to identify the language of the incoming speech audio as taught by Sung. The motivation to do so would have been to achieve a predictable result of enabling the facilitation of translation between multiple languages being spoken in a video conference, where location of the speaker can assist in determining the languages spoken in the audio (Sung [0059], [0072]).
While Gonzalez-Dominguez in view of Wang and Sung provides the determination of language recognition candidates from a video, Gonzalez-Dominguez in view of Wang and Sung does not specifically teach that the determination is based on extracted features, and thus does not teach
determine a first set of candidate languages based on at least the extracted features...
Kobayashi, however, teaches determine a first set of candidate languages based on at least the extracted features (the facial feature information of the person from an image, i.e. extracted features, is used to determine a race of the person, and a plurality of available language candidates is identified based on the face information, where each race has associated candidate languages, i.e. determine a first set of candidate languages [0031:1-11],[0032-3]).
Gonzalez-Dominguez, Wang, Sung, and Kobayashi are analogous art because they are from a similar field of endeavor in identifying the language of a user in order to output information in the language of the user. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the determination of language recognition candidates from a video teachings of Gonzalez-Dominguez, as modified by Wang and Sung, with the identification of candidate languages based on facial information from an image of the user as taught by Kobayashi. The motivation to do so would have been to achieve a predictable result of identifying candidate languages from which the final output language that the user understands can be determined (Kobayashi [0036-8]).

Regarding claim 20, Gonzalez-Dominguez in view of Wang, Sung, and Kobayashi teaches claim 19, and Wang further teaches
the voice input comprises a request for a vehicle service (the driver asks the virtual personal assistant for information regarding the vehicle, such as a verbal request, 
Where the motivation to combine is the same as previously presented.

Claim(s) 2 and 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gonzalez-Dominguez, in view of Wang, in view of Sung, in view of Kobayashi, and further in view of Yu (U.S. PG Pub No. 2009/0258333), hereinafter Yu.

Regarding claims 2 and 11, Gonzalez-Dominguez in view of Wang, Sung, and Kobayashi teaches claims 1 and 10, and Gonzalez-Dominguez further teaches
convert the speech samples to --features-- and train the neural network model with the --features-- (a sequence of audio frames is received, where the audio frames can each be 39-dimensional perceptual linear predictive features calculated at respective time steps in the utterance, i.e. convert the speech samples to --features-- [0020], and the process is performed on training sequences, i.e. speech samples, to train an LSTM neural network, i.e. train a neural network model…with the --features--  [0026]).  
While Gonzalez-Dominguez in view of Wang, Sung, and Kobayashi provides the training of a neural network using PLP features calculated for audio frames, Gonzalez-Dominguez in view of Wang, Sung, and Kobayashi does not specifically teach that the PLP features are frequency-related, and thus does not teach
convert the speech samples to spectrograms.
Yu, however, teaches convert the speech samples to spectrograms (speech signals are converted into streams of acoustic feature vectors, i.e. convert the speech samples [0165:1-3], where cepstral features reflect the spectrum of phones in user’s speech, and where PLP features are a particular kind of cepstral feature used in speech recognition, i.e. spectrograms [0162]).
Gonzalez-Dominguez, Wang, Sung, Kobayashi, and Yu are analogous art because they are from a similar field of endeavor in processing input speech. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the training of a neural network using PLP features calculated for audio frames teachings of Gonzalez-Dominguez, as modified by Wang, Sung, and Kobayashi, with the specific teaching that PLP features are a particular kind of cepstral feature reflecting the spectrum of phones in a user’s speech as taught by Yu. The motivation to do so would have been to achieve a predictable result of enabling a speech recognition system to process user speech (Yu [0165]).

Claim(s) 3 and 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gonzalez-Dominguez, in view of Wang, in view of Sung, in view of Kobayashi, and further in view of Trong et al. (Deep Language: a comprehensive deep learning approach to end-to-end language recognition, Jun 2016), as found in the IDS, hereinafter Trong.


 
While Gonzalez-Dominguez in view of Wang, Sung, and Kobayashi provides the use of PLP features for the audio data to feed into an LSTM neural network with multiple layers, Gonzalez-Dominguez in view of Wang, Sung, and Kobayashi does not specifically teach that the neural network model includes a convoluted neural network and a gated recurrent unit, and thus does not teach
the neural network model comprises a convoluted neural network (CNN) configured to transform the voice input by multiple layers to determine its language, and comprises one or more Gated Recurrent Units (GRUs) applied to each channel output of the CNN.  
Trong, however, teaches the neural network model comprises a convoluted neural network (CNN) configured to transform the voice input by multiple layers to determine its language (the first two layers of a neural network for identifying the language spoken in an utterance, i.e. neural network model…to determine its language (Intro, p.1, l.1-2), are convolutional neural network layers, i.e. comprises a convoluted neural network (CNN), to extract spectral information from the input, i.e. configured to transform the voice input by multiple layers (Sec. 4.3, p.1-2)), and comprises one or more Gated Recurrent Units (GRUs) applied to each channel output of the CNN (the output of the 2 CNN layers, i.e. each channel output of the CNN, is fed to the next 2 GRU layers, i.e. comprises one or more Gated Recurrent Units (GRUs) applied to each channel).  

Claim(s) 9 and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Gonzalez-Dominguez, in view of Wang, in view of Sung, in view of Kobayashi, and further in view of Moreno et al. (U.S. Patent No. 10403291), hereinafter Moreno.

Regarding claims 9 and 18, Gonzalez-Dominguez in view of Wang, Sung, and Kobayashi teaches claims 1 and 10, and Wang further teaches
the one or more words comprise one or more phrases for starting a phone call conversation (audio input can include words, i.e. one or more words, such as the phrase “please call John.”, i.e. comprise one or more phrases for starting a phone call conversation [0199]).  

Gonzalez-Dominguez in view of Wang, Sung, and Kobayashi, however, does not specifically teach
the one or more phrases comprise "hi" in various languages.
Moreno, however, teaches the one or more phrases comprise "hi" in various languages (the speaker recognition model can be configured to use the same hotwords in multiple different languages, i.e. various languages (5:53-64), where an example hotword phrase is “Ni hao Android”, i.e. comprise “hi” (5:25-28)).
Gonzalez-Dominguez, Wang, Sung, Kobayashi, and Moreno are analogous art because they are from a similar field of endeavor in processing input speech in different languages. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the interaction of a user with a virtual assistant using words and phrases teachings of Gonzalez-Dominguez, as modified by Wang, Sung, and Kobayashi, with the use of the same hotword phrases in different languages that can include “hello” as taught by Moreno. The motivation to do so would have been to achieve a predictable result of configuring a speaker recognition model to use the same predetermined hotword in different languages and accents to fine-tune the verification process (Moreno (5:53-64)).



Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICOLE A K SCHMIEDER whose telephone number is (571)270-1474. The examiner can normally be reached 8:00 - 5:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at (571) 272-7799. The fax phone 
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/NICOLE A K SCHMIEDER/ Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/ Supervisory Patent Examiner, Art Unit 2659