Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Preliminary amendment
2.	In a preliminary amendment, claims 3, 21-27, and 29-32 are canceled; and claims 4-8, 14, and 16-18 are amended.  The pending claims are 1, 2, 4-20, and 28.
Information Disclosure Statement
3.	The submitted information disclosure statement (IDS) complies with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Claim Objections
4.	Claims 9, 10, and 19 are objected to because of the following informalities:
	Claim 9 recites: “wherein the one or more conditions include at least one of: the client device is charging, the client device has at least a threshold state of charge, or the client device is not being carried by a user.”
	The claim is interpreted as “wherein the one or more conditions include at least one of: the client device is charging, the client device has at least a threshold state of charge, and the client device is not being carried by a user.”
	Claim 10 recites: “wherein the one or more conditions include two or more of: the client device is charging, the client device has at least a threshold state of charge, or the client device is not being carried by a user.”
	The claim is interpreted as “wherein the one or more conditions include two or more of: the client device is charging, the client device has at least a threshold state of charge, and the client device is not being carried by a user.”
Claim 19 is missing the word “and” at the end of line 3.
Appropriate correction is required.


Claim Rejections - 35 USC § 103
5.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 4-7, 11-20, and 28 are rejected under 35 U.S.C. 103 as being unpatentable over Grost (US 2017/0069311) in view of Song (US 2006/0136205).
As per claim 1, Grost teaches a method performed by one or more processors of a client device (Fig. 1), the method comprising: 
identifying a textual segment stored locally at the client device ([0030], identifying stored text from the text source 212);
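generating synthesized speech audio data that includes synthesized speech of the identified textual segment, wherein generating the synthesized speech audio data comprises processing the textual segment using a speech synthesis model stored locally at the client device ([0030], receiving text from the text source 212 and converting the text into speech units);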
processing, using an end-to-end speech recognition model stored locally at the client device, the synthesized audio data to generate a predicted textual segment ([0065], wherein the speech recognition generates a phonetic interpretation of the identified text);
generating a gradient based on comparing the predicted textual segment to the textual segment (necessarily disclosed by [0065] and [0066], wherein the phonetic interpretation is compared against the confirmed phonetic transcription to update the TTS system and ASR system); and
updating one or more weights of the end-to-end speech recognition model based on the generated gradient (see [0066], wherein the TTS system and ASR system are updated based on the comparison result).
Grost may not explicitly disclose the exact language of using a gradient.  However, Song in the same field of endeavor teaches a speech recognition system, wherein a gradient is generated based on comparing the predicted output to ground truth output and updating one or more weights of the speech recognition model based on the generated gradient (Figs. 3, 4, and [0055]-[0070]).
Therefore, it would have been obvious at the time the application was filed to use Song’s features of generating a gradient based on comparing the predicted output to ground truth output and updating one or more weights of the speech recognition model based on the generated gradient with the speech recognition system of Grost, in order to improve accuracy without increasing computation cost (Song, [0008]).
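For illustration only, the sequence of steps addressed above for claim 1 (synthesizing speech from a locally stored textual segment, recognizing it, comparing the prediction to the textual segment, and updating weights from the resulting gradient) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Grost, Song, or the application: the three-letter alphabet, the one-vector-per-character stand-in for a speech synthesis model, the linear softmax stand-in for a recognition model, the cross-entropy comparison, and every function name are hypothetical.

```python
import math

VOCAB = "abc"  # hypothetical toy alphabet (assumption)

def synthesize(text):
    """Hypothetical stand-in for a speech synthesis model: one feature vector per character."""
    return [[1.0 if i == VOCAB.index(ch) else 0.1 for i in range(len(VOCAB))]
            for ch in text]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def recognize(weights, frames):
    """Hypothetical stand-in for a recognition model: per-frame probabilities over VOCAB."""
    return [softmax([sum(w * x for w, x in zip(row, frame)) for row in weights])
            for frame in frames]

def decode(probs):
    """Predicted textual segment: most probable character for each frame."""
    return "".join(VOCAB[p.index(max(p))] for p in probs)

def loss_and_gradient(weights, frames, text):
    """Cross-entropy between the prediction and the textual segment, with its gradient."""
    n = len(VOCAB)
    grad = [[0.0] * n for _ in range(n)]
    loss = 0.0
    for frame, p, ch in zip(frames, recognize(weights, frames), text):
        target = VOCAB.index(ch)
        loss -= math.log(p[target])
        for k in range(n):
            err = p[k] - (1.0 if k == target else 0.0)
            for j in range(n):
                grad[k][j] += err * frame[j]
    return loss, grad

def update(weights, grad, lr=0.5):
    """One gradient-descent step on the recognition weights."""
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(weights, grad)]

# Usage: one on-device training step for a single textual segment.
text = "cab"
frames = synthesize(text)
weights = [[0.0] * len(VOCAB) for _ in VOCAB]
loss_before, grad = loss_and_gradient(weights, frames, text)
weights = update(weights, grad)
loss_after, _ = loss_and_gradient(weights, frames, text)
```

In this sketch the textual segment itself serves as the ground truth, so no human-labeled audio is needed; that property is what the comparison step relies on.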
As per claim 2, Grost teaches transmitting, over a network to a remote system, the generated gradient without transmitting any of: the textual segment, the synthesized speech audio data, and the predicted textual segment; wherein the remote system utilizes the generated gradient, and additional gradients from additional client devices, to update global weights of a global end-to-end speech recognition model (Grost, [0029], [0045], stating that some or all of the TTS and ASR systems can be resident on, and processed using, the telematics unit 30 of FIG. 1, and that, according to an alternative illustrative embodiment, some or all of the TTS and ASR systems can be resident on, and processed using, computing equipment in a location remote from the vehicle 12. See also [0067], stating that results generated by the TTS system 210 and the ASR system 310 can be compared against remotely-located speech recognition systems that are accessible by the vehicle telematics unit 30).
As per claim 4, Grost teaches receiving, at the client device and from the remote system, the global end-to-end speech recognition model, wherein receiving the global end-to-end speech recognition model is subsequent to the remote system updating the global weights of the global end-to-end speech recognition model based on the gradient and the additional gradients; and responsive to receiving the global speech recognition model, replacing in local storage of the client device the end-to-end speech recognition model with the global speech recognition model ([0045], [0066]-[0067], wherein the TTS and ASR models are adapted based on results generated remotely by external devices).

As per claim 5, Grost teaches receiving, at the client device and from the remote system, the updated global weights, wherein receiving the updated global weights is subsequent to the remote system updating the global weights of the global end-to-end speech recognition model based on the gradient and the additional gradients; and responsive to receiving the updated global weights, replacing in local storage of the client device weights of the end-to-end speech recognition model with the updated global weights (by updating the TTS and ASR models as in [0045], [0066]-[0067], the corresponding weights are automatically updated).
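For illustration only, the arrangement addressed above for claims 2, 4, and 5 (a remote system updating global weights from gradients received from several client devices, and a client device then replacing its local weights with the updated global weights) can be sketched as follows. This is a minimal sketch under stated assumptions: plain averaging of the client gradients and a fixed learning rate are assumed for simplicity and are not stated in Grost, Song, or the application, and every name is hypothetical.

```python
def average_gradients(client_gradients):
    """Element-wise mean of the gradients received from the client devices."""
    n = len(client_gradients)
    return [sum(values) / n for values in zip(*client_gradients)]

def update_global_weights(global_weights, client_gradients, lr=0.1):
    """Remote system: one gradient-descent step using the averaged client gradients."""
    avg = average_gradients(client_gradients)
    return [w - lr * g for w, g in zip(global_weights, avg)]

class ClientDevice:
    """Hypothetical client holding a local copy of the recognition weights."""

    def __init__(self, weights):
        self.weights = list(weights)

    def replace_weights(self, updated_global_weights):
        # Replace the locally stored weights with the updated global weights.
        self.weights = list(updated_global_weights)

# Usage: only gradients travel to the remote system, never text or audio.
global_weights = [1.0, 2.0]
gradients_from_clients = [[1.0, 0.0], [3.0, 2.0]]
global_weights = update_global_weights(global_weights, gradients_from_clients, lr=0.5)
device = ClientDevice([1.0, 2.0])
device.replace_weights(global_weights)
```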
As per claim 6, Grost teaches wherein the textual segment is identified from a contacts list, a media playlist, a list of aliases of linked smart devices, or from typed input received at the client device ([0006], text representing a phone book entry).
As per claim 7, Grost teaches wherein the textual segment is identified based on the textual segment being newly added as an alias for a contact or as an alias for a linked smart device ([0005]).
As per claim 11, Grost teaches identifying the textual segment based on: determining that a prior human utterance, detected via one or more microphones, included the textual segment; and determining that a prior speech recognition of the prior human utterance, performed using the end-to-end speech recognition model, failed to correctly recognize the textual segment ([0064], identifying that the ASR system failed to return the proper recognition result).
As per claim 12, Grost teaches wherein determining that the prior speech recognition failed to correctly recognize the textual segment is based on received user input that cancels an action predicted based on the prior speech recognition, and wherein determining that the prior human utterance included the textual segment is based on additional received user input received ([0058]-[0059], if the ASR system failed to return the proper transcription result, the ASR system prompts the user to pronounce the name again, repeating steps 405-420).
As per claim 13, Grost teaches wherein the additional received user input comprises input of the textual segment ([0059], a new phonetic transcription is offered without asking the user to pronounce the name again).
As per claim 14, Grost teaches wherein generating the synthesized speech audio data that includes synthesized speech of the identified textual segment further comprises: determining an additional textual segment; and wherein generating the synthesized speech audio data comprises processing the textual segment, along with the additional textual segment, using the speech synthesis model ([0031], The text source 212 can include words, numbers, symbols, and/or punctuation to be synthesized into speech and for output to the text converter 214. Any suitable quantity and type of text sources can be used).
As per claim 15, Grost teaches wherein determining the additional textual segment is based on a defined relationship of the additional textual segment to a particular corpus from which the textual segment is identified (necessarily disclosed within the process of synthesizing sentences, words, numbers, symbols, and/or punctuation into speech as in [0030]-[0031], [0033], [0044]).
As per claim 16, Grost teaches wherein processing the textual segment using the speech synthesis model comprises processing a sequence of phonemes determined to correspond to the textual segment ([0059], a new phonetic transcription is generated).
As per claim 17, Grost teaches wherein the speech synthesis model is one of a plurality of candidate speech synthesis models for a given language, and is locally stored at the client ([0003], [0066], storing speech synthesis models for a given language according to regional accents, dialects, slang, silent letters, unfamiliar consonant blends, ethnicity of the name, or any other variations in name pronunciation).
As per claim 18, Grost teaches prior to generating the synthesized speech audio data: identifying prior audio data that is detected via one or more microphones of the client device and that captures a prior human utterance; identifying a ground truth transcription for the prior human utterance; processing the ground truth transcription using the speech synthesis model to generate prior synthesized speech audio data; generating a gradient based on comparing the prior synthesized speech audio data to the prior audio data; and updating one or more weights of the speech synthesis model based on the gradient ([0056], receiving speech input containing a name from a user or vehicle occupant at an ASR system; [0058]-[0060] generating transcription for the human utterance; and Fig. 4 and [0065]-[0069], wherein the TTS and STT models are adapted by adding confirmed recognition and synthesis results from previous interactions and used to recognize subsequent spoken utterances and generate subsequent synthesized speech).
Grost may not explicitly disclose the exact language of using a gradient.  However, Song in the same field of endeavor teaches a speech recognition system, wherein a gradient is generated based on comparing the predicted output to ground truth output and updating one or more weights of the speech recognition model based on the generated gradient (Figs. 3, 4, and [0055]-[0070]).  Therefore, it would have been obvious at the time the application was filed to use Song’s features of generating a gradient based on comparing the predicted output to ground truth output and updating one or more weights of the speech recognition model based on the generated gradient with the speech recognition system of Grost, in order to improve accuracy without increasing computation cost (Song, [0008]).

As per claim 19, Grost teaches wherein identifying the ground truth transcription comprises: generating a transcription using the speech recognition model ([0030], converting the selected units of speech into audio signals and audible speech.  See also [0065], wherein the speech recognition generates a phonetic interpretation of the identified text); identifying the transcription as the ground truth transcription based on a confidence measure in generating the transcription and/or based on a user action performed responsive to the transcription (confidence scores, [0036], [0050], [0055], [0059], [0062], [0067]).
As per claim 20, Grost teaches a method performed by one or more processors of a client device, the method comprising: identifying a textual segment stored locally at the client device ([0030], identifying stored text from the text source 212);
generating synthesized speech audio data that includes synthesized speech of the identified textual segment, wherein generating the synthesized speech audio data comprises processing the textual segment using a speech synthesis model stored locally at the client device ([0030], receiving text from the text source 212 and converting the text into speech units);
processing, using an end-to-end speech recognition model stored locally at the client device, the synthesized audio data to generate a predicted textual segment ([0030], converting the selected units of speech into audio signals and audible speech.  See also [0065], wherein the speech recognition generates a phonetic interpretation of the identified text);
generating a gradient based on comparing the predicted textual segment to the textual segment; and transmitting, over a network to a remote system, the generated gradient without transmitting any of: the textual segment, the synthesized speech audio data, and the predicted textual segment (necessarily disclosed by [0065] and [0066], wherein the phonetic interpretation is compared against the confirmed phonetic transcription to update the TTS system and ASR system; [0066], wherein the TTS system and ASR system are updated based on the comparison result.  See also [0029], [0045], stating that some or all of the TTS and ASR systems can be resident on, and processed using, the telematics unit 30 of FIG. 1, and that, according to an alternative illustrative embodiment, some or all of the TTS and ASR systems can be resident on, and processed using, computing equipment in a location remote from the vehicle 12. See also [0067], stating that results generated by the TTS system 210 and the ASR system 310 can be compared against remotely-located speech recognition systems that are accessible by the vehicle telematics unit 30).
Grost may not explicitly disclose the exact language of using a gradient.  However, Song in the same field of endeavor teaches a speech recognition system, wherein a gradient is generated based on comparing the predicted output to ground truth output and updating one or more weights of the speech recognition model based on the generated gradient (Figs. 3, 4, and [0055]-[0070]).  Therefore, it would have been obvious at the time the application was filed to use Song’s features of generating a gradient based on comparing the predicted output to ground truth output and updating one or more weights of the speech recognition model based on the generated gradient with the speech recognition system of Grost, in order to improve accuracy without increasing computation cost (Song, [0008]).

As per claim 28, Grost teaches a method performed by one or more processors of a client device (Fig. 1), the method comprising: 
identifying a textual segment stored locally at the client device ([0030], identifying stored text from the text source 212);
generating synthesized speech audio data that includes synthesized speech of the identified textual segment, wherein generating the synthesized speech audio data comprises processing the textual segment using a speech synthesis model stored locally at the client device ([0030], receiving text from the text source 212 and converting the text into speech units);
processing, using a recognition model stored locally at the client device, the synthesized speech audio data to generate predicted output ([0030], converting the selected units of speech into audio signals and audible speech.  See also [0065], wherein the speech recognition generates a phonetic interpretation of the identified text);
generating a gradient based on comparing the predicted output to ground truth output that corresponds to the textual segment (necessarily disclosed by [0065] and [0066], wherein the phonetic interpretation is compared against the confirmed phonetic transcription to update the TTS system and ASR system); and
updating one or more weights of the speech recognition model based on the generated gradient (see [0066], wherein the TTS system and ASR system are updated based on the comparison result).
Grost may not explicitly disclose the exact language of using a gradient.  However, Song in the same field of endeavor teaches a speech recognition system, wherein a gradient is generated based on comparing the predicted output to ground truth output and updating one or more weights of the speech recognition model based on the generated gradient (Figs. 3, 4, and [0055]-[0070]).
Therefore, it would have been obvious at the time the application was filed to use Song’s features of generating a gradient based on comparing the predicted output to ground truth output and updating one or more weights of the speech recognition model based on the generated gradient with the speech recognition system of Grost, in order to improve accuracy without increasing computation cost (Song, [0008]).

Claims 8-10 are rejected under 35 U.S.C. 103 as being unpatentable over Grost (US 2017/0069311) in view of Song (US 2006/0136205), and further in view of Naik (US 9,697,822).
As per claim 8, Grost in view of Song does not explicitly disclose determining, based on sensor data from one or more sensors of the client device, that a current state of the client device satisfies one or more conditions; wherein generating the synthesized speech audio data, and/or processing the synthesized speech audio data to generate the predicted textual segment, and/or generating the gradient, and/or updating the one or more weights are performed responsive to determining that the current state of the client device satisfies the one or more conditions.
Naik in the same field of endeavor teaches a speech recognition system (Abstract) that performs speech recognition based on determining, based on sensor data from one or more sensors of the client device, that a current state of the client device satisfies one or more conditions (col. 22, lines 56-65; col. 24, lines 3-29; and col. 26, lines 3-20).  Therefore, it would have been obvious at the time the application was filed to use Naik’s sensors with the system of Grost in view of Song, in order to help determine whether and how to operate the voice trigger, thereby providing better results and function with increased accuracy (Naik, col. 1, lines 26-45).
As per claim 9, Grost in view of Song does not explicitly disclose wherein the one or more conditions include at least one of: the client device is charging, the client device has at least a threshold state of charge, or the client device is not being carried by a user.
Naik in the same field of endeavor teaches a speech recognition system (Abstract) wherein the one or more conditions include at least one of: the client device is charging, the client device has at least a threshold state of charge, or the client device is not being carried by a user (col. 22, lines 56-65; col. 24, lines 3-29; and col. 26, lines 3-20).  Therefore, it would have been obvious at the time the application was filed to use Naik’s sensors with the system of Grost in view of Song, in order to help determine whether and how to operate the voice trigger, thereby providing better results and function with increased accuracy (Naik, col. 1, lines 26-45).
As per claim 10, Grost in view of Song does not explicitly disclose wherein the one or more conditions include two or more of: the client device is charging, the client device has at least a threshold state of charge, or the client device is not being carried by a user.
Naik in the same field of endeavor teaches a speech recognition system (Abstract) wherein the one or more conditions include two or more of: the client device is charging, the client device has at least a threshold state of charge, or the client device is not being carried by a user (col. 22, lines 56-65; col. 24, lines 3-29; and col. 26, lines 3-20).  Therefore, it would have been obvious at the time the application was filed to use Naik’s sensors with the system of Grost in view of Song, in order to help determine whether and how to operate the voice trigger, thereby providing better results and function with increased accuracy (Naik, col. 1, lines 26-45).
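For illustration only, the difference between the “at least one of” conditions of claim 9 and the “two or more of” conditions of claim 10 can be sketched as follows. This is a minimal sketch under stated assumptions: the `DeviceState` fields, the 0.8 charge threshold, and the function names are hypothetical and do not come from Naik, Grost, Song, or the application.

```python
from dataclasses import dataclass

@dataclass
class DeviceState:
    """Hypothetical snapshot of sensor-derived device state."""
    charging: bool
    state_of_charge: float   # fraction between 0.0 and 1.0
    carried_by_user: bool

def count_satisfied(state, charge_threshold=0.8):
    """Number of the three listed conditions that the current state satisfies."""
    return sum([
        state.charging,
        state.state_of_charge >= charge_threshold,
        not state.carried_by_user,
    ])

def at_least_one(state, charge_threshold=0.8):
    """Claim 9 style gate: any one of the listed conditions suffices."""
    return count_satisfied(state, charge_threshold) >= 1

def two_or_more(state, charge_threshold=0.8):
    """Claim 10 style gate: at least two of the listed conditions are required."""
    return count_satisfied(state, charge_threshold) >= 2

# Usage: a device that is charging, at half charge, and in the user's pocket.
state = DeviceState(charging=True, state_of_charge=0.5, carried_by_user=True)
run_under_claim_9_gate = at_least_one(state)
run_under_claim_10_gate = two_or_more(state)
```

A device state can therefore pass the first gate while failing the second, which is why the two claims are treated separately.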
Conclusion
6.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See PTO-892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDELALI SERROU whose telephone number is (571)272-7638.  The examiner can normally be reached M-F, 9 AM - 5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/ABDELALI SERROU/
Primary Examiner, Art Unit 2659