DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Specification
The disclosure is objected to because of the following informalities: 
Page 9, [0022], line 16: “wakeup” should read “wake up”
Page 11, [0027], line 27: “publically” should read “publicly” 
The use of the terms Bluetooth and Ethernet in line 16 on page 27, paragraph [0057], which are trade names or marks used in commerce, has been noted in this application. The terms should be accompanied by the generic terminology; furthermore, the terms should be capitalized wherever they appear or, where appropriate, include a proper symbol indicating use in commerce such as ™, SM, or ® following the terms.
Although the use of trade names and marks used in commerce (i.e., trademarks, service marks, certification marks, and collective marks) is permissible in patent applications, the proprietary nature of the marks should be respected and every effort made to prevent their use in any manner which might adversely affect their validity as commercial marks.

Appropriate correction is required.
Claim Objections
Claims 3 and 19 are objected to because of the following informalities: “device frequency interacts” should read “device frequently interacts” in the last lines of claims 3 and 19. 

Claims 4 and 20 are objected to because of the following informalities:  "the voice-enabled frequently" should read "the voice-enabled device frequently" in the second to last lines of claims 4 and 20.  

Appropriate correction is required. 

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 4 and 20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. The claims recite the limitations “sampling the noisy audio data” in the first line, “randomly sampling noisy audio data” in the third line, and “the noisy audio data sampled” in the fifth line of the claims. Claims 4 and 20 are therefore indefinite because it is unclear whether these limitations refer to the same element, “noisy audio data,” or to a further limitation. The examiner suggests amending the claims to read “randomly sampling the noisy audio data” in the third line to resolve the issue. For expedited prosecution, “randomly sampling noisy audio data” in the third line of claims 4 and 20 shall be interpreted as “randomly sampling the noisy audio data.”

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-2, 5, 17-18, and 21 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Meloney et al. (Doc. ID. US 20140278420 A1), hereinafter Meloney.

Regarding claim 1, Meloney teaches a method (Spec. page 1, [0010]) comprising: 
receiving, at data processing hardware of a voice-enabled device, a fixed set of training utterances, each training utterance in the fixed set of training utterances comprising a corresponding transcription paired with a corresponding speech representation of the corresponding training utterance (Spec. page 1, [0013], lines 1-5; a user device retrieves pre-recorded utterances from an utterance database to train a voice recognition [“VR”] model. Page 2, [0026]; the user device may be any of a variety of voice-enabled computer devices, i.e. voice-enabled devices comprising data processing hardware, such as smartphones. Page 4, [0043]; text and a corresponding utterance are used together to train the VR model);
sampling, by the data processing hardware, noisy audio data from an environment of the voice-enabled device (Spec. page 1, [0014], lines 7-9; the device captures natural noise for training, i.e. samples noisy audio data from an environment of the device); 
for each training utterance in the fixed set of training utterances: 
augmenting, by the data processing hardware, using the noisy audio data sampled from the environment of the voice-enabled device, the corresponding speech representation of the corresponding training utterance to generate one or more corresponding noisy audio samples (Spec. page 1, [0014], lines 7-9; the device makes a composite signal of speech and the captured natural noise, i.e. it augments the speech representation of the training utterance with the noisy audio data sampled from the environment of the voice-enabled device to generate a corresponding noisy audio sample); and
pairing, by the data processing hardware, each of the one or more corresponding noisy audio samples with the corresponding transcription of the corresponding training utterance (as detailed above, page 4, [0043] teaches that text and a corresponding utterance are used together to train the VR model. Page 1, [0014], lines 7-9 teaches that the device uses the utterance and the captured natural noise, therefore the device uses both the text and the captured noise together with the utterance for training); and
training, by the data processing hardware, a speech model on the one or more corresponding noisy audio samples generated for each speech representation in the fixed set of training utterances (Spec. page 1, [0014], lines 1-4; the device uses the method of using the composite speech and natural noise signal described in [0014], lines 7-9 to train the VR model).
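
For illustration only, the augmenting and pairing limitations mapped above can be sketched as follows. This is a minimal sketch under stated assumptions: audio is modeled as a plain list of float samples, noise is mixed in by simple addition with a fixed gain, and the names `augment` and `build_noisy_pairs`, the gain value, and the toy data are hypothetical. None of it is taken from Meloney or from the application.

```python
import random

def augment(speech, noise, gain=0.5):
    """Mix a noise clip into a speech clip, sample by sample (assumed additive mixing)."""
    # Loop the noise clip if it is shorter than the speech clip.
    return [s + gain * noise[i % len(noise)] for i, s in enumerate(speech)]

def build_noisy_pairs(utterances, noise_clips, copies=2, seed=0):
    """Return (noisy_audio, transcription) pairs for every training utterance."""
    rng = random.Random(seed)
    pairs = []
    for transcription, speech in utterances:
        for _ in range(copies):
            # Pick one environment noise clip at random for each noisy copy.
            noise = rng.choice(noise_clips)
            # The noisy copy keeps the transcription of the clean utterance.
            pairs.append((augment(speech, noise), transcription))
    return pairs

# Hypothetical fixed set of (transcription, speech samples) and sampled noise clips.
utterances = [("hello", [0.1, 0.2, 0.3]), ("stop", [0.4, 0.5])]
noise_clips = [[0.02, -0.02], [0.01]]
pairs = build_noisy_pairs(utterances, noise_clips)
```

Each clean utterance yields `copies` noisy samples, each paired with the original transcription, so the resulting list can be fed to any supervised trainer in place of, or alongside, the clean set.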

Regarding claim 2, Meloney further teaches wherein sampling the noisy audio data from the environment of the voice-enabled device comprises randomly sampling noise from the environment of the voice-enabled device at least one of immediately before, during, or immediately after speech interactions between the voice-enabled device and a user associated with the voice-enabled device (Spec. page 4, [0041], lines 1-5; the device records an utterance of the user's speech including the natural background noise, i.e. randomly samples noise from the environment of the voice-enabled device during a speech interaction between the voice-enabled device and a user associated with the voice-enabled device).

Regarding claim 5, Meloney further teaches wherein a digital signal processor (DSP) of the data processing hardware samples the noisy audio data from the environment of the voice-enabled device (Spec. page 2, [0028], lines 3-8; the voice-enabled device can include a digital signal processor as the computer processor 204).

Regarding claim 17, the claim is directed to a system comprising: 
data processing hardware; and 
memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform the operations of claim 1. Meloney teaches a system comprising:
data processing hardware (Spec. page 2, [0028], lines 3-8; the voice-enabled device includes computing processor 204); and
memory hardware in communication with the data processing hardware, the memory hardware storing instructions (Spec. page 2, [0028], lines 3-8; the voice-enabled device includes memory, lines 15-18; all components of the device can be coupled in communication with one another) that when executed on the data processing hardware cause the data processing hardware to perform the operations of claim 1, therefore claim 17 is rejected on the same grounds.

Regarding claim 18, the claim is directed to the system of claim 17 for performing the elements of claim 2 and is rejected on the same grounds.

Regarding claim 21, the claim is directed to the system of claim 17 for performing the elements of claim 5 and is rejected on the same grounds.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 3-4 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Meloney in view of Meacham et al. (Doc. ID US 9886954 B1), hereinafter Meacham.

Regarding claim 3, Meloney teaches the method of claim 1 as detailed above, however, Meloney does not teach wherein sampling the noisy audio data from the environment of the voice-enabled device comprises: 
obtaining contexts and/or time windows when a user of the voice-enabled device frequently interacts with the voice-enabled device; and [Attorney Docket No: 231441-475044]
sampling the noisy audio data from the environment of the voice-enabled device during the obtained contexts and/or time windows when the user of the voice-enabled device frequently interacts with the voice-enabled device.
Meacham discloses context aware processing of ambient sound using machine learning (Abstract). The system adjusts how the user experiences their auditory environment based on the context of the user, which is comprised of a contextual state (Spec. Col. 2, lines 30-43). The contextual state is determined by time-based information, ambient sound environment, and an ambient sound profile (Col. 13, lines 56-59). Meacham further teaches sampling the noisy audio data from the environment of the voice-enabled device comprising: 
obtaining contexts and/or time windows when a user of the voice-enabled device frequently interacts with the voice-enabled device (Spec. Col. 16, lines 27-35; time-based data is gathered showing windows of time when a user habitually performs an action associated with the personal audio system and used in the selection of an action set); and
sampling the noisy audio data from the environment of the voice-enabled device during the obtained contexts and/or time windows when the user of the voice-enabled device frequently interacts with the voice-enabled device (Col. 13, lines 56-59; the contextual state is determined by time-based information [contexts and/or time windows when the user of the voice-enabled device frequently interacts with the voice-enabled device as shown above], ambient sound environment [noisy audio data from the environment of the device], and an ambient sound profile; therefore the device samples noisy audio data from the environment of the voice-enabled device during the obtained contexts and/or time windows when the user of the voice-enabled device frequently interacts with the voice-enabled device to obtain the ambient sound environment).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meloney to incorporate the teachings of Meacham to provide the method according to claim 1, wherein sampling the noisy audio data from the environment of the voice-enabled device comprises: obtaining contexts and/or time windows when a user of the voice-enabled device frequently interacts with the voice-enabled device; and sampling the noisy audio data from the environment of the voice-enabled device during the obtained contexts and/or time windows when the user of the voice-enabled device frequently interacts with the voice-enabled device. Both disclosures are directed to the handling of audio processing in noisy environments. Meloney recognizes that a user can experience decreased performance of their device in the presence of background noise and in different audio environments (Spec. page 1, [0004]). Similarly, Meacham understands that it is useful to the user to have a device which can adjust to different audio environments (Spec. Col. 1, lines 25-30). Therefore, it would have been obvious to combine the features of both disclosures to solve the same problem, improving audio processing quality in a noisy environment.

Regarding claim 4, Meloney teaches the method of claim 1 as detailed above, however, Meloney does not teach wherein sampling the noisy audio data from the environment of the voice-enabled device comprises: 
randomly sampling noisy audio data from the environment of the voice-enabled device throughout a day; and 
applying weights to any of the noisy audio data sampled from the environment during contexts and/or time windows when a user of the voice-enabled frequently interacts with the voice-enabled device more.
Meacham discloses context aware processing of ambient sound using machine learning (Abstract). The system adjusts how the user experiences their auditory environment based on the context of the user, which is comprised of a contextual state (Spec. Col. 2, lines 30-43). The contextual state is determined by time-based information, ambient sound environment, and an ambient sound profile (Col. 13, lines 56-59). Meacham further teaches sampling the noisy audio data from the environment of the voice-enabled device comprising:
randomly sampling noisy audio data from the environment of the voice-enabled device throughout a day (Spec. Col. 16, lines 27-35; time-based data is gathered over the course of multiple occurrences); and 
applying weights to any of the noisy audio data sampled from the environment during contexts and/or time windows when a user of the voice-enabled frequently interacts with the voice-enabled device more (Col. 24, lines 9-18; the machine learning models output confidence values indicating characteristics of the ambient audio stream, which is comprised of noisy audio data sampled from the environment [Fig. 2 element 205]. Col. 13, lines 56-59; the device samples noisy audio data from the environment of the voice-enabled device during the obtained contexts and/or time windows when the user of the voice-enabled device frequently interacts with the voice-enabled device to obtain the ambient sound environment as detailed above with respect to claim 3).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meloney to incorporate the teachings of Meacham to provide the method according to claim 1, wherein sampling the noisy audio data from the environment of the voice-enabled device comprises: randomly sampling noisy audio data from the environment of the voice-enabled device throughout a day; and applying weights to any of the noisy audio data sampled from the environment during contexts and/or time windows when a user of the voice-enabled frequently interacts with the voice-enabled device more. Both disclosures are directed to the handling of audio processing in noisy environments. Meloney recognizes that a user can experience decreased performance of their device in the presence of background noise and in different audio environments (Spec. page 1, [0004]). Similarly, Meacham understands that it is useful to the user to have a device which can adjust to different audio environments (Spec. Col. 1, lines 25-30). Therefore, it would have been obvious to combine the features of both disclosures to solve the same problem, improving audio processing quality in a noisy environment.
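
For illustration only, one possible reading of the claim 4 limitations (random sampling throughout a day, with heavier weights for samples captured during frequent-interaction windows) can be sketched as follows. The sketch assumes capture times are expressed as an hour of the day and that weighting means a fixed multiplicative boost; the function name `sample_weights`, the boost value, and the example windows are hypothetical and are not taken from Meacham, Meloney, or the application.

```python
import random

def sample_weights(timestamps_hours, busy_windows, boost=3.0):
    """Weight each noise sample; samples inside a busy window count more."""
    weights = []
    for t in timestamps_hours:
        # A window is a half-open interval [start, end) in hours of the day.
        in_window = any(start <= t < end for start, end in busy_windows)
        weights.append(boost if in_window else 1.0)
    return weights

# Hypothetical capture times chosen at random over a 24-hour day.
rng = random.Random(0)
timestamps = sorted(rng.uniform(0, 24) for _ in range(8))
# Hypothetical windows in which the user frequently interacts with the device.
busy_windows = [(7, 9), (18, 22)]
weights = sample_weights(timestamps, busy_windows)
# Draw noise samples for augmentation in proportion to their weights.
chosen = rng.choices(range(len(timestamps)), weights=weights, k=4)
```

Under these assumptions, noise captured during the frequent-interaction windows is three times as likely to be selected for augmentation as noise captured at other times.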

Regarding claim 19, the claim is directed to the system of claim 17 for performing the elements of claim 3 and is rejected on the same grounds.

Regarding claim 20, the claim is directed to the system of claim 17 for performing the elements of claim 4 and is rejected on the same grounds.

Claims 6 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Meloney in view of Pogue et al. (Patent No. US 9799329 B1), hereinafter Pogue.

Regarding claim 6, Meloney teaches the method of claim 1 as detailed above, however, Meloney does not teach the method further comprising, prior to augmenting the corresponding speech representation of the corresponding training utterance, de-noising, by the data processing hardware, the corresponding speech representation to remove any previously existing noise.
Pogue discloses a device for improving automatic speech recognition by using acoustic echo cancellation to remove environmental sounds from input audio signals (Spec. Col. 1, lines 27-33). 
Modifying Meloney to include the teachings of Pogue provides the method of claim 1, further comprising, prior to augmenting the corresponding speech representation of the corresponding training utterance, de-noising, by the data processing hardware, the corresponding speech representation to remove any previously existing noise (Meloney, Spec. page 1, [0014], lines 7-9; the device of claim 1, now adapted to remove existing environmental sound from the input audio signal as taught by Pogue, Spec. Col. 2, lines 23-28, prior to augmenting the speech representation of the training utterance with the noisy audio data sampled from the environment of the voice-enabled device to generate a corresponding noisy audio sample).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meloney to incorporate the teachings of Pogue to provide the method according to claim 1, further comprising, prior to augmenting the corresponding speech representation of the corresponding training utterance, de-noising, by the data processing hardware, the corresponding speech representation to remove any previously existing noise. Both disclosures are directed to the handling of audio processing in noisy environments. Meloney teaches the use of speech with no environmental background sounds mixed with previously stored noise for training the model (Spec. page 1, [0014], lines 11-16). Pogue provides techniques for an automatic speech recognition system to remove environmental noise to improve performance (Spec. Col. 1, lines 27-33). Therefore, it would have been obvious to use the techniques of Pogue to clean samples of captured environmental noise for use in training, preserving the speech to be paired with previously stored noise files for training the system of Meloney.

Regarding claim 22, the claim is directed to the system of claim 17 for performing the elements of claim 6 and is rejected on the same grounds.

Claims 7-8, 11, 16, 23-24, 27, and 32 are rejected under 35 U.S.C. 103 as being unpatentable over Meloney in view of Panchapagesan et al. (Doc. ID US 10147442 B1), hereinafter Panchapagesan.

Regarding claim 7, Meloney teaches the method of claim 1 as detailed above, however, Meloney does not teach the method further comprising, when the speech model comprises a speech recognition model, for each speech representation in the fixed set of training utterances and each noisy audio sample of the one or more noisy audio samples generated for the corresponding speech representation:
determining, by the data processing hardware, for output by the speech model, a corresponding probability distribution over possible speech recognition hypotheses for the corresponding speech representation or the corresponding noisy audio sample; and
generating, by the data processing hardware, a loss term based on the corresponding probability distribution over possible speech recognition hypotheses for the corresponding speech representation or the corresponding noisy audio sample.
Panchapagesan teaches the training of an acoustic model for speech recognition in a noisy environment (Spec. Col. 3, lines 4-15). Panchapagesan further teaches 
determining, by the data processing hardware, for output by the speech model, a corresponding probability distribution over possible speech recognition hypotheses for the corresponding speech representation or the corresponding noisy audio sample (Spec. Col. 3, lines 29-34; the neural network predicts speech from an input signal and generates main acoustic model output as probabilities that the input corresponds to subword units of a language, i.e. the neural network determines for output by the speech model a corresponding probability distribution over possible speech recognition hypotheses for the corresponding speech or noisy audio input); and
generating, by the data processing hardware, a loss term based on the corresponding probability distribution over possible speech recognition hypotheses for the corresponding speech representation or the corresponding noisy audio sample (Spec. Col. 10, lines 57-62; a loss function is used to determine the error in the main acoustic model output, which is a set of probabilities of possible speech recognition hypotheses for the corresponding speech representation input or the corresponding noisy audio sample).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meloney to incorporate the teachings of Panchapagesan to provide the method according to claim 1, further comprising, when the speech model comprises a speech recognition model, for each speech representation in the fixed set of training utterances and each noisy audio sample of the one or more noisy audio samples generated for the corresponding speech representation: determining, by the data processing hardware, for output by the speech model, a corresponding probability distribution over possible speech recognition hypotheses for the corresponding speech representation or the corresponding noisy audio sample; and generating, by the data processing hardware, a loss term based on the corresponding probability distribution over possible speech recognition hypotheses for the corresponding speech representation or the corresponding noisy audio sample. Both disclosures are directed to the handling of audio processing in noisy environments. While Meloney teaches a manual method of error detection relying on user interaction to indicate a failed recognition for updating the model (Spec. page 4, [0043]) to gain more accurate results, Meloney also recognizes automated training processes (Spec. page 1, [0012]). Panchapagesan is also concerned with updating the model using a loss function to gauge how closely output matches the expected result given the input for the purpose of minimizing the difference between the desired and actual results (Spec. Col. 3, lines 35-47). Therefore, it would have been obvious to use the techniques of Panchapagesan to update the system of Meloney for improved results.
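
For illustration only, the two limitations of claim 7 (a probability distribution over recognition hypotheses, and a loss term based on that distribution) can be sketched with a softmax followed by a cross-entropy loss. These are swapped-in textbook techniques chosen for the sketch; the cited passages of Panchapagesan are not stated here to use this exact formulation, and the names `softmax` and `cross_entropy` and the example scores are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw model scores into a probability distribution over hypotheses."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss term: negative log-probability assigned to the correct hypothesis."""
    return -math.log(probs[target_index])

# Hypothetical scores for three candidate hypotheses; index 0 is the correct one.
scores = [2.0, 0.5, -1.0]
probs = softmax(scores)
loss = cross_entropy(probs, 0)
```

The same two functions apply unchanged whether the model input is the clean speech representation or one of its noisy audio samples; only the scores differ.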

Regarding claim 8, the combination of Meloney and Panchapagesan detailed above with respect to claim 7 further teaches wherein training the speech model comprises updating parameters of the speech recognition model using the loss term generated for each speech representation in the fixed set of training utterances and each noisy audio sample of the one or more noisy audio samples generated for each corresponding speech representation in the fixed set of training utterances (Panchapagesan, Spec. Col. 3, lines 35-47; acoustic model parameters are modified during training using the output of a loss function for the output of each training data input).

Regarding claim 11, Meloney teaches the method of claim 1 as detailed above, however, Meloney does not teach the method further comprising wherein the corresponding speech representation for at least one training utterance comprises an audio feature representation of the corresponding training utterance.
Panchapagesan teaches the training of an acoustic model for speech recognition in a noisy environment (Spec. Col. 3, lines 4-15). Panchapagesan further teaches wherein the corresponding speech representation for at least one training utterance comprises an audio feature representation of the corresponding training utterance (Spec. Col. 4, lines 60-67; the neural network accepts input as feature vector representations of the training utterances). 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meloney to incorporate the teachings of Panchapagesan to provide the method according to claim 1, wherein the corresponding speech representation for at least one training utterance comprises an audio feature representation of the corresponding training utterance. Both disclosures are directed to the handling of audio processing in noisy environments. Meloney teaches that the disclosure is intended to apply to a variety of devices configured to accept sound inputs representative of vocalized information (Spec. page 2, [0027]). Panchapagesan is similarly directed to accepting sound input and lists several example options of format (Spec. Col. 4, lines 60-67). Therefore, it would have been obvious to adapt the system of Meloney to incorporate the features of Panchapagesan to produce a system capable of using audio feature representations of input training speech utterances.

Regarding claim 16, Meloney teaches the method of claim 1 as detailed above. Meloney further teaches that the device has a memory which contains the utterance database and noise database (Spec. page 4, [0037]). However, Meloney does not teach pairing each of the one or more corresponding noisy audio samples with the corresponding transcription of the corresponding training utterance and storing, by the data processing hardware, on memory hardware in communication with the data processing hardware, the pairing of each of the one or more corresponding noisy samples with the corresponding transcription of the corresponding training utterance.
Panchapagesan teaches the training of an acoustic model for speech recognition in a noisy environment (Spec. Col. 3, lines 4-15). Panchapagesan further teaches that the data aggregation system is a computing system which stores signals and other data for training the acoustic model (Spec. Col. 9, lines 21-28). Panchapagesan further teaches that the audio samples of the training data can be annotated with transcriptions of the utterances in the samples (Spec. Col. 10, lines 18-23).
Adapting Meloney to incorporate the features of Panchapagesan produces the method of claim 1 (as disclosed by Meloney above), further comprising, after pairing each of the one or more corresponding noisy audio samples with the corresponding transcription of the corresponding training utterance, storing, by the data processing hardware, on memory hardware in communication with the data processing hardware, the pairing of each of the one or more corresponding noisy samples with the corresponding transcription of the corresponding training utterance (the system of Meloney using noisy audio samples adapted by the teachings of Panchapagesan regarding training data, which includes audio data annotated with transcriptions, i.e. utterances paired with corresponding transcriptions, and storing the paired noisy audio samples with corresponding transcriptions on memory hardware in communication with the data processing hardware).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meloney to incorporate the teachings of Panchapagesan to provide the method of claim 16. Both disclosures are directed to the handling of audio processing in noisy environments. While Meloney does not explicitly teach the storing of pairs of noisy audio data and corresponding transcriptions in memory hardware, Meloney does disclose the use of text and a corresponding utterance together with noisy audio data from the environment to train the VR model (Page 4, [0043]) and the use of memory which contains the utterance database and noise database (Spec. page 4, [0037]). Panchapagesan discloses that the pairing of transcriptions with the rest of the training data helps to keep the transcriptions and audio signals in alignment (Spec. Col. 10, lines 18-23). Therefore it would have been obvious to modify Meloney to incorporate the teachings of Panchapagesan to improve training by maintaining alignment between the transcriptions and audio signals. 

Regarding claim 23, the claim is directed to the system of claim 17 for performing the elements of claim 7 and is rejected on the same grounds.

Regarding claim 24, the claim is directed to the system of claim 23 for performing the elements of claim 8 and is rejected on the same grounds.

Regarding claim 27, the claim is directed to the system of claim 17 for performing the elements of claim 11 and is rejected on the same grounds.

Regarding claim 32, the claim is directed to the system of claim 17 for performing the elements of claim 16 and is rejected on the same grounds.

Claims 10 and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Meloney in view of Weinberger (Doc. ID. US 11227187 B1).

Regarding claim 10, Meloney teaches the method of claim 1 as detailed above, however, Meloney does not teach the method further comprising wherein the corresponding speech representation for at least one training utterance comprises a raw audio waveform of the corresponding training utterance.
Weinberger is directed to generating and updating trained machine learning models (Spec. Col. 2, lines 45-49) which can be used for training models for voice recognition or natural language processing tasks (Spec. Col. 4, lines 5-12). Weinberger further teaches wherein the corresponding speech representation for at least one training utterance comprises a raw audio waveform of the corresponding training utterance (Spec. Col. 4, lines 5-12; training data may comprise acoustic data as waveforms, i.e. raw audio).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meloney to incorporate the teachings of Weinberger to provide the method according to claim 1, wherein the corresponding speech representation for at least one training utterance comprises a raw audio waveform of the corresponding training utterance. Both disclosures are directed to audio processing. Meloney teaches that the disclosure is intended to apply to a variety of devices configured to accept sound inputs representative of vocalized information (Spec. page 2, [0027]). Weinberger is similarly directed to accepting sound input and lists several example options of format (Spec. Col. 4, lines 5-12). Therefore, it would have been obvious to adapt the system of Meloney to incorporate the features of Weinberger to produce a system capable of using raw audio waveform representations of input training speech utterances.

Regarding claim 26, the claim is directed to the system of claim 17 for performing the elements of claim 10 and is rejected on the same grounds.

Claims 9 and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Meloney in view of Panchapagesan and Weinberger.

Regarding claim 9, the combination of Meloney and Panchapagesan teaches the method of claim 7 as detailed above. In particular, the combination teaches generating a loss term for the corresponding speech representation or the corresponding noisy audio sample (Spec. Col. 10, lines 57-62). Panchapagesan further teaches updating parameters of the speech recognition model using the loss term (Spec. Col. 3, lines 35-47). However, the combination does not teach wherein training the speech model comprises: 
transmitting, to a central server, the loss term generated for each speech representation in the fixed set of training utterances and each noisy audio sample of the one or more noisy audio samples generated for each corresponding speech representation in the fixed set of training utterances,
wherein the central server is configured to use federated learning to update parameters of a server-side speech recognition model based on:
the loss terms received from the data processing hardware of the voice-enabled device; and
other loss terms received from other voice-enabled devices, the other loss terms received from each other voice-enabled device based on different noisy audio data sampled by the corresponding other voice-enabled device.
Weinberger teaches the federated training of a machine learning model based on raw data received from end users (Abstract) and the disclosure pertains to training models for voice recognition or natural language processing tasks (Spec. Col. 4, lines 5-12). In particular, Weinberger discloses transmitting, to a central server (As shown in Fig. 1A, server 112 is a central server connected by a network 190 to multiple end users), any relevant data for the training of the model (Spec. Col. 3, lines 51-57; end users may provide any relevant data to the server 112 for training of the model). The central server updates the baseline model parameters based on the data received from end users, i.e. using federated learning (Spec. Col. 26, lines 30-36).
Adapting the combination of Meloney and Panchapagesan to use the features taught by Weinberger discloses the method of claim 7, wherein training the speech model comprises:
transmitting, to a central server (Weinberger; As shown in Fig. 1A, server 112 is a central server connected by a network 190 to multiple end users), the loss term generated for each speech representation in the fixed set of training utterances and each noisy audio sample of the one or more noisy audio samples generated for each corresponding speech representation in the fixed set of training utterances (the system described above with respect to claim 7 as taught by the combination of Meloney and Panchapagesan generates a loss term for each speech representation in the fixed set of training utterances and each noisy audio sample of the one or more noisy audio samples generated for each corresponding speech representation in the fixed set of training utterances and is adapted to send this data to a central server for training of the model as detailed in Weinberger Spec. Col. 3, lines 51-57),
wherein the central server is configured to use federated learning (Weinberger, Spec. Col. 2, line 64-Col. 3 line 2; federated learning techniques are used to generate and train machine learning models) to update parameters of a server-side speech recognition model based on (Weinberger, Spec. Col. 26, lines 30-36; the central server updates the baseline model parameters based on the data received from end users, i.e. using federated learning): 
the loss terms received from the data processing hardware of the voice-enabled device (updating parameters of the speech recognition model using the loss term as detailed in Panchapagesan, Spec. Col. 3, lines 35-47); and
other loss terms received from other voice-enabled devices, the other loss terms received from each other voice-enabled device based on different noisy audio data sampled by the corresponding other voice-enabled device (the generation of the loss term now adapted to be done on numerous devices and use of the loss term for updating the parameters of the speech recognition model now done at the central server of Weinberger).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Meloney and Panchapagesan to incorporate the teachings of Weinberger to provide the method according to claim 9. Meloney recognizes that while more training data results in better results, it is burdensome to need to sample repeat data for different noise environments for training (Spec. page 2, [0015]). Weinberger also recognizes that limited training data can result in worse performance of the model, and rectifies this by using data sourced from many devices (Spec. Col. 2, lines 57-63). Therefore, it would have been obvious to adapt the combination of Meloney and Panchapagesan to incorporate the features of Weinberger to produce an embodiment capable of training on data sourced from multiple end devices by using loss terms to update parameters of a model.

Regarding claim 25, the claim is directed to the system of claim 17 for performing the elements of claim 9 and is rejected on the same grounds.

Claims 12-14 and 28-30 are rejected under 35 U.S.C. 103 as being unpatentable over Meloney in view of Huang et al. "Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection," hereinafter Huang.

Regarding claim 12, Meloney teaches the method of claim 1 as detailed above. Meloney further teaches for at least one training utterance in the fixed set of training utterances obtaining, by the data processing hardware (Spec. page 4, [0041]; the electronic device records an utterance of the user’s speech. If the utterance is determined to be a speech to text command, the device performs a function based on that command), a corresponding spoken utterance sampled from the environment of the voice-enabled device (Spec. page 4, [0042]; if the operation performed corresponding to the command is determined to be a valid operation, the device trains and updates the VR model with the user’s utterance and the command, and possibly with pre-recorded utterances from the utterance database. Under broadest reasonable interpretation, the pre-recorded utterances corresponding to the command can be considered fixed set of training utterances and the captured user’s utterance in [0041] can be considered a corresponding spoken utterance sampled from the environment of the voice-enabled device).
However, Meloney does not disclose that the corresponding spoken utterance sampled from the environment of the voice-enabled device
is phonetically similar to the corresponding speech representation of the corresponding training utterance; and
is paired with a respective transcription that is different than the corresponding transcription that is paired with the corresponding speech representation of the at least one training utterance,
wherein training the speech model on the fixed set of training utterances and the one or more corresponding noisy audio samples is further based on the corresponding spoken utterance obtained for the at least one training utterance in the fixed set of training utterances.
Huang teaches an approach to improve spoken term detection by unsupervised training (Abstract, page 1, lines 19-22). Positive and negative pairs are determined from unlabeled data with a focus on clustering embeddings for similar speech realizations such that embeddings for the same word spoken with different speaker characteristics, like accents, are grouped closely and similar words that are not the same, such as “brother” and “bother,” are farther away (Abstract, page 1, lines 10-18; Introduction, page 1, Col. 2, paragraph 2).
Adapting Meloney to incorporate the features of Huang discloses the method of claim 1, further comprising, for at least one training utterance in the fixed set of training utterances: 
obtaining, by the data processing hardware (Meloney, Spec. page 4, [0041]; the electronic device records an utterance of the user’s speech. If the utterance is determined to be a speech to text command, the device performs a function based on that command), a corresponding spoken utterance sampled from the environment of the voice-enabled device (Spec. page 4, [0042]; if the operation performed corresponding to the command is determined to be a valid operation, the device trains and updates the VR model with the user’s utterance and the command, and possibly with pre-recorded utterances from the utterance database. Under broadest reasonable interpretation, the pre-recorded utterances corresponding to the command can be considered fixed set of training utterances and the captured user’s utterance in [0041] can be considered a corresponding spoken utterance sampled from the environment of the voice-enabled device) that:
is phonetically similar to the corresponding speech representation of the corresponding training utterance (the corresponding spoken utterance sampled from the environment of the voice-enabled device and the training utterance detailed above from Meloney now adapted to be the examples “brother” and “bother” used in Huang Page 1, Col. 2, paragraph 2, lines 15-18); and
is paired with a respective transcription that is different than the corresponding transcription that is paired with the corresponding speech representation of the at least one training utterance (Meloney, Spec. page 4, [0043]; while in command mode, a user can enter text for the command they wish to execute which is then used for training. This is adapted now to apply to the corresponding utterance sampled from the environment and the corresponding training utterance, which are different words [“brother” and “bother”] as in Huang detailed above),
wherein training the speech model on the fixed set of training utterances and the one or more corresponding noisy audio samples is further based on the corresponding spoken utterance obtained for the at least one training utterance in the fixed set of training utterances (Meloney, Spec. page 4, [0042]; if the operation performed corresponding to the command is determined to be a valid operation, the device trains and updates the VR model with the user’s utterance and the command, and possibly with pre-recorded utterances from the utterance database. Under broadest reasonable interpretation, the pre-recorded utterances corresponding to the command can be considered fixed set of training utterances and the captured user’s utterance in [0041] can be considered a corresponding spoken utterance sampled from the environment of the voice-enabled device).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Meloney to incorporate the teachings of Huang to provide the method according to claim 12. Both disclosures are directed to the improvement of speech recognition systems. Meloney recognizes that when a misrecognition occurs despite a high confidence score for a match between a command and a spoken utterance, i.e. the spoken utterance is similar to the executed command, it is useful to train the device with the data from a similar but incorrect recognition (Spec. page 4, [0043]). Huang details the benefits of negative training samples that are similar to the target word when training a speech recognition system (Page 1, Col. 2, paragraph 3). Therefore, it would have been obvious to combine the teachings of Meloney and Huang to provide for training with similar samples.

Regarding claim 13, the combination of Meloney and Huang teaches the method of claim 12 as detailed above. The combination further teaches wherein obtaining the corresponding spoken utterance for the at least one training utterance in the fixed set of training utterances comprises: 
sampling the corresponding spoken utterance from the environment of the voice-enabled device (Meloney, Spec. page 4, [0041]; the electronic device records an utterance of the user’s speech);
determining that the corresponding spoken utterance sampled from the environment is phonetically similar to the corresponding speech representation of the at least one corresponding training utterance based on a comparison of a respective embedding generated for the corresponding spoken utterance and a respective embedding generated for the corresponding speech representation of the at least one training utterance (Huang, Page 1, Col. 2, paragraph 2, lines 15-18; the embeddings for the speech realizations of the words “brother” and “bother,” the corresponding spoken utterance sampled from the environment and the corresponding training utterance as detailed above with respect to claim 12, are very close. Page 1, Col. 2, paragraph 2, lines 4-6; the embeddings carry phonetic information);
obtaining the respective transcription of the corresponding spoken utterance sampled from the environment of the voice-enabled device (Meloney, Spec. page 4, [0043]; while in command mode, a user can enter text for the command they wish to execute which is then used for training. This is adapted now to apply to the corresponding utterance sampled from the environment and the corresponding training utterance, which are different words [“brother” and “bother”] as in Huang detailed above); and
determining that the respective transcription of the corresponding spoken utterance is different than the corresponding transcription that is paired with the corresponding speech representation of the at least one training utterance (Meloney, Spec. page 4, [0043]; while in command mode, a user can enter text for the command they wish to execute which is then used for training. This is adapted now to apply to the corresponding utterance sampled from the environment and the corresponding training utterance, which are different words [“brother” and “bother”] as in Huang detailed above).

Regarding claim 14, the combination of Meloney and Huang teaches the method of claim 13 as detailed above. The combination further teaches wherein an embedding model or a portion of the speech model (Huang is directed to the improvement of speech recognition models by improving their learning of better audio embeddings [Page 1, Sect. 1 Introduction, paragraph 1, lines 11-21]) generates the respective embedding for each of the corresponding spoken utterance and the corresponding speech representation of the at least one training utterance (Page 1, Col. 2, lines 1-9; Audio Word2Vec embeds input audio segments as sequences of vectors).

Regarding claim 28, the claim is directed to the system of claim 17 for performing the elements of claim 12 and is rejected on the same grounds.

Regarding claim 29, the claim is directed to the system of claim 28 for performing the elements of claim 13 and is rejected on the same grounds.

Regarding claim 30, the claim is directed to the system of claim 29 for performing the elements of claim 14 and is rejected on the same grounds.

Claims 15 and 31 are rejected under 35 U.S.C. 103 as being unpatentable over Meloney in view of Huang and Jin et al. (Doc. ID. US 20160260429 A1), hereinafter Jin.

Regarding claim 15, the combination of Meloney and Huang teaches the method of claim 12 as detailed above. The combination further teaches the training of the model using the corresponding spoken utterance obtained for the at least one training utterance as a negative training sample (Meloney teaches the use of a corresponding spoken utterance obtained for the at least one training utterance in the training of a model, as shown above with respect to claim 12. Huang discloses the use of negative training examples for model training [Page 1, Col. 2, paragraph 3]). The combination does not, however, disclose wherein the corresponding speech representation of the at least one training utterance represents a spoken representation of a particular fixed term; the speech model comprises a hotword detection model trained to detect a particular fixed term; and training the hotword detection model to detect the particular fixed term.
Jin teaches systems and methods for noise-robust speech recognition (Spec. page 1, [0005], lines 1-4). The method and systems of the disclosure can be used for an automated speech recognition system for a fixed set of voice commands, i.e. a hotword detection model, therefore the positive training utterances can be spoken representations of a particular fixed term (Spec. page 7, [0078], lines 1-5). The disclosure further teaches the training of feature detectors involving the use of negative training exemplars (Spec. page 4, [0040], lines 19-27). 
Adapting the method taught by the combination of Meloney and Huang to incorporate the features of Jin provides the method of claim 12, wherein: 
the corresponding speech representation of the at least one training utterance represents a spoken representation of a particular fixed term (the system of Meloney using a corresponding spoken utterance obtained for the at least one training utterance in the training of a model, as shown above with respect to claim 12, now adapted to use the method and systems of the disclosure of Jin, Spec. page 7, [0078], lines 1-5 for training an automated speech recognition system for a fixed set of voice commands. The positive training utterances of Jin can be considered to be spoken representations of a particular fixed term as detailed above); 
the speech model comprises a hotword detection model trained to detect a particular fixed term (Jin, Spec. page 7, [0078], lines 1-5, the method and systems of the disclosure can be used for an automated speech recognition system for a fixed set of voice commands, i.e. a hotword detection model); and
training the hotword detection model to detect the particular fixed term comprises using the corresponding spoken utterance obtained for the at least one training utterance as a negative training sample (The system of Meloney, now adapted such that the corresponding spoken utterance obtained for the at least one training utterance is used as the negative training exemplar for training the feature detectors in Jin, Spec. page 4, [0040], lines 19-27). 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Meloney and Huang to incorporate the teachings of Jin to provide the method according to claim 15. Both Meloney and Jin are directed to improving speech recognition in noisy environments. Meloney discusses the training of the VR model with speech to text commands (Spec. page 2, [0019] and [0023]), i.e. particular fixed terms, which can be considered to be training for hotword detection. Jin also discloses the training of a speech recognition system for recognizing a fixed set of voice commands (Spec. page 7, [0078], lines 1-5), and notes the improvement in performance when training with negative examples (Spec. page 2, [0021]). Huang also details the benefits of negative training samples when training a speech recognition system (Page 1, Col. 2, paragraph 3). Therefore, it would have been obvious to combine the teachings of Meloney, Huang, and Jin to provide improved training for hotword detection using negative samples. 

Regarding claim 31, the claim is directed to the system of claim 28 for performing the elements of claim 15 and is rejected on the same grounds.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Lee et al. (Doc. ID. US 2017/0133006 A1) teaches a neural network training apparatus using primary training with clean training data and secondary training based on noisy training data.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PARKER L MAYFIELD whose telephone number is (571)272-4745. The examiner can normally be reached Monday - Friday 7:30 AM-5:30 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PARKER L MAYFIELD/
Examiner
Art Unit 2655



/ANDREW C FLANDERS/Supervisory Patent Examiner, Art Unit 2655