DETAILED ACTION

Notice of AIA Status
The present application, filed on or after March 3rd, 2021, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 04/16/2021 have been considered by the examiner.
Claim Objection
Claim 6 is objected to because of the following informalities:
Claim 6, line 2, recites the term “initial KWS teach model”. It is not clear what the applicant means by “initial KWS teach model” in claim 6. For examination purposes, the Examiner has interpreted the term “initial KWS teach model” to be “initial KWS teacher model”. Appropriate correction is required.


Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-18 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

The Independent claim(s) 1 recite(s) “A method implemented by one or more processors, the method comprising: training an initial keyword spotting ("KWS") teacher model using a labeled training data set which includes input audio features and supervised output features”; “generating augmented audio data, wherein generating the augmented audio data comprises augmenting an instance of base audio data”; “processing the augmented audio data using the initial KWS teacher model to generate a soft label”; “processing the augmented audio data using a KWS student model to generate student output”; “and updating one or more portions of the KWS student model based on comparing the soft label and the generated student output”. 
The above limitations, as drafted, recite a process that covers performance of the limitations as a mental process but for the recitation of a generic computing device. That is, other than reciting the generic computing device, nothing precludes the steps from practically being performed with pen and paper.
This judicial exception is not integrated into a practical application. Claim 1 recites “A method implemented by one or more processors”; this limitation is directed to using a computer for the method and does not impose any meaningful limits on practicing the abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The addition of the generic computer components recited above with regard to claim 1 does not amount to more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Claim 1 does not recite any additional limitations. The claim, as drafted, is not patent eligible.

The Independent claim(s) 12 recite(s), “A method implemented by one or more processors, the method comprising: receiving audio data capturing a spoken utterance which includes one or more keywords”; “processing the audio data using a keyword spotting ("KWS") model to generate keyword output”, “wherein training the KWS model comprises: training an initial KWS teacher model portion of the KWS model using a labeled training data set which includes input audio features and supervised output features”; “generating augmented audio data, wherein generating the augmented audio data comprises augmenting an instance of base audio data”; “processing the augmented audio data using the initial KWS teacher model to generate a soft label”; “processing the augmented audio data using a KWS student model portion of the KWS model to generate student output; and updating one or more portions of the KWS student model portion of the KWS model based on comparing the soft label and the generated student output”; “and determining whether the one or more keywords are present in the spoken utterance based on the keyword output.”.
The above limitations, as drafted, recite a process that covers performance of the limitations as a mental process but for the recitation of a generic computing device. That is, other than reciting the generic computing device, nothing precludes the steps from practically being performed with pen and paper.
This judicial exception is not integrated into a practical application. Claim 12 recites “A method implemented by one or more processors”; this limitation is directed to using a computer for the method and does not impose any meaningful limits on practicing the abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The addition of the generic computer components recited above with regard to claim 12 does not amount to more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Claim 12 does not recite any additional limitations. The claim, as drafted, is not patent eligible.

Independent claim 13 recites, “A method implemented by one or more processors, the method comprising: training an initial teacher model using a labeled training data set which includes input audio features and supervised output features”; “generating augmented audio data, wherein generating the augmented audio data comprises augmenting an instance of base audio data using time masking and/or frequency masking of the base audio data”; “processing the augmented audio data using the initial teacher model to generate a soft label”; “processing the augmented audio data using a student model to generate student output”; “and updating one or more portions of the student model based on comparing the soft label and the generated student output.”
The above limitations, as drafted, recite a process that covers performance of the limitations as a mental process but for the recitation of a generic computing device. That is, other than reciting the generic computing device, nothing precludes the steps from practically being performed with pen and paper.
This judicial exception is not integrated into a practical application. Claim 13 recites “A method implemented by one or more processors”; this limitation is directed to using a computer for the method and does not impose any meaningful limits on practicing the abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The addition of the generic computer components recited above with regard to claim 13 does not amount to more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Claim 13 does not recite any additional limitations. The claim, as drafted, is not patent eligible.

Claims 2 and 14 recite the additional limitations of “further comprising: generating additional augmented audio data”, “wherein generating the additional augmented audio data comprises augmenting an additional instance of base audio data”; “processing the additional augmented audio data using the initial KWS teacher model to generate an additional soft label”; “processing the additional augmented audio data using the KWS student model to generate additional student output”; “and further updating the one or more portions of the KWS student model based on comparing the additional soft label and the generated additional student output.” 
The above limitations, as drafted, recite a process that covers performance of the limitations as a mental process but for the recitation of a generic computing device. That is, other than reciting the generic computing device, nothing precludes the steps from practically being performed with pen and paper.
This judicial exception is not integrated into a practical application, as Claims 2 and 14 comprise no additional limitations.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception, as Claims 2 and 14 do not recite any additional limitations. The claims, as drafted, are not patent eligible.

Claims 3 and 15 recite the additional limitations of “further comprising: subsequent to updating the one or more portions of the student model based on comparing the soft label and the generated student output, using the KWS student model as a next instance of the KWS teacher model”; “generating additional augmented audio data, wherein generating the additional augmented audio data comprises augmenting an additional instance of base audio data”; “processing the additional augmented audio data using the next instance of the KWS teacher model to generate an additional soft label”; “processing the additional augmented audio data using a next instance of the KWS student model to generate additional student output”; “and updating one or more additional portions of the next instance of the KWS student model based on comparing the additional soft label and the additional student output”. 
The above limitations, as drafted, recite a process that covers performance of the limitations as a mental process but for the recitation of a generic computing device. That is, other than reciting the generic computing device, nothing precludes the steps from practically being performed with pen and paper.
This judicial exception is not integrated into a practical application, as Claims 3 and 15 comprise no additional limitations.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception, as Claims 3 and 15 do not recite any additional limitations. The claims, as drafted, are not patent eligible.

Claim 4 recites the additional limitation of “wherein augmenting the instance of base audio data comprises: aggressively augmenting the instance of base audio data.”
The above limitation, as drafted, recites a process that covers performance of the limitation as a mental process but for the recitation of a generic computing device. That is, other than reciting the generic computing device, nothing precludes the step from practically being performed with pen and paper.
This judicial exception is not integrated into a practical application, as Claim 4 comprises no additional limitations.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, as Claim 4 does not recite any additional limitations. The claim, as drafted, is not patent eligible.

Claim 5 recites the additional limitations of “wherein aggressively augmenting the instance of audio data comprises: processing the instance of base audio data using spectral augmentation to generate the augmented audio data”, “wherein the spectral augmentation includes time masking of the base audio data and/or frequency masking of the base audio data.”
The above limitations, as drafted, recite a process that covers performance of the limitations as a mental process but for the recitation of a generic computing device. That is, other than reciting the generic computing device, nothing precludes the steps from practically being performed with pen and paper.
This judicial exception is not integrated into a practical application, as Claim 5 comprises no additional limitations.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, as Claim 5 does not recite any additional limitations. The claim, as drafted, is not patent eligible.

Claim 6 recites the additional limitations of “wherein the base audio data includes a true occurrence of a keyword for which the initial KWS teach model is trained to predict”, “and wherein the augmented audio data fails to include the true occurrence of the keyword.”
The above limitations, as drafted, recite a process that covers performance of the limitations as a mental process but for the recitation of a generic computing device. That is, other than reciting the generic computing device, nothing precludes the steps from practically being performed with pen and paper.
This judicial exception is not integrated into a practical application, as Claim 6 comprises no additional limitations.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, as Claim 6 does not recite any additional limitations. The claim, as drafted, is not patent eligible.

Claims 7 and 16 recite the additional limitations of “wherein comparing the soft label and the generated student output comprises: generating a cross-entropy loss based on the soft label and the generated student output”; “and wherein updating the one or more portions of the KWS student model based on comparing the soft label and the generated student output comprises: updating the one or more portions of the KWS student model based on the cross-entropy loss.” 
The above limitations, as drafted, recite a process that covers performance of the limitations as a mental process but for the recitation of a generic computing device. That is, other than reciting the generic computing device, nothing precludes the steps from practically being performed with pen and paper.
This judicial exception is not integrated into a practical application, as Claims 7 and 16 comprise no additional limitations.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception, as Claims 7 and 16 do not recite any additional limitations. The claims, as drafted, are not patent eligible.

Claims 8 and 17 recite the additional limitations of “wherein the instance of base audio data is a labeled instance of base audio data”, “and wherein processing the augmented audio data using the initial KWS teacher model to generate the soft label comprises: processing the augmented audio data generated by augmenting the labeled instance of base audio data using the initial KWS teacher model to generate the soft label.” 
The above limitations, as drafted, recite a process that covers performance of the limitations as a mental process but for the recitation of a generic computing device. That is, other than reciting the generic computing device, nothing precludes the steps from practically being performed with pen and paper.
This judicial exception is not integrated into a practical application, as Claims 8 and 17 comprise no additional limitations.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception, as Claims 8 and 17 do not recite any additional limitations. The claims, as drafted, are not patent eligible.

Claims 9 and 18 recite the additional limitations of “wherein the instance of base audio data is an unlabeled instance of base audio data”, “and wherein processing the augmented audio data using the initial KWS teacher model to generate the soft label comprises: processing the augmented audio data generated by augmenting the unlabeled instance of base audio data using the initial KWS teacher model to generate the soft label.” 
The above limitations, as drafted, recite a process that covers performance of the limitations as a mental process but for the recitation of a generic computing device. That is, other than reciting the generic computing device, nothing precludes the steps from practically being performed with pen and paper.
This judicial exception is not integrated into a practical application, as Claims 9 and 18 comprise no additional limitations.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception, as Claims 9 and 18 do not recite any additional limitations. The claims, as drafted, are not patent eligible.

Claim 10 recites the additional limitations of “further comprising: receiving audio data capturing a spoken utterance which includes one or more keywords; processing the audio data using the KWS student model to generate keyword output”; “and determining the one or more keywords are present in the spoken utterance based on the keyword output.”
The above limitations, as drafted, recite a process that covers performance of the limitations as a mental process but for the recitation of a generic computing device. That is, other than reciting the generic computing device, nothing precludes the steps from practically being performed with pen and paper.
This judicial exception is not integrated into a practical application, as Claim 10 comprises no additional limitations.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, as Claim 10 does not recite any additional limitations. The claim, as drafted, is not patent eligible.

Claim 11 recites the additional limitation of “wherein the keyword output is binary classification output.”
The above limitation, as drafted, recites a process that covers performance of the limitation as a mental process but for the recitation of a generic computing device. That is, other than reciting the generic computing device, nothing precludes the step from practically being performed with pen and paper.
This judicial exception is not integrated into a practical application, as Claim 11 comprises no additional limitations.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception, as Claim 11 does not recite any additional limitations. The claim, as drafted, is not patent eligible.


Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 2, 3, 6, 7, 10, 12, 14, 15, and 16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Li et al. (Jinyu Li, Rui Zhao, Zhuo Chen, Changliang Liu, Xiong Xiao, Guoli Ye, and Yifan Gong, “DEVELOPING FAR-FIELD SPEAKER SYSTEM VIA TEACHER-STUDENT LEARNING”, 2018, Microsoft AI & Research, Redmond, WA 98052, 1-5), hereinafter referenced as Li.

Regarding Claim 1, Li teaches a method implemented by one or more processors, the method comprising: 
training an initial keyword spotting ("KWS") teacher model using a labeled training data set which includes input audio features and supervised output features (Section [3.3], Fig. 1, CTC KWS framework: the input to the CTC KWS model is the input acoustic feature, and the acoustic score is the supervised output feature. Also, Section [1.0] discloses that teacher-student learning is applied to KWS model compression);
generating augmented audio data, wherein generating the augmented audio data comprises augmenting an instance of base audio data (Section [3.1], Equation 5 represents far-field speech Y (augmented audio data) generated by convolving close-talk speech S (base audio data) and combining the result with various additive noise N at different signal-to-noise-ratio (SNR) levels (augmenting an instance of base audio data));
processing the augmented audio data using the initial KWS teacher model to generate a soft label (Section [2.0], the audio data from the source domain are processed by the teacher model to generate soft labels);
processing the augmented audio data using a KWS student model to generate student output (Section [2.0], the audio data from the source domain are processed in parallel by the student model to generate output);
and updating one or more portions of the KWS student model based on comparing the soft label and the generated student output (Section [2.0], the KWS student model is trained based on the comparison of the soft labels generated by the source (teacher) model and the target (student) output).
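
For illustration only, the augmentation characterized above (close-talk speech convolved and combined with additive noise at a chosen SNR) can be sketched as follows. This is a minimal sketch and not a reproduction of Li's Equation 5: the impulse response `rir`, the sine-wave stand-in for speech, and the helper names `convolve`, `power`, and `simulate_far_field` are all assumptions introduced here.

```python
import math
import random


def convolve(signal, impulse_response):
    """Full discrete convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def power(x):
    """Mean squared amplitude of a sequence."""
    return sum(v * v for v in x) / len(x)


def simulate_far_field(clean, impulse_response, noise, snr_db):
    """Return convolved speech plus noise scaled to the requested SNR (in dB)."""
    reverberant = convolve(clean, impulse_response)[: len(clean)]
    noise = noise[: len(reverberant)]
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    target_noise_power = power(reverberant) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [r + gain * n for r, n in zip(reverberant, noise)]


rng = random.Random(0)
clean = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]  # stand-in for speech
rir = [1.0, 0.5, 0.25]                                             # hypothetical impulse response
noise = [rng.gauss(0.0, 1.0) for _ in range(100)]
augmented = simulate_far_field(clean, rir, noise, snr_db=10.0)
```

Varying `snr_db` and the impulse response across calls yields multiple augmented instances from the same base audio.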

Regarding Claim 2, Li teaches the method of claim 1, further comprising: generating additional augmented audio data, wherein generating the additional augmented audio data comprises augmenting an additional instance of base audio data (Section [3.1], Equation 6 represents generating additional augmented data which includes room impulse responses for speech, diffuse noise, and directional noise);
processing the additional augmented audio data using the initial KWS teacher model to generate an additional soft label (Section [2.0], the audio data from the source domain are processed by the teacher model to generate soft labels);
processing the additional augmented audio data using the KWS student model to generate additional student output (Section [2.0], the audio data from the source domain are processed in parallel by the student model to generate output);
and further updating the one or more portions of the KWS student model based on comparing the additional soft label and the generated additional student output (Section [2.0], the KWS student model is trained based on the comparison of the soft labels generated by the source (teacher) model and the target (student) output).

Claim 14 is similar in scope and content to claim 2 above and is rejected under a similar rationale.

Regarding Claim 3, Li teaches the method of claim 1, further comprising: subsequent to updating the one or more portions of the student model based on comparing the soft label and the generated student output, using the KWS student model as a next instance of the KWS teacher model (Sections [2.0]-[2.2], the updated KWS student model (target) becomes the next KWS teacher model (source));
generating additional augmented audio data, wherein generating the additional augmented audio data comprises augmenting an additional instance of base audio data (Section [3.1], Equation 6 represents generating additional augmented data which includes room impulse responses for speech, diffuse noise, and directional noise);
processing the additional augmented audio data using the next instance of the KWS teacher model to generate an additional soft label (Section [2.0], the audio data from the source domain are processed by the teacher model to generate soft labels);
processing the additional augmented audio data using a next instance of the KWS student model to generate additional student output (Section [2.0], the audio data from the source domain are processed in parallel by the student model to generate output);
and updating one or more additional portions of the next instance of the KWS student model based on comparing the additional soft label and the additional student output (Section [2.0], the KWS student model is trained based on the comparison of the soft labels generated by the source (teacher) model and the target (student) output).
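
As a reading aid, the claimed sequence in which the updated student serves as the next teacher instance can be sketched as a short loop. This is a toy sketch under stated assumptions and is not drawn from Li or from the application: the two-weight logistic model, the Gaussian-jitter `augment` function, the learning rate, and the data are all hypothetical.

```python
import math
import random


def predict(weights, features):
    """Logistic model: probability that the keyword is present."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def augment(features, rng, scale=0.3):
    """Placeholder augmentation: additive Gaussian jitter (an assumption)."""
    return [x + rng.gauss(0.0, scale) for x in features]


def distill_generation(teacher, student, base_data, rng, lr=0.1, epochs=20):
    """Train a copy of `student` to match `teacher` soft labels on augmented inputs."""
    student = list(student)
    for _ in range(epochs):
        for features in base_data:
            aug = augment(features, rng)
            soft = predict(teacher, aug)   # teacher soft label
            out = predict(student, aug)    # student output
            # Gradient of the cross-entropy w.r.t. the logit is (out - soft).
            student = [w - lr * (out - soft) * x for w, x in zip(student, aug)]
    return student


rng = random.Random(0)
base_data = [[1.0, 0.2], [0.9, -0.1], [-1.0, 0.3], [-0.8, 0.0]]
teacher = [2.0, 0.0]      # stands in for the supervised initial teacher
history = [teacher]
for generation in range(3):
    student = distill_generation(teacher, [0.0, 0.0], base_data, rng)
    history.append(student)
    teacher = student     # the student becomes the next teacher instance
```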

Claim 15 is similar in scope and content to claim 3 above and is rejected under a similar rationale.

Regarding Claim 6, Li teaches the method of claim 1, wherein the base audio data includes a true occurrence of a keyword for which the initial KWS teach model is trained to predict (Section [4.2], the correct accept (CA) rate of the KWS system shows a true occurrence of a keyword for which the system was trained),
and wherein the augmented audio data fails to include the true occurrence of the keyword (Section [4.2], the false accept (FA) rate of the KWS system shows that the augmented audio data (source data and beamformed far-field data) did not include the keyword).

Regarding Claim 7, Li teaches the method of claim 1, wherein comparing the soft label and the generated student output comprises: generating a cross-entropy loss based on the soft label and the generated student output (Section [2.1], Equation 2 shows the cross-entropy loss);
and wherein updating the one or more portions of the KWS student model based on comparing the soft label and the generated student output comprises: updating the one or more portions of the KWS student model based on the cross-entropy loss (Sections [2], [2.1], the student model (small-size network) is optimized based on the cross-entropy loss).
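
For reference, a cross-entropy comparison between a teacher soft label and a student output, followed by a gradient update of the student, can be sketched as below. This is a generic sketch, not Li's Equation 2: the single-example logistic student, the feature values, the soft label of 0.9, and the learning rate are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_cross_entropy(teacher_probs, student_probs, eps=1e-12):
    """Cross-entropy H(teacher, student) over one discrete distribution."""
    return -sum(t * math.log(s + eps) for t, s in zip(teacher_probs, student_probs))


def student_step(weights, features, soft_label, lr=0.1):
    """One gradient step for a logistic student on a single example.

    soft_label is the teacher's P(keyword). For a sigmoid output, the gradient
    of the cross-entropy with respect to the logit is (student_prob - soft_label).
    """
    logit = sum(w * x for w, x in zip(weights, features))
    grad_logit = sigmoid(logit) - soft_label
    return [w - lr * grad_logit * x for w, x in zip(weights, features)]


def loss(weights, features, soft_label):
    p = sigmoid(sum(w * x for w, x in zip(weights, features)))
    return soft_cross_entropy([soft_label, 1 - soft_label], [p, 1 - p])


features = [0.5, -1.0, 2.0]
soft_label = 0.9           # hypothetical teacher posterior for "keyword present"
weights = [0.0, 0.0, 0.0]

before = loss(weights, features, soft_label)
for _ in range(50):
    weights = student_step(weights, features, soft_label)
after = loss(weights, features, soft_label)
```

The loss decreases toward the entropy of the soft label, which is its lower bound, as the student output approaches the teacher's posterior.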

Claim 16 is similar in scope and content to claim 7 above and is rejected under a similar rationale.

Regarding Claim 10, Li teaches the method of claim 1, further comprising: receiving audio data capturing a spoken utterance which includes one or more keywords; processing the audio data using the KWS student model to generate keyword output; and determining the one or more keywords are present in the spoken utterance based on the keyword output (Section [4.2], the audio data includes speech utterances which contain keywords, and the audio data was processed to generate keywords (Fig. 1, keyword detection)).

Regarding Claim 12, Li teaches a method implemented by one or more processors, the method comprising: receiving audio data capturing a spoken utterance which includes one or more keywords; processing the audio data using a keyword spotting ("KWS") model to generate keyword output (Section [4.2], the audio data includes speech utterances which contain keywords, and the audio data was processed to generate keywords (Fig. 1, keyword detection)),
wherein training the KWS model comprises: training an initial KWS teacher model portion of the KWS model using a labeled training data set which includes input audio features and supervised output features (Section [3.3], Fig. 1, CTC KWS framework: the input to the CTC KWS model is the input acoustic feature, and the acoustic score is the supervised output feature. Also, Section [1] discloses that teacher-student learning is applied to KWS model compression);
generating augmented audio data, wherein generating the augmented audio data comprises augmenting an instance of base audio data (Section [3.1], Equation 5 represents far-field speech Y (augmented audio data) generated by convolving close-talk speech S (base audio data) and combining the result with various additive noise N at different signal-to-noise-ratio (SNR) levels (augmenting an instance of base audio data));
processing the augmented audio data using the initial KWS teacher model to generate a soft label (Section [2.0], the audio data from the source domain are processed by the teacher model to generate soft labels);
processing the augmented audio data using a KWS student model portion of the KWS model to generate student output (Section [2.0], the audio data from the source domain are processed in parallel by the student model to generate output);
and updating one or more portions of the KWS student model portion of the KWS model based on comparing the soft label and the generated student output (Section [2.0], the KWS student model is trained based on the comparison of the soft labels generated by the source (teacher) model and the target (student) output);
and determining whether the one or more keywords are present in the spoken utterance based on the keyword output (Section [4.2], the audio data includes speech utterances which contain keywords, and the audio data was processed to generate keywords (Fig. 1, keyword detection)).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 4, 5, 8, 9, 13, 17, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Li as stated above, in view of Park et al. (US 20190354808 A1), hereinafter referenced as Park.

Regarding Claim 4, Li teaches the method of claim 1. However, Li fails to explicitly teach wherein augmenting the instance of base audio data comprises: aggressively augmenting the instance of base audio data.

However, Park does teach the claimed augmenting the instance of base audio data comprises: aggressively augmenting the instance of base audio data ([0032]-[0036], one or more augmentations of the base data can include aggressive augmentation (time warping operation, frequency masking operation, time masking operation)).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Park’s teaching of generating augmented data for machine-learned models with the method of teacher-student learning for KWS model optimization as taught by Li, because this would improve the generation of augmented training data, resulting in improved model performance (Park [0023]).

Regarding Claim 5, Li in view of Park teaches the method of claim 4, as mentioned above.
Park further teaches wherein aggressively augmenting the instance of audio data comprises:
processing the instance of base audio data using spectral augmentation to generate the augmented audio data, wherein the spectral augmentation includes time masking of the base audio data and/or frequency masking of the base audio data ([0081], lines 12-14, spectral augmentation; [0053], “FIG. 3C shows an augmented image generated by applying a frequency masking operation to the audio graphic image of FIG. 3A and FIG. 3D shows an augmented image generated by applying a time masking operation to the audio graphic image of FIG. 3A.”).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Park’s teaching of generating augmented data for machine-learned models with the method of teacher-student learning for KWS model optimization as taught by Li, because this would improve the generation of augmented training data, resulting in improved model performance (Park [0023]).
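
To make the masking operations discussed above concrete, frequency masking and time masking of a spectrogram can be sketched as follows. This is an illustrative sketch only and does not reproduce Park's parameters or implementation: the `spec[t][f]` list-of-frames layout, the zero fill value, the maximum mask widths, and the function names are assumptions.

```python
import random


def frequency_mask(spec, f0, width, fill=0.0):
    """Mask `width` consecutive frequency bins starting at f0 in every frame."""
    return [
        [fill if f0 <= f < f0 + width else v for f, v in enumerate(frame)]
        for frame in spec
    ]


def time_mask(spec, t0, width, fill=0.0):
    """Mask `width` consecutive time frames starting at t0."""
    return [
        [fill] * len(frame) if t0 <= t < t0 + width else list(frame)
        for t, frame in enumerate(spec)
    ]


def spec_augment(spec, rng, max_f=2, max_t=3):
    """Apply one random frequency mask and one random time mask."""
    n_frames, n_bins = len(spec), len(spec[0])
    fw = rng.randint(0, max_f)
    f0 = rng.randint(0, n_bins - fw)
    tw = rng.randint(0, max_t)
    t0 = rng.randint(0, n_frames - tw)
    return time_mask(frequency_mask(spec, f0, fw), t0, tw)


spec = [[1.0] * 8 for _ in range(10)]     # 10 frames x 8 frequency bins
masked = spec_augment(spec, random.Random(0))
fm = frequency_mask(spec, f0=2, width=2)  # bins 2 and 3 masked in every frame
tm = time_mask(spec, t0=4, width=3)       # frames 4, 5, and 6 masked
```

Both helpers return new lists, so the base spectrogram is left unmodified and can be augmented repeatedly.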

Regarding Claim 8, Li teaches the method of claim 1, wherein processing the augmented audio data using the initial KWS teacher model to generate the soft label comprises: processing the augmented audio data generated by augmenting the labeled instance of base audio data using the initial KWS teacher model to generate the soft label. However, Li fails to explicitly teach the claimed wherein the instance of base audio data is a labeled instance of base audio data.

However, Park does teach the claimed instance of base audio data is a labeled instance of base audio data (Para. [0030], line 4, the base audio signal can be labeled).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Park’s teaching of generating augmented data for machine-learned models with the student-teacher learning method for KWS model optimization taught by Li, because doing so would improve the generation of augmented training data, resulting in improved model performance (Park [0023]).

Claim 17 is similar in scope and content to claim 8 above and is rejected under a similar rationale.

Regarding Claim 9, Li teaches the method of claim 1, wherein processing the augmented audio data using the initial KWS teacher model to generate the soft label comprises: processing the augmented audio data generated by augmenting the unlabeled instance of base audio data using the initial KWS teacher model to generate the soft label. However, Li fails to explicitly teach the claimed wherein the instance of base audio data is an unlabeled instance of base audio data.

However, Park does teach the claimed instance of base audio data is an unlabeled instance of base audio data (Para. [0030], line 10, the base audio signal can be unlabeled).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Park’s teaching of generating augmented data for machine-learned models with the student-teacher learning method for KWS model optimization taught by Li, because doing so would improve the generation of augmented training data, resulting in improved model performance (Park [0023]).

Claim 18 is similar in scope and content to claim 9 above and is rejected under a similar rationale.

Regarding claim 13, Li teaches a method implemented by one or more processors, the method comprising: training an initial teacher model using a labeled training data set which includes input audio features and supervised output features (Section [3.3], Fig. 1, CTC KWS framework; the input to the CTC KWS model is the input acoustic feature, and the acoustic score is the supervised output feature);
processing the augmented audio data using the initial teacher model to generate a soft label (Section [2.0], the audio data from the source domain are processed by the teacher model to generate soft labels);
processing the augmented audio data using a student model to generate student output (Section [2.0], the audio data from the source domain are processed in parallel by the student model to generate output);
and updating one or more portions of the student model based on comparing the soft label and the generated student output (Section [2.0], the student model is trained based on a comparison of the soft labels generated by the source (teacher) model and the target (student) output).

Li fails to explicitly teach the claimed generating augmented audio data, wherein generating the augmented audio data comprises augmenting an instance of base audio data using time masking and/or frequency masking of the base audio data.

However, Park does teach the claimed generating augmented audio data, wherein generating the augmented audio data comprises augmenting an instance of base audio data using time masking and/or frequency masking of the base audio data (Para. [0053], “FIG. 3C shows an augmented image generated by applying a frequency masking operation to the audio graphic image of FIG. 3A and FIG. 3D shows an augmented image generated by applying a time masking operation to the audio graphic image of FIG. 3A.”).
 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Park’s teaching of generating augmented data for machine-learned models with the student-teacher learning method for KWS model optimization taught by Li, because doing so would improve the generation of augmented training data, resulting in improved model performance (Park [0023]).
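
As an illustrative sketch only, the claimed step of updating the student model based on comparing the soft label and the student output may be understood as a gradient step that moves the student's output distribution toward the teacher's. The cross-entropy loss, the learning rate, the two-class logits, and the simplification of treating the student's logits as its only trainable parameters are assumptions made for illustration; they are not taken from Li, Park, or the claims.

```python
import math

def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(soft_label, student_probs):
    """Cross-entropy between the teacher's soft label and the student output."""
    return -sum(p * math.log(q) for p, q in zip(soft_label, student_probs))

def distill_step(student_logits, soft_label, lr=0.5):
    """One gradient-descent step on the student logits.

    For a softmax output, the gradient of the cross-entropy with respect
    to the logits is (student_probs - soft_label).
    """
    probs = softmax(student_logits)
    return [z - lr * (q - p) for z, q, p in zip(student_logits, probs, soft_label)]

# Hypothetical teacher logits for one augmented instance: [keyword, no keyword].
teacher_logits = [2.0, 0.0]
soft_label = softmax(teacher_logits)

# The student starts undecided and is updated toward the soft label.
student_logits = [0.0, 0.0]
loss_before = cross_entropy(soft_label, softmax(student_logits))
for _ in range(20):
    student_logits = distill_step(student_logits, soft_label)
loss_after = cross_entropy(soft_label, softmax(student_logits))
```

In this sketch the soft label is a full probability distribution rather than a hard 0/1 label, which is why no ground-truth label for the augmented instance is needed to compute the update.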

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Li as stated above, in view of Guevara et al. (US 20200126537 A1), hereinafter referenced as Guevara.

Regarding Claim 11, Li teaches the method of claim 10. However, Li fails to explicitly teach wherein the keyword output is binary classification output.

However, Guevara does teach the claimed keyword output is binary classification output (Para. [0041], Figs. 4A and 4B, the neural network is optimized to output binary decision labels indicating whether or not the keyword(s) are present).

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Guevara’s teaching of an end-to-end system for spotting keywords in streaming audio with the student-teacher learning method for KWS model optimization taught by Li, because doing so would improve the creation of an effective model for spotting keywords in streaming audio (Guevara [0003]).

Conclusion
The prior art listed below is made of record and not relied upon, but is considered pertinent to applicant's disclosure.
Fukuda et al. (US 20190205748 A1): A computer-implemented method for generating soft labels for training is provided. The method includes preparing a teacher model having a teacher side class set. The method also includes obtaining a collection of class pairs for respective data units, in which each class pair includes classes labelled to a corresponding data unit from among the teacher side class set and from among a student side class set that is different from the teacher side class set. [0004]
Li et al. (US 20190287515 A1): Methods, systems, and computer programs are presented for training, with adversarial constraints, a student model for speech recognition based on a teacher model. One method includes operations for training a teacher model based on teacher speech data, initializing a student model with parameters obtained from the teacher model, and training the student model with adversarial teacher-student learning based on the teacher speech data and student speech data. [Abstract]
Ye et al. (US 20210304769 A1): Systems, methods, and devices are provided for generating and using text-to-speech (TTS) data for improved speech recognition models. A main model is trained with keyword independent baseline training data. In some instances, acoustic and language model sub-components of the main model are modified with new TTS training data. [Abstract]
Stoimenov et al. (US 20200349927 A1): Generally discussed herein are devices, systems, and methods for on-device detection of a wake word. A device can include a memory including model parameters that define a custom wake word detection model, the wake word detection model including a recurrent neural network transducer (RNNT) and a lookup table (LUT), the LUT indicating a hidden vector to be provided in response to a phoneme of a user-specified wake word, a microphone to capture audio, and processing circuitry to receive the audio from the microphone, determine, using the wake word detection model, whether the audio includes an utterance of the user-specified wake word, and wake up a personal assistant after determining the audio includes the utterance of the user-specified wake word. [Abstract]

Any inquiry concerning this communication or earlier communications from the examiner should be directed to NADIRA SULTANA, whose telephone number is (571)-272-4048. The examiner can normally be reached Monday through Friday, 7:30 AM to 5:00 PM (EST). If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached on (571)-272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/N.S./Examiner, Art Unit 2658

/RICHEMOND DORVIL/Supervisory Patent Examiner, Art Unit 2658