DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Specification
The disclosure is objected to because of the following informalities:
 In paragraphs 020, 025, 032, and 044, all incidents of   
In paragraphs 024, 042, 032, and 0166, all instances of "atmosphere" should be changed to "environment".
Appropriate correction is required.

Claim Objections
Claim 16 is objected to because of the following informalities:
"wavelength" in lines 7 and 22 should be changed to --waveform--.

"atmosphere" in line 17 should be changed to --environment--.

Claim 18 is objected to because of the following informalities:
"wavelength" in line 23 should be changed to --waveform--.

"atmosphere" in line 17 should be changed to --environment--.




Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-4 are rejected under 35 U.S.C. 103 as being unpatentable over Tashev et al. (US20190318755A1)(hereinafter "Tashev"), Yassa et al. (US20180247640A1)(hereinafter "Yassa"), and Takashi Fukuda (US20180053087A1)(hereinafter "Fukuda").

Regarding claim 1, Tashev teaches each model being recorded at one or more differing associated predetermined signal to noise ratios; (Par. 0099:” As shown in Table 2, on the test data set with seen noise, the convolutional-recurrent neural network may consistently outperform the various speech enhancement systems, including MS, DNN-SYMM, DNN-CASUAL, and RNN-NG. Specifically, the convolutional-recurrent neural network is able to improve the PESQ measure by 0.6 points without decreasing the recognition accuracy … while all the various speech enhancement systems may boost the SNR ratio, the various speech enhancement systems”, and Par. 0105:” Further, the one or more conversion domains may further be converted into one or more conversion domains, such as one or more of a modulation domain, a cepstral domain, a mel-frequency cepstral coefficient [“MFCC”] domain, a log-power frequency domain, etc. Further, after processing the audio data, if the audio data is represented in a conversion domain, the audio data may be converted into the time domain.”, and Par. 0106:” Then, at step 604, for each frequency bin, a prior signal-to-noise ratio and a posterior signal-to-noise ratio based on the plurality of frames may be calculated.").
recording a raw audio waveform and transmitting the raw audio waveform to the computerized neural network; (Par. 0019: "FIG. 6 depicts a method 600 for training a deep neural network for improved real-time audio processing …", Par. 0052: "As shown in FIG. 2, the neural network may require a synthetic data set with separately known clean speech and noise signals in order to be trained.", Par. 0107: "At step 610, a neural network model, including a plurality of neurons, configured to output the clean speech estimation and the ideal ratio mask of audio data, the plurality of neurons arranged in a plurality of layers, including at least one hidden layer, and being connected by a plurality of connections may be constructed.", and Par. 0108: "the trained neural network model configured to output the clean speech estimation and the ideal ratio mask of audio data may be outputted."). Note that for a neural network to output clean audio, it must receive raw audio at its input.
and determining whether the raw audio waveform contains human speech; (Par. 0059:” Through continued tracking and comparison of this information and over a period of time, the machine learning component may determine whether the model accurately predicts which parts of the audio data are likely to be noise and/or speech. “, and Par. 0061:” ... the machine learning component may extract features from audio data where the model suggests that a noise and/or speech is present.”).

Tashev does not teach extracting any human speech from the raw audio waveform, or a voice activity detection method comprising: training one or more computerized neural networks having a denoising autoencoder and a classifier.

Yassa teaches extracting any human speech from the raw audio waveform (Par. 0017:” ASR 140 receives the audio waveform and phonetic transcriptions from TTS 160 and creates an acoustic model by taking the audio waveforms of speech and their transcriptions (taken from 
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Yassa to extract any human speech from the raw audio waveform, in order to generate large quantities of rich, feature-augmented datasets for use with a neural network, including quality speech-based datasets with synthetic audio for a plurality of speech utterances, as evidenced by Yassa (see Par. 0017).

Neither Tashev nor Yassa teaches a voice activity detection method comprising: training one or more computerized neural networks having a denoising autoencoder and a classifier.

Fukuda teaches denoising the raw audio wave utilizing the denoising autoencoder; (Par. 0019:” Further, the back-end NN is a neural network that can be used for identifying phoneme corresponding to the input speech [that is, input feature] ....”, and Par. 0018:” The term, “front-end neural network”, may refer to a neural network which may be used for a denoising autoencoder including a feature space conversion.”).

training one or more computerized neural networks having a denoising autoencoder and a classifier (Par. 0019: "Further, the back-end NN is a neural network that can be used for identifying [classifier] phoneme corresponding to the input speech [that is, input feature] .... The back-end NN may be, for example, but not limited to, a convolutional neural network [CNN] or a deep neural network [DNN]", and Par. 0018: "The term, "front-end neural network", may refer to a neural network which may be used for a denoising autoencoder including a feature space conversion.").
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Fukuda to train a neural network having a denoising autoencoder along with one or more features, in order to train a front-end neural network and a back-end neural network so that an output layer of the front-end NN is also an input layer of the back-end NN, and to train the combined NN for speech recognition, which yielded significant improvement in speech recognition performance (see Par. 0002), as evidenced by Fukuda (see Par. 0003).

With regard to claim 2, Tashev teaches the voice activity detection method of claim 1, wherein the computerized neural network is a convolutional neural network. (Par. 0035:” … in an end-to-end model based on convolutional and recurrent neural networks for speech enhancement, the network may be data-driven, and the network may not make any assumptions about the underlying noise..”).

With regard to claim 3, Tashev teaches the voice activity detection method of claim 1, wherein the computerized neural network is a deep neural network. (Par. 0019:” FIG. 6 depicts a method 600 for training a deep neural network for improved real time audio processing, according to embodiments of the present disclosure…”).

recurrent neural network [“LSTM-RNN”] may be used for speech enhancement, which may provide improved performance of noise reduction at low SNRs.”).

Claims 5 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over Tashev, Yassa, and Fukuda as applied to claims 1 and 5, respectively, and further in view of Chang et al. (US20140365218A1)(hereinafter "Chang").

With regard to claims 5 and 6, Tashev teaches a voice activity detection method.
Tashev, Yassa, and Fukuda do not teach the voice activity detection method of claim 1, wherein the classifier is trained utilizing one or more linguistic models. 
Chang teaches wherein the classifier is trained utilizing one or more linguistic models.  (Par. 0005:” FIG. 2 shows a process for training a classifier using recognition results obtained by using different language models”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Chang to train the classifier utilizing one or more linguistic models, so that a new language model may be created using the updated training data or an existing language model may be re-trained using the updated training data; it has been found that using an adapted language model may improve the sentence error rate, as evidenced by Chang (see Par. 0055).


Chang teaches wherein the classifier is trained utilizing a plurality of linguistic models. (Par. 0005:” FIG. 2 shows a process for training a classifier using recognition results obtained by using different language models”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Chang to train the classifier utilizing a plurality of linguistic models, so that a new language model may be created using the updated training data or an existing language model may be re-trained using the updated training data; it has been found that using an adapted language model may improve the sentence error rate, as evidenced by Chang (see Par. 0055).

Claims 7-10 are rejected under 35 U.S.C. 103 as being unpatentable over Tashev, Yassa, Fukuda, and Chang as applied to claim 6, and further in view of Tu et al. ("Investigating the role of L1 in automatic pronunciation evaluation of L2 speech", July 4, 2018)(hereinafter "Tu").

With regard to claims 7-10, Tashev teaches a voice activity detection method.
With regard to claim 7, Tashev, Yassa, Fukuda, and Chang do not teach the voice activity detection method of claim 6, wherein at least one linguistic model is VoxForge.
Tu teaches wherein at least one linguistic model is VoxForge. (Par. 0002, section 2.1: "We use data from the Voxforge project and download the speech corpora for French [≈ 30 hours], German [≈ 50 hours] and Spanish [≈ 50 hours]. Kaldi scripts for the Voxforge. The dictionary for these three languages.").
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev and Chang in view of Tu to use VoxForge as at least one linguistic model, in order to extract a new utterance-level feature scheme that converts the proposed technique into a fixed-dimension vector used as an input to a statistical model to predict the accentedness of a speaker, as evidenced by Tu (see Abstract).

With regard to claim 8, Tashev, Yassa, Fukuda, and Chang do not teach the voice activity detection method of claim 6, wherein at least one linguistic model is AIShell.
Tu teaches wherein at least one linguistic model is AIShell. (Par. 0002, section 2.1: "For Mandarin, the publicly accessible AIShell Mandarin Speech corpus [approximately 150 hours training data] and the corresponding Kaldi scripts are used.").
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev and Chang in view of Tu to use AIShell as at least one linguistic model, in order to extract a new utterance-level feature scheme that converts the proposed technique into a fixed-dimension vector used as an input to a statistical model to predict the accentedness of a speaker, as evidenced by Tu (see Abstract).


Tu teaches wherein at least one linguistic model is VoxForge; and wherein at least one additional linguistic model is AISHELL. (Par. 0002, section 2.1:” For Mandarin, the publicly accessible AIShell Mandarin Speech corpus [approximately 150 hours training data] and the corresponding Kaldi scripts are used. A pronunciation dictionary is included with the dataset. For the remaining three languages [Spanish, French and German], there are no well-organized publicly available data. We use data from the Voxforge project and download the speech corpora for French [≈ 30 hours], German [≈ 50 hours] and Spanish [≈ 50 hours]. Kaldi scripts for the Voxforge. The dictionary for these three languages.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev and Chang in view of Tu to use VoxForge as at least one linguistic model and AISHELL as at least one additional linguistic model, in order to extract a new utterance-level feature scheme that converts the proposed technique into a fixed-dimension vector used as an input to a statistical model to predict the accentedness of a speaker, as evidenced by Tu (see Abstract).

With regard to claim 10, Tashev teaches a voice activity detection method.
Tashev further teaches the voice activity detection method of claim 6, wherein each linguistic model is recorded having a base truth, wherein each linguistic model is recorded at one or more of a plurality of pre-set signal to noise ratios. (Par. 0095: "Table 2 depicts ground truth clean speech. For each metric, the model achieves the best performance is highlighted in bold. In Table 2, as shown below, a comparison is shown with the following metrics signal-to-noise ratio ("SNR") (in dB), log-spectral distortion ("LSD") (in dB), mean square error ("MSE"), word error rate ("WER") in percentage, and perceptual evaluation of speech quality ("PESQ") in a range from 1 to 5 with 1 being poor and 5 being excellent.").

Claims 11 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Tashev, Yassa, Fukuda, Chang, and Tu, as applied to claims 10 and 11, respectively, and further in view of Young et al. (KR101704926B1)(hereinafter "Young").
 
With regard to claims 11 and 12, Tashev teaches a voice activity detection method.
Tashev, Yassa, Fukuda, Chang, and Tu do not teach wherein each linguistic model is recorded having a base truth, wherein each linguistic model is recorded at a plurality of pre-set signal to noise ratios. 
Young teaches wherein each linguistic model is recorded having a base truth, wherein each linguistic model is recorded at a plurality of pre-set signal to noise ratios. (Par. 0024:” After the learning is completed, the learning data is fed-forward through the learned deepening neural network to obtain the result value. Then, the model-trust algorithm is used to estimate the presence probability from the resultant value and the label value. A slope parameter and a SNRs of 5, 0, 5, 10, 15, and 20 dB, respectively, for airport, babble, car, exhibition, restaurant, street, subway and train noise. In addition, factory and office noise can be synthesized at SNRs of-5, 0, 5, 10, 15, and 20 dB, respectively, in order to evaluate the proposed voice detection method in an unseen environment. A radial basis function (RBF) kernel may be applied to the learning of the statistical model based speech detector using the SVM for comparison with the speech detection method according to the present embodiment, and the kernel parameter may be set to 1.0.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Young to incorporate a base truth and preset SNRs when using a linguistic model, in order to improve voice detection performance through accurate estimation, by modeling the non-linear distribution characteristics of the statistical model parameters appearing in each noise environment in the respective deep neural networks, as evidenced by Young (see Par. 0032).
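For illustration only, the idea of recording or synthesizing data at pre-set signal to noise ratios, as in the passage quoted from Young, can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from any cited reference: the helper names `power`, `mix_at_snr`, and `measured_snr_db` are hypothetical, signals are plain Python lists of equal length, and SNR is taken as 10·log10 of the ratio of mean speech power to mean noise power.

```python
import math

def power(x):
    """Mean power of a signal given as a list of samples."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then add it.

    Hypothetical helper: assumes both lists have equal length and nonzero power.
    """
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

def measured_snr_db(speech, mixture):
    """SNR in dB of a mixture, given the known clean speech (the ground truth)."""
    residual = [m - s for m, s in zip(mixture, speech)]
    return 10 * math.log10(power(speech) / power(residual))
```

Calling `mix_at_snr` once per value in a list such as -5, 0, 5, 10, 15, and 20 dB would yield one noisy copy of the same clean recording per pre-set SNR, with the clean recording retained as the ground truth.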

With respect to claim 12, Tashev further teaches the voice activity detection method of claim 11, wherein the plurality of pre-set signal to noise ratios range between 0 dB and 35 dB. (Par. 0063: "A resulting file SNR may be limited to the range of [0, +30] dB. All signals were sampled at 16 kHz sampling rate and stored with 24 bits precision.").

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Tashev, Yassa, Fukuda, and Chang, as applied to claim 6, in further view of Garimella et al. (US9886948B1)(hereinafter "Garimella").

With respect to claim 13, Tashev teaches a voice activity detection method.
Tashev, Yassa, Fukuda, and Chang do not teach the voice activity detection method of claim 6, wherein the raw audio waveform is recorded on a local computational device, and wherein method further comprises a step of transmitting the raw audio waveform to a remote server, wherein the remote server contains the computational neural network.
Garimella teaches wherein the raw audio waveform is recorded on a local computational device, (Col 6, lines 21 – 24:” Audio data captured by a remote microphone 104 [e.g., a microphone coupled to or integrated with a mobile phone, tablet computer, notebook computer, desktop computer, set-top box, television, home stereo, or the like]”).
and wherein method further comprises a step of transmitting the raw audio waveform to a remote server (Col 6, lines 18 – 30:” The speech processing system 100 may be implemented on a server computing device….  Audio data captured by a remote microphone 104 … may be transmitted to the speech processing system 100 via a network, such as a local area network …”).
wherein the remote server contains the computational neural network. (Col. 1, lines 60-62: “FIG. 5 is a block diagram of an illustrative computing system configured to train and/or use a neural network according to some embodiments”, and Col 6, lines 18 – 20:”The speech server computing device, such as the network-accessible server device shown in FIG. 5 “).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Garimella to record the audio waveform on a local device and transmit the raw audio waveform to a remote server, where the remote server contains the computational neural network, in order to enhance robustness and maintain a desired level of accuracy when data from certain feature streams is corrupt, invalid, erroneous, or otherwise undesirable, as evidenced by Garimella (see Col 3, lines 13-16).

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Tashev, Yassa, Fukuda, and Chang, as applied to claim 6, in further view of Gruenstein et al. (US20180330728A1)(hereinafter "Gruenstein").

With respect to claim 14, Tashev teaches a voice activity detection method.
Tashev, Yassa, Fukuda, and Chang do not teach the voice activity detection method of claim 6, wherein the raw audio waveform is recorded on a local computational device, and wherein the local computational device contains the computational neural network.
Gruenstein teaches wherein the raw audio waveform is recorded on a local computational device (Par. 0053:” To detect an activation hotword and receive voice queries uttered in a local environment, the device 200 may include one or more microphones 202. The record audio signals detected by the microphones 202 and process the audio with a hotworder 204.”).
and wherein the local computational device contains the computational neural network. (Par. 0053:” The hotworder 204 may use classifying windows to process these audio features using, for example, a support vector machine, a machine-learned neural network, or other models.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Gruenstein to record the waveform on a local device, wherein the local device contains the neural network, in order to identify an illegitimate voice query in a request received from a client device at a later time, as evidenced by Gruenstein (see Par. 0004).

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Tashev, Yassa, Fukuda, Chang, and Gruenstein as applied to claim 14, and further in view of Song et al. ("Binary Neural Network For Vad and Wakeup", July 2018)(hereinafter "Song").

With respect to claim 15, Tashev teaches a voice activity detection method.
Tashev, Yassa, Fukuda, Chang, and Gruenstein do not teach the voice activity detection method of claim 14, wherein the computational neural network is compressed. 
Song teaches wherein the computational neural network is compressed. ("… compressed neural network BNN and BWN with optimized Batch Normalization layer followed by a posterior smoothing method for VAD and wakeup task").
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Song to employ a compressed neural network, in order to dramatically reduce and optimize the computing cost of training and the running time, as evidenced by Song (see Page 306, Introduction, right-hand column).

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Tashev, Jeon et al. (US20190197121A1)(hereinafter "Jeon"), Ravindran et al. (US20160284349A1)(hereinafter "Ravindran"), Fukuda, Alton Konchitsky (US20130060567A1)(hereinafter "Konchitsky"), and Lee et al. (US20130339015A1)(hereinafter "Lee").

With respect to claim 16, Tashev teaches a voice activity detection system, the system comprising:  a local computational system, the local computational system comprising: processing circuitry; (Par. 0020:” FIG. 7 depicts a high-level illustration of an exemplary computing device [processing circuitry] that may be used in accordance with the systems, methods, and computer-readable media disclosed herein, according to embodiments of the present disclosure”).
microphone varying from 1 to 3 meters, were used.”).
a non-transitory computer-readable media being operatively connected to the processing circuitry; (Par. 0020:” FIG. 7 depicts a high-level illustration of an exemplary computing device that may be used in accordance with the systems, methods, and computer-readable media disclosed herein, according to embodiments of the present disclosure”).
and a classifier module, (Par. 0005:” may also be applicable to regressive processing and classification problems”, and Par. 0103:” For example, aspects of the present disclosure may also improve classification tasks, such as source separation and microphone beam forming, as well as estimation tasks, such as acoustic echo cancellation.”).
wherein the non-transitory computer-readable media contains instructions for the processing circuitry to perform the following tasks: (Par. 0123:” Various functions described herein may be implemented in hardware, software, or any combination thereof. If implemented in software, the functions may be stored on and/or transmitted over as one or more instructions or code on a computer-readable medium.”).

Tashev does not teach transmit the recorded raw audio waveforms to the remote server; the remote server having one or more computerized neural networks, wherein the computerized neural networks of the remote server are trained on a plurality of acoustic models, wherein each of the plurality of acoustic models represent a particular linguistic dataset recorded in one or more associated noise predetermined signal to noise ratios; a 



Jeon teaches wherein the remote server contains the computational neural network. (Par. 0099:” In addition, in one or more embodiments one or more of an example sub-neural network and the main neural network, e.g., including an acoustic and/or language model, and a decoder may be implemented by one or more remote servers, as the speech recognizer 820, or by the speech recognizer 820 of the electronic device 800.”)
wherein the computerized neural networks of the remote server are trained on a plurality of acoustic models (Par. 0098:”Referring to FIG. 8, in an embodiment, the electronic device 800 includes a speech receiver 810, a speech recognizer 820, and a processor 830, in training apparatuses described above with respect to FIGS. 1-7”, and Par. 0099:” In addition, in one or more embodiments one or more of an example sub-neural network and the main neural network, e.g., including an acoustic and/or language model, and a decoder may be implemented by one or more remote servers, as the speech recognizer 820, or by the speech recognizer 820 of the electronic device 800.”, and Par. 0054:”Examples set forth hereinafter set forth hardware for recognizing a voice using one or more neural networks, as well as for training one or more neural networks for subsequent use in such voice recognition.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Jeon so that the neural networks of the remote server are trained on a plurality of acoustic models, in order to perform specialized neural network training and thereby gain the capability of generating a relatively accurate output for an input pattern on which the neural network may not have been trained, as evidenced by Jeon (see Par. 0003).

Tashev, and Jeon do not teach wherein each of the plurality of acoustic models represent a particular linguistic dataset recorded in one or more associated noise predetermined signal to noise ratios; a denoising autoencoder module, and wherein the remote server contains processing circuitry configured to utilize the denoising autoencoder module to perform a denoising operation on the recorded waveform,  and utilize the classifier to classify the recorded waveforms as speech or non-speech, utilize the microphone to record raw audio waveforms from an ambient environment; transmit the recorded raw audio waveforms to the 

Ravindran teaches wherein each of the plurality of acoustic models represent a particular linguistic dataset recorded in one or more associated noise predetermined signal to noise ratios; (Par. 0045:” Process 300 may include “modify at least one parameter used to perform speech recognition on the audio data and depending on the characteristic” 306. Also as explained in greater detail herein, the parameters used to perform the ASR engine computations using the acoustic models and/or language models may be modified depending on the characteristic in order to reduce the computational load or increase the quality of the speech recognition without increasing the computational load. For one optional example, noise reduction during feature extraction may avoid extraction of an identified noise or sound. For other examples, identity of the types of sounds in the noise of the audio data, or identification of the user's voice, may be used to select an acoustic model that de-emphasizes undesired sounds in the audio data. Also, the SNR of the audio as well as the ASR indicators….”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Ravindran to employ acoustic models represent a particular linguistic dataset recorded in one or more associated noise predetermined signal to noise ratios in order to extend battery life on small devices using ASR by dynamically selecting the ASR parameters based on the environment in 

Tashev, Jeon, and Ravindran do not teach a denoising autoencoder module, and wherein the remote server contains processing circuitry configured to utilize the denoising autoencoder module to perform a denoising operation on the recorded waveform, and utilize the classifier to classify the recorded waveforms as speech or non-speech, utilize the microphone to record raw audio waveforms from an ambient environment; transmit the recorded raw audio waveforms to the remote server; a remote server configured to receive recorded waveforms from the local computational system; extract the speech from the recorded raw audio waveforms, perform a speech-to-text operation, and transmit one or more extracted strings of speech characters back to the local computational system.

Fukuda teaches a denoising autoencoder module, (Par. 0019:” Further, the back-end NN is a neural network that can be used for identifying phoneme corresponding to the input speech [that is, input feature] ....”, and Par. 0018:” The term, “front-end neural network”, may refer to a neural network which may be used for a denoising autoencoder including a feature space conversion.”).

and wherein the remote server contains processing circuitry configured to utilize the denoising autoencoder module to perform a denoising operation on the recorded waveform, (Par. 0134:” The computer readable program instructions may execute entirely on the user's remote computer or entirely on the remote computer or server.”, and 0019:” Further, the back-end NN is a neural network that can be used for identifying phoneme corresponding to the input speech (that is, input feature). .... The back-end NN may be, for example, but not limited to, a convolutional neural network (CNN) or a deep neural network (DNN).”, and 0018:” The term, “front-end neural network”, may refer to a neural network which may be used for a denoising autoencoder including a feature space conversion.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Fukuda to train a neural network having a denoising autoencoder along with one or more features, in order to train a front-end neural network and a back-end neural network so that an output layer of the front-end NN is also an input layer of the back-end NN, and to train the combined NN for speech recognition, which yielded significant improvement in speech recognition performance (see Par. 0002), as evidenced by Fukuda (see Par. 0003).

Tashev, Jeon, Ravindran, and Fukuda do not teach and utilize the classifier to classify the recorded waveforms as speech or non-speech, utilize the microphone to record raw audio waveforms from an ambient environment; transmit the recorded raw audio waveforms to the remote server; a remote server configured to receive recorded waveforms from the local computational system; extract the speech from the recorded raw audio waveforms, perform a 

Konchitsky teaches and utilize the classifier to classify the recorded waveforms as speech or non-speech, (Par. 0104:” … the segmented data undergoes voice activity detection 513 ["VAD"] wherein speech and noise metrics are continuously calculated from the input speech waveform which are then used to classify speech and non-speech regions.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Konchitsky to utilize a classifier to classify the recorded waveforms as speech or non-speech in order to develop an adaptive noise reduction scheme which reduces the background noise in the front-end to improve the performance of a speech recognition engine, as evidenced by Konchitsky (see Par. 0003).

Tashev, Jeon, Ravindran, Fukuda, and Konchitsky do not teach utilize the microphone to record raw audio waveforms from an ambient environment; transmit the recorded raw audio waveforms to the remote server; a remote server configured to receive recorded waveforms from the local computational system; extract the speech from the recorded raw audio waveforms, perform a speech-to-text operation, and transmit one or more extracted strings of speech characters back to the local computational system.

convert the received user's voice to text information, and transmit the text information to the terminal apparatus 100.”, and 0049:” the first server 200 may utilize the algorithms for the Speech to Text (STT) to convert the voice signals transmitted from the terminal apparatus 100 into the text information.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Lee to extract the speech from the recorded raw audio waveforms, perform a speech-to-text operation, and transmit one or more extracted strings of speech characters back to the local computational system in order to determine the user's intention for uttering in response to the text information received from the terminal apparatus 100, and generate the control command and the response information in response to the user's intention for uttering, as evidenced by Lee (see Par. 0102).

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Tashev, Jeon, Ravindran, Fukuda, Konchitsky, and Lee as applied to claim 16, and in further view of Chang and Tu.

Tashev teaches a voice activity detection system.


Chang teaches wherein the classifier is trained utilizing a plurality of linguistic models, (Par. 0005:” FIG. 2 shows a process for training a classifier using recognition results obtained by using different language models”).

Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Chang to train the classifier utilizing one or more linguistic models in order to create a new language model by using the updated training data, or to re-train an existing language model using the updated training data; it has been found that using an adapted language model may improve the sentence error rate, as evidenced by Chang (see Par. 0055).
Neither Tashev nor Chang teach wherein at least one linguistic model is VoxForge™ and at least one linguistic model is AIShell.

Tu teaches wherein at least one linguistic model is VoxForge™ and at least one linguistic model is AIShell. (Par. 0002, section 2.1:” For Mandarin, the publicly accessible AIShell Mandarin Speech corpus [approximately 150 hours training data] and the corresponding Kaldi scripts are used. A pronunciation dictionary is included with the dataset. For the remaining three languages [Spanish, French and German], there are no well-organized publicly available data. We use data from the Voxforge project and download the speech corpora for French [≈ 30 hours], German [≈ 50 hours] and Spanish [≈ 50 hours]. Kaldi scripts for the Voxforge. The dictionary for these three languages.”).

Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev and Chang in view of Tu to incorporate VoxForge as at least one linguistic model and AIShell as at least one additional linguistic model in order to extract a new utterance-level feature scheme to convert the technique proposed into a fixed-dimension vector which is used as an input to a statistical model to predict the accentedness of a speaker, as evidenced by Tu (see Abstract).

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Tashev, Jeon, Ravindran, Fukuda, Konchitsky, Lee, and Yassa.

With respect to claim 18, Tashev teaches a vehicle comprising a voice activity detection system, the system comprising: a local computational system, the local computational system further comprising: processing circuitry; (Par. 0020:” FIG. 7 depicts a high-level illustration of an exemplary computing device [processing circuitry] that may be used in accordance with the systems, methods, and computer-readable media disclosed herein, according to embodiments of the present disclosure; and”).
a microphone operatively connected to the processing circuitry; (Par. 0002:” 48 room impulse responses [“RIR”], obtained from a room with T60=300 ms and distances between the microphone varying from 1 to 3 meters, were used.”, and Par. 0103:” While the present disclosure specifically discusses audio processing … such as source separation and microphone beam forming, as well as estimation tasks, such as acoustic echo cancellation.”).
a non-transitory computer-readable media being operatively connected to the processing circuitry; (Par. 0020:” FIG. 7 depicts a high-level illustration of an exemplary computing device that may be used in accordance with the systems, methods, and computer-readable media disclosed herein, according to embodiments of the present disclosure”).
one or more computerized neural networks including: (Par. 0113:” … the computing device 700 may be used in a system that processes data, such as audio data, using a deep neural network, according to embodiments of the present disclosure.”).

and a classifier module, (Par. 0005:” may also be applicable to regressive processing and classification problems.”, and 0103:”For example, aspects of the present disclosure may also improve classification tasks, such as source separation and microphone beam forming, as well as estimation tasks, such as acoustic echo cancellation.”).
wherein the non-transitory computer-readable media contains instructions for the processing circuitry to perform the following tasks: (Par. 0123:”Various functions described herein may be implemented in hardware, software, or any combination thereof. If implemented in software, the functions may be stored on and/or transmitted over as one or more instructions or code on a computer-readable medium.”).
utilize the microphone to record raw audio waveforms from an ambient environment; (Par. 0007:” The audio signals detected by microphone 104 are converted into electrical signals microphone varying from 1 to 3 meters, were used.”).

Tashev does not teach wherein the computerized neural networks are trained on a plurality of acoustic models, wherein each of the plurality of acoustic models represent a particular linguistic dataset recorded in one or more associated noise predetermined signal to noise ratios; transmit the recorded raw audio waveforms to the one or more computerized neural networks; a denoising autoencoder module, and wherein at least one computerized neural network is configured to utilize the denoising autoencoder module to perform a denoising operation on the recorded waveform, and utilize the classifier to classify the recorded waveforms as speech or non-speech, extract the speech from the recorded raw audio waveforms, perform a speech-to-text operation, and transmit one or more extracted strings of speech characters back to the local computational system, transmit the recorded raw audio waveforms to the one or more computerized neural networks.

Jeon teaches wherein the computerized neural networks are trained on a plurality of acoustic models, (Par. 0099:” In addition, in one or more embodiments one or more of an example sub-neural network and the main neural network, e.g., including an acoustic and/or language model, and a decoder may be implemented by one or more remote servers, as the speech recognizer 820, or by the speech recognizer 820 of the electronic device 800.”).



Neither Tashev nor Jeon teach wherein each of the plurality of acoustic models represent a particular linguistic dataset recorded in one or more associated noise predetermined signal to noise ratios; a denoising autoencoder module, and wherein at least one computerized neural network is configured to utilize the denoising autoencoder module to perform a denoising operation on the recorded waveform, and utilize the classifier to classify the recorded waveforms as speech or non-speech, extract the speech from the recorded raw audio waveforms, perform a speech-to-text operation, and transmit one or more extracted strings of speech characters back to the local computational system, transmit the recorded raw audio waveforms to the one or more computerized neural networks.

Ravindran teaches wherein each of the plurality of acoustic models represent a particular linguistic dataset recorded in one or more associated noise predetermined signal to noise ratios; (Par. 0045:” Process 300 may include “modify at least one parameter used to perform speech recognition on the audio data and depending on the characteristic” 306. Also as explained in greater detail herein, the parameters used to perform the ASR engine acoustic models and/or language models may be modified depending on the characteristic in order to reduce the computational load or increase the quality of the speech recognition without increasing the computational load. For one optional example, noise reduction during feature extraction may avoid extraction of an identified noise or sound. For other examples, identity of the types of sounds in the noise of the audio data, or identification of the user's voice, may be used to select an acoustic model that de-emphasizes undesired sounds in the audio data. Also, the SNR of the audio as well as the ASR indicators….”).

Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Ravindran to employ acoustic models that represent a particular linguistic dataset recorded in one or more associated noise predetermined signal to noise ratios in order to extend battery life on small devices using ASR by dynamically selecting the ASR parameters based on the environment in which an audio capture device (such as a microphone) is being operated, as evidenced by Ravindran (see Par. 0023).

Tashev, Jeon, and Ravindran do not teach a denoising autoencoder module, and wherein at least one computerized neural network is configured to utilize the denoising autoencoder module to perform a denoising operation on the recorded waveform, and utilize the classifier to classify the recorded waveforms as speech or non-speech, extract the speech from the recorded raw audio waveforms, perform a speech-to-text operation, and transmit one 

Fukuda teaches a denoising autoencoder module, (Par. 0019:” Further, the back-end NN is a neural network that can be used for identifying phoneme corresponding to the input speech [that is, input feature] ....”, and Par. 0018:” The term, “front-end neural network”, may refer to a neural network which may be used for a denoising autoencoder including a feature space conversion.”).
and wherein at least one computerized neural network is configured to utilize the denoising autoencoder module to perform a denoising operation on the recorded waveform, (Par. 0134:” The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.”, and 0019:” Further, the back-end NN is a neural network that can be used for identifying phoneme corresponding to the input speech (that is, input feature). .... The back-end NN may be, for example, but not limited to, a convolutional neural network (CNN) or a deep neural network (DNN).”, and 0018:” The term, “front-end neural network”, may refer to a neural network which may be used for a denoising autoencoder including a feature space conversion.”).

Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Fukuda to train 

Tashev, Jeon, Ravindran, and Fukuda do not teach and utilize the classifier to classify the recorded waveforms as speech or non-speech, extract the speech from the recorded raw audio waveforms, perform a speech-to-text operation, and transmit one or more extracted strings of speech characters back to the local computational system, transmit the recorded raw audio waveforms to the one or more computerized neural networks.

Konchitsky teaches and utilize the classifier to classify the recorded waveforms as speech or non-speech, (Par. 0104:” … the segmented data undergoes voice activity detection 513 ["VAD"] wherein speech and noise metrics are continuously calculated from the input speech waveform which are then used to classify speech and non-speech regions.”).

Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Konchitsky to classifier to classify the recorded waveforms as speech or non-speech in order to develop an adaptive noise reduction scheme which reduces the background noise in the front-end to 

Tashev, Jeon, Ravindran, Fukuda, and Konchitsky do not teach extract the speech from the recorded raw audio waveforms, perform a speech-to-text operation, and transmit one or more extracted strings of speech characters back to the local computational system, transmit the recorded raw audio waveforms to the one or more computerized neural networks.

Lee teaches extract the speech from the recorded raw audio waveforms, perform a speech-to-text operation, and transmit one or more extracted strings of speech characters back to the local computational system. (Par. 0040:” ...first server 200 may convert the received user's voice to text information, and transmit the text information to the terminal apparatus 100.”, and 0049:” the first server 200 may utilize the algorithms for the Speech to Text (STT) to convert the voice signals transmitted from the terminal apparatus 100 into the text information.”).

Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Lee to extract the speech from the recorded raw audio waveforms, perform a speech-to-text operation, and transmit one or more extracted strings of speech characters back to the local computational system in order to determine at least one of a gender and an age of the user, and convert the 

Tashev, Jeon, Ravindran, Fukuda, Konchitsky, and Lee do not teach transmit the recorded raw audio waveforms to the one or more computerized neural networks.

Yassa teaches transmit the recorded raw audio waveforms to the one or more computerized neural networks; (Par. 0024:” At step 410, speech input module 120 obtain human speech from audio source 115. At Step 420, audio source 115 transmits the human speech to NN [Neural Network] 330.”).

Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Yassa to transmit the recorded raw audio waveforms to the one or more computerized neural networks, in order to generate large quantities of rich, feature augmented datasets for use with a neural network, which is directed to generating quality speech-based datasets including synthetic audio for a plurality of speech utterances, as evidenced by Yassa (see Par. 0017).


Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Tashev, Jeon, Ravindran, Fukuda, Konchitsky, Lee, and Yassa, as applied to claim 18, in further view of Chang, Tu, and Song.

Tashev teaches a vehicle comprising a voice activity detection system, the system comprising.
Tashev does not teach the vehicle of claim 18, wherein the classifier is trained utilizing a plurality of linguistic models, wherein at least one linguistic model is VoxForge™ and at least one linguistic model is AIShell; and wherein the computational neural network is compressed.

Chang teaches wherein the classifier is trained utilizing a plurality of linguistic models, (Par. 0005:” FIG. 2 shows a process for training a classifier using recognition results obtained by using different language models”).

Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Chang to train the classifier utilizing one or more linguistic models in order to create a new language model by using the updated training data, or to re-train an existing language model using the updated training data; it has been found that using an adapted language model may improve the sentence error rate, as evidenced by Chang (see Par. 0055).

Neither Tashev nor Chang teach wherein at least one linguistic model is VoxForge™ and at least one linguistic model is AIShell; and wherein the computational neural network is compressed.
AIShell Mandarin Speech corpus [approximately 150 hours training data] and the corresponding Kaldi scripts are used. A pronunciation dictionary is included with the dataset. For the remaining three languages [Spanish, French and German], there are no well-organized publicly available data. We use data from the Voxforge project and download the speech corpora for French [≈ 30 hours], German [≈ 50 hours] and Spanish [≈ 50 hours]. Kaldi scripts for the Voxforge. The dictionary for these three languages.”).

Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev and Chang in view of Tu to incorporate VoxForge as at least one linguistic model and AIShell as at least one additional linguistic model in order to extract a new utterance-level feature scheme to convert the technique proposed into a fixed-dimension vector which is used as an input to a statistical model to predict the accentedness of a speaker, as evidenced by Tu (see Abstract).

Tashev, Chang, and Tu do not teach and wherein the computational neural network is compressed.
Song teaches wherein the computational neural network is compressed. (Page 310, Conclusion:” We have presented a deep compressed neural network BNN and BWN with optimized Batch Normalization layer followed by a posterior smoothing method for VAD and wakeup task”).

Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Song to employ a compressed neural network, in order to dramatically reduce and optimize the computing cost of training and running time, as evidenced by Song (see Page 306, Introduction, right-hand column).

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Tashev, Jeon, Ravindran, Fukuda, Konchitsky, Lee, and Yassa, as applied to claim 18, in further view of Adrien Daniel (US20180012120A1) (hereinafter "Daniel").

Tashev teaches a vehicle comprising a voice activity detection system, the system comprising.
Tashev, Yassa, Jeon, Ravindran, Fukuda, Konchitsky, and Lee do not teach the vehicle of claim 18, wherein the vehicle is one of an automobile, a boat, or an aircraft.
Daniel teaches wherein the vehicle is one of an automobile, a boat, or an aircraft. (Par. 0053:” As mentioned above, the presently disclosed method and system are particularly useful for facilitating the detection of audio patterns. For example, the following use cases of the presently disclosed method and system are envisaged: audio context recognition (e.g., car, office, park), predefined audio pattern recognition (e.g. baby cry, glass breaking, fire alarm), speaker authentication/recognition, voice activity detection (i.e., detection of the presence of 

Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Tashev in view of Daniel to employ a vehicle that is one of an automobile, a boat, or an aircraft, in order to recognize particular events or contexts (e.g., starting a car or being present in a running car) and to distinguish and identify different speakers, and furthermore, it may be useful to make such detections easier, as evidenced by Daniel (see Par. 0002).






Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. Eller et al. (US9129605B2) teach a system and method for voice and speech analysis which correlates a speaker signal source and a normalized signal comprising measurements of input acoustic data to a database of language, dialect, accent, and/or speaker attributes in order to create a transcription of the input acoustic data.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/D.A./


/Paras D Shah/ Primary Examiner, Art Unit 2659
04/05/2021