Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119 (a)-(d). The certified copy has been filed in parent Application No. CN201710824269.5, filed on 09/13/2017 and Application No. PCT/CN2018/102982, filed on 08/29/2018.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 11/08/2019, 03/16/2020, and 11/09/2021 are being considered by the examiner.
Drawings
The drawing submitted on 11/08/2019 has been accepted by the examiner.
Response to Amendment
Claims 1-20 are currently pending; among them, claims 1, 5, 8, 12, 15, and 19 are independent claims and have been amended. Claims 2, 9, and 16 have been cancelled.
Response to Arguments
Applicant's arguments filed 10/15/2021 have been fully considered but they are not persuasive for the following reasons:

Applicant Argument: The noise-vector and the non-speech vectors of Krist are not relevant to the target result having at least two speech categories and at least two noise categories. Therefore, Krist fails to teach or suggest at least "wherein the target result comprises at least one of at least two speech categories and at least two noise categories," as claimed.
Examiner Response: The examiner respectfully disagrees with the applicant’s bare assertion regarding the teaching of Krist et al. with respect to a broad limitation.
First, the disclosure is silent as to what the two speech categories and the two noise categories are, and on what basis each of the two speech categories and two noise categories is defined. This information is needed for the claim to be properly interpreted and evaluated with respect to the prior art teaching and the rejection.
Second, the amended independent claims, which incorporate the limitation of claim 2, recite “target result comprising at least one of at least two speech categories and at least two noise categories,” which is not the same as a “result having at least two speech categories and at least two noise categories” as the applicant argues. The applicant contradicts this argument by going on, in the same argument, to recite the limitation as claimed.
The claim recites either “two speech categories” or “two noise categories,” not both. The word “comprising” is open-ended and broad. The limitation “target result comprising at least one of at least two speech categories and at least two noise categories” may therefore be interpreted in either of two ways: as an input, from which the target result was derived, having at least one of the at least two speech categories and at least two noise categories; or as the target result itself having at least one of at least two speech categories or two noise categories.

Therefore, Krist et al. teaches the broad limitation, and the rejection of all independent claims remains the same.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 3, 5-8, 10, 12-15, 17, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Kristjansson (US 2017/0092268 A1), hereinafter referred to as Krist, in view of Sainath et al. (US 2015/0161995 A1).

Regarding Claims 1, 5, 8, 12, 15, and 19, Krist teaches: A method for establishing a voice activity detection model, the method being performed by an execution device, the method comprising: obtaining a training audio file and a target result of the training audio file ([0030] The computing system 120 receives the audio signal 112 and obtains information about acoustic features of the audio signal 112. For example, the computing system 120 may generate a set of feature vectors 122, where each feature vector 122 indicates audio characteristics during a different portion or window of the audio signal 112. [0031] The computing system 120 can receive information about the noise environment 124. [0034] In the illustrated example, the computing system 120 inputs the feature vectors 122, the noise vector 124 and the additional data 126 to the neural network 140. The neural network 140 has been trained to act as an acoustic model. [0045] The neural network 270 has been trained to estimate likelihoods that a combination of feature vectors and a noise-vector and non-speech vectors that represent particular phonetic units. [0088] Forward propagation through the neural network produces outputs at an output layer of the neural network. The outputs may be compared with data indicating correct or desired outputs that indicate that the received feature vector corresponds to the acoustic state indicated in a received label for the feature vector. [0089] The process 500 may be repeated for feature vectors extracted from multiple different utterances in a set of training data. For each utterance or audio recording in the training data, a noise-vector may be calculated based on characteristics of the utterance as a whole. Whenever a feature vector for a particular utterance is provided as input to the neural network, the noise-vector calculated for the particular utterance may also be input to the neural network at the same time. 
During training, the frames selected for training can be selected randomly from a large set, so that frames from the same utterance are not processed consecutively.); framing the training audio file to obtain an audio frame; extracting an audio feature of the audio frame, the audio feature comprising at least two types of features, and one of the at least two types of features comprising an energy ([0037] The computing system 120 receives data about an audio signal 210 that includes noise and speech to be enhanced or recognized. The computing system 120 or another system then performs feature extraction on the audio signal 210. [0038] The computing system 120 performs a Fast Fourier Transform (FFT) on the audio in each window 220. The results of the FFT are shown as time-frequency representations 230 of the audio in each window 220. From the FFT data for a window 220, the computing system 120 extracts features that are represented as an acoustic feature vector 240 for the window 220. The acoustic features may be magnitude spectrum values or log-magnitude spectrum values. The acoustic features may be determined by binning according to filterbank energy coefficients, using a Mel-Frequency Cepstral Component (MFCC) transform, using a perceptual linear prediction (PLP) transform, or using other techniques. In some implementations, the logarithm of the energy in each of various bands of the FFT may be used to determine acoustic features. [0076] The feature vector and the data indicative of the noise environment is provided as input to a neural network (406). Multiple feature vectors may be provided as part of the input.); inputting the extracted audio feature as an input to a deep neural network model ([0045] The neural network 270 has been trained to estimate likelihoods that a combination of feature vectors and a noise-vector and non-speech vectors that represent particular phonetic units. 
[0046] To enhance or recognize speech in the audio signal 210 using the neural network 270, the computing system 120 inputs the noise-environment-vector 250 and 244 at the input layer 271 of the neural network 270 with different sets of acoustic feature vectors 240. Many inputs combining acoustic feature vectors and a noise-vector can be used to train the neural network 270, and the various training data sets can include acoustic feature vectors and noise-vectors derived from utterances from multiple speakers. [0076] The feature vector and the data indicative of the noise environment is provided as input to a neural network (406). Multiple feature vectors may be provided as part of the input. The feature vector(s) and the data indicative of the latent variables are input together (e.g., simultaneously) as part of a single input data set. For example, the feature vector(s) and data indicative of the noise environment may be combined into a single input vector which is input to the neural network.); performing information processing on the audio feature through a hidden layer of the deep neural network model ([0044] The computing system 120 uses a neural network 270 that can serve as an acoustic model and indicate likelihoods that acoustic feature vectors 240 represent different phonetic units. The neural network 270 includes an input layer 271, a number of hidden layers 272a-272c, and an output layer 273. The neural network 270 receives a noise-vector 250 and non-speech vectors 244 as input as well as receiving acoustic feature vectors 245. Many typical neural networks used for speech enhancement or recognition include input connections for receiving only acoustic feature information. By contrast, the neural network 270 receives acoustic feature information augmented with additional information such as a noise-vector and non-speech vectors. 
For example, the first hidden layer 272a has connections from the noise-vector input portion of the input layer 271, where such connections are not present in typical neural networks used for speech enhancement or recognition.), and outputting the processed audio feature through an output layer of the deep neural network model, to obtain a training result ([0047] At the output layer 273, the neural network 270 indicates likelihoods that the speech in the window 220 under analysis (e.g., the window w.sub.7 corresponding to acoustic feature vector v.sub.7) corresponds to specific phonetic units.); determining a bias between the training result and the target result, and inputting the bias as an input to an error back propagation mechanism([0088] Forward propagation through the neural network produces outputs at an output layer of the neural network. The outputs may be compared with data indicating correct or desired outputs that indicate that the received feature vector corresponds to the acoustic state indicated in a received label for the feature vector. A measure of error between the actual outputs of the neural network and the correct or desired outputs is determined. The error is then back-propagated through the neural network to update the weights within the neural network.); wherein the target result comprises at least one of at least two speech categories and at least two noise categories (Fig.2, [0040] Each acoustic feature vector 240 represents characteristics of the portion of the audio signal 210 within its corresponding window 220. [0042] The computing system 120 also obtains copies of feature vectors that represent the background environment 244. In this example 244 contains copies of recent feature vectors that represent the dynamic, time varying noise background signal. These vectors may for example contain background noise such as car noise, road noise, office noise, babble noise or voices of background speakers. 
[0044] The computing system 120 uses a neural network 270 that can serve as an acoustic model and indicate likelihoods that acoustic feature vectors 240 represent different phonetic units. [0045] The neural network 270 has been trained to estimate likelihoods that a combination of feature vectors and a noise-vector and non-speech vectors that represent particular phonetic units. For example, during training, input to the neural network 270 may be a combination of acoustic feature vectors and a noise-vector corresponding to the utterance from which the acoustic feature vectors were derived. Many inputs combining acoustic feature vectors and a noise-vector can be used to train the neural network 270, and the various training data sets can include acoustic feature vectors and noise-vectors derived from utterances from multiple speakers. [0047] At the output layer 273, the neural network 270 indicates likelihoods that the speech in the window 220 under analysis (e.g., the window w.sub.7 corresponding to acoustic feature vector v.sub.7) corresponds to specific phonetic units. [0048] The output layer 273 provides predictions or probabilities of acoustic states given the data at the input layer 271. [0053] As indicated above, each output from the neural network 270 can include a posterior probability P(s.sub.i|Y,N), representing a likelihood of a particular acoustic state s.sub.i given the current set of input data, Y,N. The resulting scaled posterior probabilities are then input to the weighted finite state transducers or speech enhancement system for further processing. [0088] A measure of error between the actual outputs of the neural network and the correct or desired outputs is determined. Note: The output layer 273 of neural network 270 (Fig. 2) simply provides probabilities that the input corresponds to specific phonetic units, based on the input to the input layer, and this output has not yet been speech-enhanced to obtain clean speech.
Therefore, the posterior probabilities at output 273 comprise all of the input speech and noise characteristics, i.e., the different background noises and speakers' speech, which still need to be enhanced to obtain clean speech.).
Krist, however, does not specifically teach: separately updating weights of the hidden layer until the deep neural network model reaches a preset condition, to obtain the voice activity detection model.
Sainath et al. teach: separately updating weights of the hidden layer until the deep neural network model reaches a preset condition, to obtain the voice activity detection model ([0130] At act 130, the features extracted using the front-end parameters may be fed through a neural network to obtain an output classification for the input speech signal. Any suitable neural network classifier may be used, such as a convolutional neural network (CNN) as described above. At act 140, an error measure may be computed for the output classification, e.g., through comparison of the output classification with a known target classification. At act 150, back propagation may be applied to adjust one or more of the front-end parameters as one or more layers of the neural network, based on the error measure. Method 100 may then loop back to act 120, at which the updated (adjusted) front-end parameters may be applied in extracting updated features from the input speech signal. As method 100 continues to iterate, the front-end parameters may continue to be adjusted, through back propagation as one or more layers of the neural network, to reduce the error in the neural network's output classification. When a suitable number of iterations have been completed, or when the error has been reduced below a suitable threshold (or when any other suitable convergence criteria have been reached), method 100 may end. At this point, the front-end feature extraction parameters may have been "learned" to fit the data classification task at hand, as part of training the neural network classifier.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Krist et al. to include the teaching of Sainath et al. above in order to reduce the error in the neural network output below a suitable threshold.


Regarding Claims 3, 10, and 17, Krist teaches: The method for establishing the voice activity detection model according to claim 1, wherein the audio feature is a fused audio feature, the fused audio feature comprising at least two independent audio features, and the independent audio features comprising the energy and at least one of a zero-crossing rate, a mean value, and a variance, and wherein the extracting the audio feature of each audio frame further comprises extracting independent audio features of each audio frame and fusing the respective independent audio features to obtain the fused audio feature ([0043] The noise-vector 250 and non-speech vectors 244 may be normalized, for example, to have a zero mean unit variance. In addition, or as an alternative, the noise-vector 250 may be projected, for example, using principal component analysis (PCA) or linear discriminant analysis (LDA). Techniques for obtaining a noise vector are described further below with respect to FIG. 3. [0052] For speech enhancement, the output of the neural network 270 can be provided to a Minimum Mean Squared Error or Maximum Posteriori based speech enhancement or source separation system, such as a high resolution speech separation system. The posterior P(s.sub.i|Y,N) can correspond to the components of the acoustic model of the speech separation system. The speech separation system can use the posterior 273 to choose which component of the acoustic model to use to reconstruct or separate the target speech from the noisy acoustic signal. The computing system 120 can use to determine the separation of the target speaker from the background noise environment for the audio signal.).

Regarding Claims 6, 13, and 20, Krist teaches: The voice activity detection method according to claim 5, wherein the inputting the audio feature into the voice activity detection model to obtain the detection result further comprises: inputting the audio feature into the voice activity detection model, to obtain a frame detection result of each audio frame of the to-be-detected audio file; and smoothing each frame detection result in the to-be-detected audio file, and obtaining a detection result of the to-be-detected audio file (See rejection of claim 1).

Regarding Claims 7 and 14, Krist teaches: The voice activity detection method according to claim 6, wherein, after the obtaining the detection result of the to-be-detected audio file, the method further comprises: determining a speech start point and a speech end point in the to-be-detected audio file according to the detection result (The limitation is well known in the art, and the examiner takes Official Notice with respect to this limitation).
Allowable Subject Matter
Claims 4, 11, and 18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Sainath et al.(US 2016/0284347 A1) teach: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing audio waveforms. In some implementations, a time-frequency feature representation is generated based on audio data. The time-frequency feature representation is input to an acoustic model comprising a trained artificial neural network. The trained artificial neural network comprising a frequency convolution layer, a memory layer, and one or more hidden layers. An output that is based on output of the trained artificial neural network is received. A transcription is provided, where the transcription is determined based on the output of the acoustic model. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOHAMMAD K ISLAM whose telephone number is (571)270-5878.  The examiner can normally be reached on Monday-Friday, EST (IFP).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached on 571-272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/MOHAMMAD K ISLAM/Primary Examiner, Art Unit 2656