Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 03/10/2022 has been entered.
 
EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in an interview with applicant’s representative, Janvi Shah, on 03/14/2022.
Claims 1, 8, and 15 of the application have been amended as follows:

1. (currently amended): A method for establishing a voice activity detection model, the method being performed by an execution device, the method comprising: obtaining a training audio file and a target result of the training audio file; framing the training audio file to obtain an audio frame; extracting an audio feature of the audio frame, the audio feature comprising at least two types of features, and one of the at least two types of features comprising an energy; inputting the extracted audio feature as an input to a deep neural network model; performing information processing on the audio feature through a hidden layer of the deep neural network model, and outputting the processed audio feature through an output layer of the deep neural network model, to obtain a training result; determining a bias between the training result and the target result, and inputting the bias as an input to an error back propagation mechanism; and separately updating weights of the hidden layer until the deep neural network model reaches a preset condition, to obtain the voice activity detection model, wherein the target result comprises at least one of at least two speech categories and at least two noise categories, wherein the audio feature of the audio frame is an extended frame audio feature, the extended frame audio feature comprising at least one of a single frame audio feature of a current frame, and a first single frame audio feature of a first preset quantity of frames before the current frame, or a single frame audio feature of a current frame, and a second single frame audio feature of a second preset quantity of frames after the current frame, and wherein the extracting the audio feature of each audio frame further comprises: extracting the single frame audio feature of each audio frame; setting the at least one of the single frame audio feature of the current frame, and the first single frame audio feature of the first preset quantity of frames before the current frame, or the single frame audio feature of the current frame, and the second single frame audio feature of the second preset quantity of frames after the current frame, as the extended frame audio feature of the current frame; and separately using each audio frame as the current frame to obtain an extended audio feature of each audio frame.
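
For illustration only, the framing, single frame feature extraction, and frame extension steps recited in claim 1 might be sketched as follows. The sketch is not part of the amended claims. The function names, the choice of zero-crossing rate as the second feature type, the frame length and hop, and the clamping of edge frames are all assumptions made for the example; the claim itself specifies only that one feature type comprises an energy and that preset quantities of preceding and/or following frames are used.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split samples into fixed-length frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def single_frame_feature(frame):
    """Two feature types per frame: log energy and zero-crossing rate (assumed)."""
    energy = sum(x * x for x in frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return [log_energy, zcr]

def extend_features(features, before, after):
    """Build the extended feature of every frame by concatenating the single
    frame features of `before` preceding frames, the current frame, and
    `after` following frames. Edge frames are clamped (an assumption)."""
    last = len(features) - 1
    extended = []
    for t in range(len(features)):  # each frame in turn is the current frame
        row = []
        for k in range(t - before, t + after + 1):
            row.extend(features[min(max(k, 0), last)])
        extended.append(row)
    return extended

# Synthetic stand-in for a training audio file.
samples = [math.sin(0.3 * n) for n in range(160)]
frames = frame_signal(samples, frame_len=40, hop=20)
single = [single_frame_feature(f) for f in frames]
extended = extend_features(single, before=2, after=1)
```

With `before=2` and `after=1`, each extended feature holds four single frame features, so its length is four times that of a single frame feature. Setting `after=0` or `before=0` yields the "before only" or "after only" alternatives.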

8. (currently amended): A computer device, comprising: at least one memory configured to store computer program code; and at least one processor configured to access the computer program code and operate as instructed by the computer program code, the computer program code comprising: file obtaining code configured to cause the at least one processor to obtain a training audio file; result obtaining code configured to cause the at least one processor to obtain a target result of the training audio file; framing code configured to cause the at least one processor to frame the training audio file to obtain an audio frame; extraction code configured to cause the at least one processor to extract an audio feature of the audio frame, the audio feature comprising at least two types of features, and one of the at least two types of features comprising an energy; inputting code configured to cause the at least one processor to input the audio feature as an input to a deep neural network model, and perform information processing on the audio feature through a hidden layer of the deep neural network model; outputting code configured to cause the at least one processor to output the processed audio feature through an output layer of the deep neural network model, to obtain a training result; and update and optimizing code configured to cause the at least one processor to determine a bias between the training result and the target result as an input to an error back propagation mechanism, and separately update weights of the hidden layer until the deep neural network model reaches a preset condition, to obtain a voice activity detection model, wherein the target result comprises at least one of at least two speech categories and at least two noise categories, and wherein the audio feature of the audio frame is an extended frame audio feature, the extended frame audio feature comprising at least one of a single frame audio feature of a current frame, and a first single frame audio feature of a first preset quantity of frames before the current frame, or a single frame audio feature of a current frame, and a second single frame audio feature of a second preset quantity of frames after the current frame, and wherein the computer device further comprises: single frame feature extraction code configured to cause the at least one processor to extract the single frame audio feature of each audio frame; and audio frame extension code configured to cause the at least one processor to set the at least one of the single frame audio feature of the current frame, and the first single frame audio feature of the first preset quantity of frames before the current frame, or the single frame audio feature of the current frame, and the second single frame audio feature of the second preset quantity of frames after the current frame, as the extended frame audio feature of the current frame, and separately use each audio frame as the current frame to obtain an extended audio feature of each audio frame.

15. (currently amended): A non-transitory computer-readable storage medium, storing executable instructions, the executable instructions capable of causing a computer to: obtain a training audio file and a target result of the training audio file; frame the training audio file to obtain an audio frame; extract an audio feature of the audio frame, the audio feature comprising at least two types of features, and one of the at least two types of features comprising an energy; input the audio feature as an input to a deep neural network model, performing information processing on the audio feature through a hidden layer of the deep neural network model, and output the processed audio feature through an output layer of the deep neural network model, to obtain a training result; and determine a bias between the training result and the target result as an input to an error back propagation mechanism, and separately update weights of the hidden layer until the deep neural network model reaches a preset condition, to obtain a voice activity detection model, wherein the target result comprises at least one of at least two speech categories and at least two noise categories, and wherein the audio feature of the audio frame is an extended frame audio feature, the extended frame audio feature comprising at least one of a single frame audio feature of a current frame, and a first single frame audio feature of a first preset quantity of frames before the current frame, or a single frame audio feature of a current frame, and a second single frame audio feature of a second preset quantity of frames after the current frame, and wherein the executable instructions are further capable of causing the computer to: extract the single frame audio feature of each audio frame; set the at least one of the single frame audio feature of the current frame, and the first single frame audio feature of the first preset quantity of frames before the current frame, or the single frame audio feature of the current frame, and the second single frame audio feature of the second preset quantity of frames after the current frame, as the extended frame audio feature of the current frame; and separately using each audio frame as the current frame to obtain an extended audio feature of each audio frame.
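
For illustration only, the training steps common to claims 1, 8, and 15 (hidden layer processing, output layer, determining the bias between the training result and the target result, error back propagation, and weight updates until a preset condition is reached) might be sketched as below. The sketch is not part of the amended claims. The single tanh hidden layer, softmax output, cross-entropy loss, learning rate, stopping threshold, toy feature vectors, and the four category labels (two speech, two noise) are all assumptions chosen to keep the example small; here "bias" is read as the output error, and the network carries no additive bias parameters.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(x, W1, W2):
    """Hidden layer (tanh) followed by a softmax output layer."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    p = softmax([sum(w * hi for w, hi in zip(row, h)) for row in W2])
    return h, p

def train(data, n_in, n_hidden, n_out, lr=0.1, max_epochs=500, tol=0.05, seed=0):
    rng = random.Random(seed)
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    losses = []
    for _ in range(max_epochs):
        total = 0.0
        for x, target in data:
            h, p = forward(x, W1, W2)
            total += -math.log(p[target] + 1e-12)
            # Error between the training result and the target result.
            d_out = [p[k] - (1.0 if k == target else 0.0) for k in range(n_out)]
            # Propagate the error back through the output weights (tanh derivative).
            d_hid = [(1 - h[j] ** 2) * sum(d_out[k] * W2[k][j] for k in range(n_out))
                     for j in range(n_hidden)]
            for k in range(n_out):
                for j in range(n_hidden):
                    W2[k][j] -= lr * d_out[k] * h[j]
            for j in range(n_hidden):
                for i in range(n_in):
                    W1[j][i] -= lr * d_hid[j] * x[i]
        losses.append(total / len(data))
        if losses[-1] < tol:  # assumed preset condition: mean loss below tol
            break
    return W1, W2, losses

# Toy audio features with hypothetical labels:
# 0, 1 = two speech categories; 2, 3 = two noise categories.
data = [([1.0, 1.0], 0), ([1.0, -1.0], 1), ([-1.0, 1.0], 2), ([-1.0, -1.0], 3)]
W1, W2, losses = train(data, n_in=2, n_hidden=4, n_out=4)
```

In practice the input vectors would be the extended frame audio features rather than the two-element toy vectors used here.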
Allowable Subject Matter
Claims 1, 3, 5-8, 10, 12-15, 17, and 19-20 are allowed.

The following is an examiner’s statement of reasons for allowance: The prior art of record, Parthasarathi et al. (US 2017/0270919 A1), teaches: The audio feature vectors may be processed by an RNN encoder to create an encoded reference feature vector, which by virtue of the RNN encoding represents the entire reference audio data from the first audio feature vector to the last audio feature vector in a single feature vector. The RNN encoder may be configured to process a first input audio feature vector first, or may be configured to process input audio feature vectors in a reverse order, depending on system configuration.
The prior art of record, Dimitriadis et al. (US 2015/0058004 A1), teaches: The classifiers can operate on audio and video from a same frame of the video, or can operate on data covering different timespans, which may or may not overlap. For example, when detecting whether a frame contains a voice, a first classifier can operate on audio of a current frame, a second classifier can operate on audio of the current frame and a previous frame, while a third classifier can operate on video of five previous frames and can exclude the current frame. The original features can be associated with a video frame. For example, the features can be extracted from a single frame, or from multiple frames. While one scenario is based on instantaneous computations, in other scenarios the system can aggregate information across multiple frames, such as all frames within a 2 second window, where that window can include samples forward or backward in time, relative to the classified sample.
The prior art of record, alone or in combination, fails to teach the limitation of claims 1, 8, and 15, “wherein the audio feature of the audio frame is an extended frame audio feature, the extended frame audio feature comprising at least one of a single frame audio feature of a current frame, and a first single frame audio feature of a first preset quantity of frames before the current frame, or a single frame audio feature of a current frame, and a second single frame audio feature of a second preset quantity of frames after the current frame, and wherein the extracting the audio feature of each audio frame further comprises: extracting the single frame audio feature of each audio frame; setting the at least one of the single frame audio feature of the current frame, and the first single frame audio feature of the first preset quantity of frames before the current frame, or the single frame audio feature of the current frame, and the second single frame audio feature of the second preset quantity of frames after the current frame, as the extended frame audio feature of the current frame”.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOHAMMAD K ISLAM, whose telephone number is (571)270-5878. The examiner can normally be reached Monday-Friday, EST (IFP).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MOHAMMAD K ISLAM/ Primary Examiner, Art Unit 2656