DETAILED ACTION
This Office Action is in response to the correspondence filed by the applicant on 11/24/2020.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The Information Disclosure Statements (IDS) filed on 11/24/2020 have been accepted and considered in this Office Action and are in compliance with the provisions of 37 CFR 1.97.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

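Claims 1, 11, 20 and their dependent claims are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.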

Claims 1, 11, and 20 recite the limitation, “generate a command to wake up a device based on the obtained scores of the labels …”. There is insufficient antecedent basis for this limitation in the claim.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5, 10-15, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over PARADA (US 2015/0127594 A1), and further in view of CANO (US 2012/0239403 A1).

REGARDING CLAIM 1, PARADA discloses a system for providing voice recognition, comprising: 
at least one storage medium storing a set of instructions (PARADA Figs. 7 and 8; Par 107 – “computer program instructions”); and 
at least one processor configured to communicate with the at least one storage medium, wherein when executing the set of instructions (PARADA Figs. 7 and 8; Par 100 – “one or more processors”), the at least one processor is directed to: 
receive a voice signal (PARADA Fig. 3 – “Audio Waveform 202”; Par 47 – “The system 200, e.g., the speech recognition system 100, receives an audio waveform 202 and provides the audio waveform 202 to a front-end feature extraction module 204.”) including a plurality of frames of voice data (PARADA Fig. 3 – “Windowing of the Acoustic Signal 320”; Par 52 – “The front-end feature extraction module 204 may analyze the audio waveform 202 by dividing the audio waveform 202 into a plurality of windows and analyzing each of the windows, e.g., separately.”); 
determine a voice feature for each of the plurality of frames (PARADA Fig. 3 – “Feature Vectors 380”; Par 61 – “The front-end feature extraction module 204 generates a plurality of feature vectors 380 that represent acoustic features of frames from the audio waveform 202 by performing the aforementioned analytical techniques to obtain information about characteristics of the audio waveform 202 for successive time intervals.”), the voice feature being related to one or more labels (PARADA Fig. 1; Par 38 – “For example, the deep neural network 104 may be trained initially using three-thousand hours of speech, where all of the parameters of the deep neural network 104 are adjusted during training. The deep neural network 104 may then be trained using examples for each keyword, e.g., “okay” and “google,” and using negative examples, e.g., for the “filler” category, where some of the parameters of the deep neural network 104 are adjusted while others remain constant.”); 
determine one or more scores with respect to the one or more labels based on the voice feature (PARADA Fig. 4 – “Posterior Probability Vector 420”; Par 62 – “The acoustic modeling module 206, shown in FIG. 1, receives the plurality of feature vectors 380 from the front-end feature extraction module 204 and generates a corresponding posterior probability vector 420 for each of the feature vectors 380. For a particular feature vector, the corresponding posterior probability vector 420 includes a value for each of the keywords or key phrases for which the speech recognition system is trained.”); 
[sample] combining a plurality of frames in a pre-set interval (PARADA Fig. 2 – “Posterior Handling Module 208”; Fig. 5; Par 42 –“The deep neural network 104 provides the posterior probabilities to the posterior handling module 106. The posterior handling module 106 may smooth the posterior probabilities over a fixed time window of size wsmooth to remove noise from the posterior probabilities, e.g., where posterior probabilities corresponding with multiple frames are used to determine whether a keyword was included in a window.”; Par 78 – “For example, the posterior handling module 208 may average twenty posterior probability scores associated with the keyword “Google” from twenty consecutive posterior probability vectors and use the average, e.g., as a single posterior probability for a time period, to determine whether “Google” was spoken during the time period that corresponds with the twenty consecutive posterior probability vectors.”; Par 79 – “The posterior handling module 208 may use any appropriate window for the consecutive posterior probability scores. For example, the posterior handling module 208 may average the corresponding scores from eleven consecutive posterior probability vectors.”; Par 83 – “The posterior handling module 208 may move a window and/or may use windows of different sizes when determining whether a keyword or key phrase was spoken during a different portion of the audio waveform 202. 
For example, the posterior handling module 208 may look at different overlapping or non-overlapping windows and determine a combination of the posterior probability scores for the different window.”), the [sampled] combined frames corresponding to at least a part of the one or more labels according to a sequence of the one or more labels (PARADA Par 78 – “For example, the posterior handling module 208 may average twenty posterior probability scores associated with the keyword “Google” from twenty consecutive posterior probability vectors and use the average, e.g., as a single posterior probability for a time period, to determine whether “Google” was spoken during the time period that corresponds with the twenty consecutive posterior probability vectors. In this example, the posterior handling module 208 would also average the posterior probability scores for the other keywords or key phrases represented in the posterior probability vectors, such as the “Okay” keyword posterior probability scores and the filler posterior probability scores.”); 
obtain a score of a label associated with each sampled frame (PARADA Par 79 – “The posterior handling module 208 may use any appropriate window for the consecutive posterior probability scores. For example, the posterior handling module 208 may average the corresponding scores from eleven consecutive posterior probability vectors.”; Par 80 – “The posterior handling module 208 uses the combination of the posterior probability scores to determine whether the keyword or key phrase was spoken during the time window of the audio waveform 202. For example, the posterior handling module 208 determines whether the combination is greater than a predetermined threshold value and, if so, determines that the keyword or key phrase associated with the combined posterior probability scores was likely spoken during the time window of the audio waveform 202.”; Par 82 – “In some examples, when the maximum value is greater than a predetermined threshold, the posterior handling module 208 determines that the keyword or key phrase was included in the audio waveform 202.”); and 
generate a command to wake up a device based on the obtained scores of the labels associated with the sampled frames (PARADA Par 106 – “For example, the task performed by some embodiments includes detecting a single word, for example, “Google,” that will activate a device from a standby mode to perform a task. The device continuously monitors received audio waveforms for the predetermined keywords and/or key phrases.”; Par 92 – “The process determines that a phrase was present in the audio waveform (608). For example, the posterior handling module determines that a predetermined key phrase was present in the audio waveform during the overall period of time modeled by the feature vectors. The predetermined phrase includes the first word and potentially a second word that corresponds to at least another portion of the key phrase and a corresponding one of the expected event vectors.”).
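
For illustration only, the decision described in PARADA Pars 78-82 (combine the posterior scores for a keyword over a window, compare the combination with a predetermined threshold, and activate the device if the keyword is found) may be pictured with the following minimal sketch. The function names, the command strings, and the window and threshold values are hypothetical and do not appear in PARADA or in the claims.

```python
from typing import Sequence


def keyword_detected(posteriors: Sequence[float], window: int, threshold: float) -> bool:
    """Return True if the average posterior over any sliding window exceeds the threshold."""
    if window <= 0:
        raise ValueError("window must be a positive number of frames")
    # Slide a fixed-size window over the per-frame keyword posteriors.
    for start in range(len(posteriors) - window + 1):
        average = sum(posteriors[start:start + window]) / window
        if average > threshold:
            return True
    return False


def wake_command(posteriors: Sequence[float], window: int = 3, threshold: float = 0.8) -> str:
    """Map the detection result to a hypothetical device command string."""
    return "WAKE_UP" if keyword_detected(posteriors, window, threshold) else "STAY_IN_STANDBY"
```

Under these assumptions, a run of three consecutive frames with a keyword posterior of 0.9 would produce the wake-up command, while a sequence of uniformly low posteriors would not.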

PARADA does not explicitly teach the [square-bracketed] limitations and teaches the underlined features instead. PARADA teaches combining a plurality of posterior frames in order to remove noise from the posterior probabilities (Par 42). The combined score represents the plurality of scores within the window; thus, it might be viewed as “sampling a plurality of frames.” Furthermore, since the claim does not specify how “a plurality of frames” is obtained in the step of “sample a plurality of frames in a pre-set interval,” Examiner could broadly interpret the limitation as referring to the plurality of frames of voice data and rely on PARADA alone for the mapping. For example, Examiner could map the “sample a plurality of frames in a pre-set interval …” limitation to [0052] of PARADA, which teaches analysis windows with a size of 25 ms and a time period shift of 10 ms (e.g., the pre-set interval). Although Examiner could broadly map the limitations to PARADA, CANO is provided for clarity of the rejection. Examiner also suggests that Applicant specify how “a plurality of frames of voice data” differs from “a plurality of frames” in the sampling step (if they are meant to be different).
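
As an illustrative sketch of the windowing referred to in [0052] of PARADA (25 ms analysis windows advanced by a 10 ms shift), the following shows how a voice signal could be divided into overlapping frames at a fixed interval. The function name, the sample rate, and the choice to drop a trailing partial window are assumptions made for this example and are not taken from PARADA.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      window_ms: int = 25, shift_ms: int = 10) -> List[List[float]]:
    """Divide a signal into fixed-size windows that start every shift_ms milliseconds."""
    window = sample_rate * window_ms // 1000  # samples per analysis window
    shift = sample_rate * shift_ms // 1000    # samples between window starts
    if window <= 0 or shift <= 0:
        raise ValueError("window and shift must each cover at least one sample")
    frames: List[List[float]] = []
    start = 0
    # Assumption: a trailing partial window is discarded rather than padded.
    while start + window <= len(samples):
        frames.append(list(samples[start:start + window]))
        start += shift
    return frames
```

With a hypothetical 1000 Hz sample rate, 100 samples yield eight 25-sample frames whose start positions are 10 samples apart, which is one way to read “a pre-set interval” onto the 10 ms shift.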

CANO discloses a method/system for speech recognition comprising:
[sample] a plurality of frames in a pre-set interval (CANO Fig. 4; Par 36 – “FIG. 4B shows a uniform downsampling scheme when Tt=3. Under this approach, it can be observed that a significant number of frames are reduced when Tt is considerably increased. It can be shown that having just a few number of frames of final posterior vectors decreases system accuracy under constrain of minimum phoneme duration of 3-states in the Viterbi decoder. This suggests further testing this approach with minimum phoneme duration of 1-state. It is also important to mention that during training, the true labels together with the training set of intermediate posterior vectors are also downsampled, significantly reducing the training time.”), the [sampled] frames corresponding to at least a part of the one or more labels according to a sequence of the one or more labels (CANO Par 33 – “To that end, the intra-phonetic can be used as a first hierarchical step, then the posterior vectors generated used as the input to a second hierarchical step, given by the inter-phonetic approach. The aim such an arrangement is first to better classify a phoneme based on the temporal transition information within the phoneme.”); 
obtain a score of a label associated with each sampled frame (CANO Fig. 4 – “Final Posteriors”; Par 36 – “FIG. 4B shows a uniform downsampling scheme when Tt=3. Under this approach, it can be observed that a significant number of frames are reduced when Tt is considerably increased. It can be shown that having just a few number of frames of final posterior vectors decreases system accuracy under constrain of minimum phoneme duration of 3-states in the Viterbi decoder.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of PARADA to include sampling a plurality of frames, as taught by CANO.
One of ordinary skill would have been motivated to include sampling a plurality of frames, in order to significantly reduce the processing time (CANO Par 36).
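
For illustration, the uniform downsampling of CANO Par 36 can be sketched as keeping one posterior vector out of every Tt frames, with the labels downsampled together with the frames. The function name and the choice to keep the first frame of each period are assumptions for this sketch; CANO Fig. 4B may select a different frame within each period.

```python
from typing import List, Sequence, Tuple


def uniform_downsample(posteriors: Sequence[Sequence[float]], labels: Sequence[str],
                       period: int = 3) -> Tuple[List[Sequence[float]], List[str]]:
    """Keep every period-th posterior vector and its label (period plays the role of Tt)."""
    if period <= 0:
        raise ValueError("period must be positive")
    if len(posteriors) != len(labels):
        raise ValueError("each frame needs exactly one label")
    # Assumption: the first frame of each period is the one retained.
    kept = range(0, len(posteriors), period)
    return [posteriors[i] for i in kept], [labels[i] for i in kept]
```

With a period of 3, seven frames reduce to three, which reflects the reduction in the number of frames that CANO describes.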


REGARDING CLAIM 2, PARADA in view of CANO discloses the system of claim 1.
PARADA further discloses wherein the at least one processor is further directed to: 
perform a smoothing operation on the one or more scores of the one or more labels for each of the plurality of frames (PARADA Par 42 – “The deep neural network 104 provides the posterior probabilities to the posterior handling module 106. The posterior handling module 106 may smooth the posterior probabilities over a fixed time window of size wsmooth to remove noise from the posterior probabilities, e.g., where posterior probabilities corresponding with multiple frames are used to determine whether a keyword was included in a window.”).

REGARDING CLAIM 3, PARADA in view of CANO discloses the system of claim 2.
PARADA further discloses wherein to perform a smoothing operation on one or more scores of one or more labels for each of the plurality of frames (PARADA Par 42 – “The deep neural network 104 provides the posterior probabilities to the posterior handling module 106. The posterior handling module 106 may smooth the posterior probabilities over a fixed time window of size wsmooth to remove noise from the posterior probabilities, e.g., where posterior probabilities corresponding with multiple frames are used to determine whether a keyword was included in a window.”), the at least one processor is directed to: 
determine a smoothing window with respect to a current frame (PARADA Par 42 – “The posterior handling module 106 may smooth the posterior probabilities over a fixed time window of size wsmooth to remove noise from the posterior probabilities, e.g., where posterior probabilities corresponding with multiple frames are used to determine whether a keyword was included in a window.”); 
determine at least one frame in the smoothing window associated with the current frame (PARADA Par 42 – “For example, to generate a smoothed posterior probability p′ij from the posterior probability pij , for the ith output category and the jth frame xj, where the values of i are between 0 and n−1, with n the number of total categories, the posterior handling module 106 may use Equation (2) below. …”; Par 43 – “In Equation (2), hsmooth=max {1, j−wsmooth+1} is the index of the first frame within the smoothing window. In some implementations, wsmooth=30 frames.”); 
determine scores of the one or more labels for the at least one frame (PARADA Par 42 – “For example, to generate a smoothed posterior probability p′ij from the posterior probability pij , for the ith output category and the jth frame xj, where the values of i are between 0 and n−1, with n the number of total categories, the posterior handling module 106 may use Equation (2) below. …”); 
determine an average score of each of the one or more labels for the current frame based on the scores of the one or more labels for the at least one frame (PARADA Par 42 Equation (2) -- $p'_{ij} = \frac{1}{j - h_{smooth} + 1} \sum_{k = h_{smooth}}^{j} p_{ik}$; Note that p′ij corresponds to the summation of the posterior probabilities over the smoothing window wsmooth, divided by the length of the window; thus, p′ij is the average of pij over the window. Par 78 – “For example, the posterior handling module 208 may average twenty posterior probability scores associated with the keyword “Google” from twenty consecutive posterior probability vectors and use the average, e.g., as a single posterior probability for a time period, to determine whether “Google” was spoken during the time period that corresponds with the twenty consecutive posterior probability vectors.”); and 
designate the average score of each of the one or more labels for the current frame as the score of each of the one or more labels for the current frame (PARADA Par 78 – “In some implementations, the posterior handling module 208 combines corresponding posterior probability scores from multiple posterior probability vectors 420 to determine whether a keyword or key phrase was uttered during a time window of the audio waveform 202. For example, the posterior handling module 208 may average twenty posterior probability scores associated with the keyword “Google” from twenty consecutive posterior probability vectors and use the average, e.g., as a single posterior probability for a time period, to determine whether “Google” was spoken during the time period that corresponds with the twenty consecutive posterior probability vectors. In this example, the posterior handling module 208 would also average the posterior probability scores for the other keywords or key phrases represented in the posterior probability vectors, such as the “Okay” keyword posterior probability scores and the filler posterior probability scores.”).
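
For illustration, the smoothing relied upon for claim 3 can be sketched as follows, reading Equation (2) of PARADA as the Examiner describes it: the sum of the posteriors from the first frame in the smoothing window through the current frame, divided by the number of frames summed. The function name is hypothetical, and the sketch uses 0-based frame indices, so the first frame of the window is max(0, j − w_smooth + 1) rather than the 1-based expression quoted from Par 43.

```python
from typing import List, Sequence


def smooth_posteriors(posteriors: Sequence[Sequence[float]],
                      w_smooth: int = 30) -> List[List[float]]:
    """Replace each frame's per-label posteriors with their trailing-window average."""
    if w_smooth <= 0:
        raise ValueError("w_smooth must be positive")
    smoothed: List[List[float]] = []
    for j in range(len(posteriors)):
        h = max(0, j - w_smooth + 1)  # first frame in the smoothing window (0-based)
        count = j - h + 1             # number of frames actually averaged
        smoothed.append([
            sum(posteriors[k][i] for k in range(h, j + 1)) / count
            for i in range(len(posteriors[j]))
        ])
    return smoothed
```

Each smoothed value then serves as the score of the corresponding label for the current frame; early frames are averaged over fewer than w_smooth frames because the window cannot extend before the start of the signal.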

REGARDING CLAIM 4, PARADA in view of CANO discloses the system of claim 1.
PARADA further discloses wherein the one or more labels relate to a wake-up phrase for waking up the device, and the wake-up phrase includes at least one word (PARADA Par 106 – “For example, the task performed by some embodiments includes detecting a single word, for example, “Google,” that will activate a device from a standby mode to perform a task. The device continuously monitors received audio waveforms for the predetermined keywords and/or key phrases.”; Par 92 – “The process determines that a phrase was present in the audio waveform (608). For example, the posterior handling module determines that a predetermined key phrase was present in the audio waveform during the overall period of time modeled by the feature vectors. The predetermined phrase includes the first word and potentially a second word that corresponds to at least another portion of the key phrase and a corresponding one of the expected event vectors.”; Par 64 – “For example, the neural network 410 may receive a training set of two expected event vectors for the keywords “Okay” and “Google” or one expected event vectors for the key phrase “Okay Google”.”).

REGARDING CLAIM 5, PARADA in view of CANO discloses the system of claim 1, wherein to determine one or more scores with respect to the one or more labels based on the one or more voice features (PARADA Fig. 4 – “Posterior Probability Vector 420”; Par 62 – “The acoustic modeling module 206, shown in FIG. 1, receives the plurality of feature vectors 380 from the front-end feature extraction module 204 and generates a corresponding posterior probability vector 420 for each of the feature vectors 380. For a particular feature vector, the corresponding posterior probability vector 420 includes a value for each of the keywords or key phrases for which the speech recognition system is trained.”), the at least one processor is directed to: 
determine a neural network model (PARADA Fig. 4 – “Neural Network 410”; Par 64 – “The acoustic modeling module 206 is trained to determine whether a stack of feature vectors matches a keyword or key phrase. For example, the neural network 410 may receive a training set of two expected event vectors for the keywords “Okay” and “Google” or one expected event vectors for the key phrase “Okay Google”. As discussed above, the neural network 410 is trained with a first, general training set and a second, specific training set, e.g., where the second training set includes the expected event vectors for the keywords “Okay” and “Google” or the one expected event vector for the key phrase “Okay Google”.”); 
input the one or more voice features corresponding to the plurality of frames into the neural network model (PARADA Fig. 4 – “Feature Vectors 380 -> Neural Network 410”; Par 62 – “FIG. 4 is a block diagram of an example system 400 for determining a posterior probability vector. The acoustic modeling module 206, shown in FIG. 1, receives the plurality of feature vectors 380 from the front-end feature extraction module 204 and generates a corresponding posterior probability vector 420 for each of the feature vectors 380.”; Par 63 – “The acoustic modeling module 206 includes a neural network 410, such as the deep neural network 104 described with reference to FIG. 1, that generates the corresponding set of posterior probability vectors 420, where each of the posterior probability vectors 420 corresponds with one of the feature vectors 380.”); and 
generate one or more scores with respect to the one or more labels for each of the one or more voice features (PARADA Fig. 4 – “Posterior Probability Vector 420”; Par 63 – “The acoustic modeling module 206 includes a neural network 410, such as the deep neural network 104 described with reference to FIG. 1, that generates the corresponding set of posterior probability vectors 420, where each of the posterior probability vectors 420 corresponds with one of the feature vectors 380.”).


REGARDING CLAIM 10, PARADA in view of CANO discloses the system of claim 1.
PARADA further discloses wherein to determine one or more voice features for each of the plurality of frames (PARADA Fig. 3 – “Feature Vectors 380”; Par 61 – “The front-end feature extraction module 204 generates a plurality of feature vectors 380 that represent acoustic features of frames from the audio waveform 202 by performing the aforementioned analytical techniques to obtain information about characteristics of the audio waveform 202 for successive time intervals.”), the at least one processor is directed to: 
transform the voice signal from a time domain to a frequency domain (PARADA Fig. 3 – “Fast Fourier Transform 330”; Par 56 – “After windowing, the front-end feature extraction module 204 may perform a Fast Fourier transform 330 on the windowed data to analyze the constituent frequencies present in the audio waveform.”); and 
discretize the transformed voice signal to obtain the one or more voice features corresponding to the plurality of frames (PARADA Fig 3 Steps 330->380; Par 28 – “In some implementations, the feature extraction module 102 analyzes only the portions of a digital representation of speech that are determined to include speech to reduce computation.”; Par 58 – “The front-end feature extraction module 204 may perform filter bank extraction 350 to separate individual components of the audio data from one another. Each of the individual components generated during filter bank extraction 350 may carry a single frequency sub-band of the audio waveform 202 or the windowed data.”; Par 61 – “The front-end feature extraction module 204 generates a plurality of feature vectors 380 that represent acoustic features of frames from the audio waveform 202 by performing the aforementioned analytical techniques to obtain information about characteristics of the audio waveform 202 for successive time intervals.”).
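
For illustration, the two steps mapped above (a time-to-frequency transform of each windowed frame, followed by separation into frequency sub-bands) can be sketched as follows. The sketch uses a direct discrete Fourier transform in place of the Fast Fourier Transform 330 and equal-width bands in place of whatever filter bank PARADA actually employs; both substitutions, and all function names, are assumptions made for brevity.

```python
import cmath
from typing import List, Sequence


def dft_magnitudes(frame: Sequence[float]) -> List[float]:
    """Magnitude spectrum of one frame for bins 0..N/2 (direct DFT, not an FFT)."""
    n = len(frame)
    if n == 0:
        raise ValueError("frame must not be empty")
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame))
        magnitudes.append(abs(total))
    return magnitudes


def band_energies(magnitudes: Sequence[float], num_bands: int) -> List[float]:
    """Sum squared magnitudes within equal-width bands (hypothetical band layout)."""
    if not 0 < num_bands <= len(magnitudes):
        raise ValueError("num_bands must be between 1 and the number of bins")
    edges = [len(magnitudes) * b // num_bands for b in range(num_bands + 1)]
    return [sum(m * m for m in magnitudes[edges[b]:edges[b + 1]])
            for b in range(num_bands)]
```

Each frame is thereby reduced to a short vector of discrete per-band values, which is one way to read “discretize the transformed voice signal” onto the filter bank extraction of Par 58.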


REGARDING CLAIM 11, PARADA in view of CANO discloses a method for providing voice recognition implemented on a computing device having one or more processors and one or more storage devices (PARADA Figs. 7 and 8; Par 100 – “one or more processors”; Par 101 – “memory storage”), the method comprising: performing the steps of Claim 1; thus, it is rejected under the same rationale.


CLAIM 12 is similar to the system of claim 2; thus, it is rejected under the same rationale.

CLAIM 13 is similar to the system of claim 3; thus, it is rejected under the same rationale.

CLAIM 14 is similar to the system of claim 4; thus, it is rejected under the same rationale.

CLAIM 15 is similar to the system of claim 5; thus, it is rejected under the same rationale.

CLAIM 19 is similar to the system of claim 10; thus, it is rejected under the same rationale.

REGARDING CLAIM 20, PARADA in view of CANO discloses a non-transitory computer readable medium, comprising at least one set of instructions for providing voice recognition (PARADA Figs. 7 and 8; Par 107 – “computer program instructions”), wherein when executed by one or more processors of a computing device (PARADA Figs. 7 and 8; Par 100 – “one or more processors”), the at least one set of instructions causes the computing device to perform a method, the method comprising: performing the steps of Claim 1; thus, it is rejected under the same rationale.

Allowable Subject Matter
Claims 6-9 and 16-18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. Although the claims are allowable over the prior art, Examiner suggests, to better explain the inventive subject matter, amending Claims 6 and 16 to specify how the determined searching window and the number of frames are tied to the “sampling the plurality of frames” step. The submitted drawings (Fig. 9) indicate that the “sampling the plurality of frames with the determined number of frames in the determined searching window” 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN C. KIM whose telephone number is (571)272-3327.  The examiner can normally be reached on Monday to Friday 9:00 AM thru 5:30 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access 

/JONATHAN C KIM/Primary Examiner, Art Unit 2659