DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 4, 5, 7 – 14, 16, 17, 19, 20, 22, 23, 26, 28, 40, 42 are rejected under 35 U.S.C. 103 as being unpatentable over Sun et al. (Compressed time delay neural network for small-footprint keyword spotting, August, 2017) in view of Bocklet et al. (US PAP 2018/0182388).  
As per claims 1, 22, Sun et al. teach a method for keyword spotting in an electronic device, the method comprising: 
obtaining an acoustic signal comprising speech (page 1, Introduction);
providing an acoustic signal representation of the acoustic signal to a neural network executed by a processor (“DNNs or TDNNs are used as acoustic models for our HMM-based keyword spotter.”; page 2, section 2); and 
predicting from the neural network a presence of at least one of a plurality of keywords or absence of any of the plurality of keywords in the acoustic signal (“During decoding, framewise posteriors for keyword are smoothed within a sliding window. The system fires when smoothed keyword posteriors exceed a pre-defined threshold.”; page 1, col.2, paragraph 2).
However, Sun et al. do not specifically teach transitioning from a low power processing state to a high power processing state as needed, for any additional processing, when the presence of any of the plurality of keywords in the acoustic signal is detected.
Bocklet et al. disclose that key phrase or hot word detection systems may be used to detect a word or phrase or the like, which may initiate an activity by a device such as waking the device from a low power or sleep mode to an active mode based on detection of the key phrase (paragraph 23).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to transition from a low power state to a high power state, as taught by Bocklet et al., in the system of Sun et al., because that would help improve keyword spotting performance (Sun et al., page 3, section 3.4).
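
For illustration only, the combination relied on above can be sketched as a small state machine. The names `PowerState` and `Device` and the threshold value of 0.5 are hypothetical and appear in neither reference; the sketch merely assumes a detector that runs in a low power state and transitions to a high power state when a keyword posterior exceeds a threshold.

```python
from enum import Enum


class PowerState(Enum):
    LOW = "low"
    HIGH = "high"


class Device:
    """Hypothetical device that wakes when a keyword is detected."""

    def __init__(self, threshold=0.5):
        self.state = PowerState.LOW
        self.threshold = threshold  # assumed value, for illustration only

    def on_posterior(self, keyword_posterior):
        # Remain in the low power state until the detector fires.
        if self.state is PowerState.LOW and keyword_posterior > self.threshold:
            self.state = PowerState.HIGH
        return self.state
```

In this sketch the device stays awake once woken; how and when a real device returns to the low power state is not addressed here.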

As per claims 2, 23, Bocklet et al. in view of Sun et al. further disclose the acoustic signal representation comprises a feature domain representation obtained by preprocessing the acoustic signal or the acoustic signal representation is a waveform representation (“an acoustic model such as a deep neural network or the like may be scored to generate the multiple element acoustic score vector such that the multiple element acoustic score vector includes a score for a single state rejection model and scores for one or more multiple state key phrase models such that each multiple state key phrase model corresponds to a predetermined key phrase.”; Bocklet et al., paragraphs 25, 33).

As per claim 4, Bocklet et al. further disclose the acoustic signal representation is a waveform representation (paragraph 33).

As per claims 5, 26, Bocklet et al. in view of Sun et al. further disclose the neural network is a time delayed neural network (TDNN) that produces a sequence of keyword posteriors (“given the input acoustic features of each frame, the output layer of DNN/TDNN models its posterior distribution over the HMM states for both keyword and background/filler models”; Sun et al., page 2, col.1, section 2).

As per claims 7, 28, Bocklet et al. in view of Sun et al. further disclose predicting the presence or absence of keywords comprises determining if a posterior value for any of the plurality of keywords exceeds a threshold value, and if the posterior value of a respective keyword exceeds the threshold value predicting the presence of the respective keyword in the audio signal (“During decoding, framewise posteriors for keyword are smoothed within a sliding window. The system fires when smoothed keyword posteriors exceed a pre-defined threshold.”; Bocklet et al. paragraphs 48, 49; Sun et al. page 1, col.2, paragraph 2).
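
For illustration only, the quoted decoding step (framewise posteriors smoothed within a sliding window, firing when the smoothed value exceeds a threshold) can be sketched as follows. The function name `detect_keyword`, the moving-average form of smoothing, the window length of 3, and the threshold of 0.6 are assumptions and are not taken from either reference.

```python
from collections import deque


def detect_keyword(posteriors, window=3, threshold=0.6):
    """Return the index of the first frame at which the smoothed
    posterior exceeds the threshold, or None if it never does."""
    recent = deque(maxlen=window)  # sliding window of the latest posteriors
    for i, p in enumerate(posteriors):
        recent.append(p)
        smoothed = sum(recent) / len(recent)  # moving average (assumed)
        if smoothed > threshold:
            return i
    return None
```

Under these assumptions, `detect_keyword([0.1, 0.2, 0.9, 0.9, 0.9])` returns 3, while the isolated spike in `[0.1, 0.9, 0.1, 0.1]` is smoothed away and the function returns `None`.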

As per claim 8, Bocklet et al. further disclose a plurality of different threshold values are used for the plurality of keywords (paragraphs 49, 51).
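
For illustration only, the use of a different threshold value for each keyword can be sketched as below. The function name `detect_keywords`, the example keywords, and the numeric values are hypothetical and are not taken from Bocklet et al.

```python
def detect_keywords(posteriors, thresholds):
    """Return the keywords whose posterior exceeds that keyword's own
    threshold. An empty list means no keyword is predicted present."""
    return [kw for kw, p in posteriors.items() if p > thresholds[kw]]


# Hypothetical example: identical posteriors, different thresholds.
scores = {"play": 0.7, "stop": 0.7}
limits = {"play": 0.6, "stop": 0.8}
detected = detect_keywords(scores, limits)  # ["play"]
```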

As per claim 9, Bocklet et al. in view of Sun et al. further disclose the TDNN uses one or more sets of layers to learn phone and keyword targets (“our TDNN architecture, its input layer processes a narrow context window of 5 consecutive frames (l = 2, r = 2). This is labeled as [−2, 2]. Our training recipe starts with the approach described in [10] with transfer learning and multi-task learning, but we replace the DNN with a temporal connection sub-sampled TDNN”; Sun et al., page 2 col.2, paragraph 1).
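
For illustration only, the quoted input context window of 5 consecutive frames (l = 2, r = 2) can be sketched as a frame-splicing step. The function name `splice` and the repetition of the first and last frames at the edges are assumptions; Sun et al. are not cited here for any particular edge handling.

```python
def splice(frames, left=2, right=2):
    """Concatenate each frame with `left` past and `right` future frames."""
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), last)  # edge padding is an assumption
            window.extend(frames[idx])
        spliced.append(window)
    return spliced
```

With one-dimensional frames `[[0], [1], [2], [3], [4]]`, the spliced vector for the middle frame is `[0, 1, 2, 3, 4]`, that is, the frame itself plus two frames of context on each side.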

As per claim 10, Bocklet et al. in view of Sun et al. further disclose a first set of layers is initialized by using transfer learning on a related large vocabulary speech recognition task (“Transfer learning is a widely used approach in machine learning, which transfers the knowledge learned from a related task to improve the main training task [26, 27, 28, 29]. As an application example of transfer learning on neural networks, the hidden layers of a trained network can be initialized from another network of the same size trained for a related task [4]. For our case, an LVCSR TDNN with the same architecture is trained to initialize the keyword spotting TDNN”; Sun et al., page 2, section 3.2).
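
For illustration only, initializing the hidden layers of one network from another network of the same architecture can be sketched as a weight copy. Representing a network as a dictionary of layer names to weight lists, and the names `init_from_pretrained`, `h1`, and `out`, are hypothetical simplifications and are not taken from Sun et al.

```python
import copy


def init_from_pretrained(pretrained, target, hidden_layers):
    """Return a copy of `target` whose named hidden layers are replaced
    by the corresponding weights of `pretrained`."""
    initialized = copy.deepcopy(target)
    for name in hidden_layers:
        initialized[name] = copy.deepcopy(pretrained[name])
    return initialized


# Hypothetical toy weights: only the hidden layer is transferred.
lvcsr = {"h1": [[1.0]], "out": [[9.0]]}
kws = {"h1": [[0.0]], "out": [[0.5]]}
kws_init = init_from_pretrained(lvcsr, kws, ["h1"])
```

In this sketch the output layer keeps its own weights, since the keyword spotting targets differ from those of the large vocabulary task.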

As per claim 11, Bocklet et al. in view of Sun et al. further disclose reducing a number of multiplications per second performed during inference of a model of the neural network using dynamic programming (“When a bottleneck layer of R nodes is added between, with R properly chosen, the number of multiplications could be reduced from M × N to (M + N) R.”; Sun et al., page 3, section 3.5).
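
For illustration only, the arithmetic in the quoted passage can be worked through as follows. The layer sizes of 512 and the bottleneck size of 64 are hypothetical values chosen for the example and do not come from Sun et al.

```python
def mults_direct(m, n):
    """Multiplications for a direct connection between layers of m and n nodes."""
    return m * n


def mults_bottleneck(m, n, r):
    """Multiplications when a bottleneck layer of r nodes sits between them."""
    return (m + n) * r


def bottleneck_saves(m, n, r):
    """True only when the bottleneck actually reduces the count."""
    return mults_bottleneck(m, n, r) < mults_direct(m, n)


# Hypothetical sizes: 512 x 512 = 262144 versus (512 + 512) x 64 = 65536.
direct = mults_direct(512, 512)
reduced = mults_bottleneck(512, 512, 64)
```

The reduction holds only when R is smaller than M × N / (M + N), which is what "with R properly chosen" implies; for the sizes above the break-even point is R = 256.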

As per claim 12, Bocklet et al. in view of Sun et al. further disclose a total number of multiplications per second performed during inference of a model of the neural network is reduced by frame skipping (Sun et al., page 3, section 3.5; page 2, col.1, paragraph 3).
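
For illustration only, frame skipping can be sketched as evaluating the network on every `skip`-th frame and reusing the most recent output for the frames in between. The function name, the reuse of the previous output, and the skip factor of 2 are assumptions; the cited passages are not relied on for this particular scheme.

```python
def posteriors_with_frame_skipping(frames, model, skip=2):
    """Run `model` on every `skip`-th frame (skip >= 1) and reuse its
    output for the skipped frames. Returns (outputs, evaluations)."""
    outputs = []
    evaluations = 0
    last = None
    for i, frame in enumerate(frames):
        if i % skip == 0:
            last = model(frame)  # the only place the network is evaluated
            evaluations += 1
        outputs.append(last)
    return outputs, evaluations
```

With five frames and a skip factor of 2, the model is evaluated three times instead of five, so the multiplications per second fall roughly in proportion to the skip factor.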

As per claim 13, Bocklet et al. in view of Sun et al. further disclose a voice activity detection (VAD) system is used to minimize computation by the TDNN network, wherein the VAD system only sends the audio signal representation to the TDNN when speech is detected in the background (Bocklet et al. paragraph 31; Sun et al. page 2, col.1, paragraph 3).
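
For illustration only, gating the network input with a voice activity detector can be sketched with a simple energy test. The energy-based criterion, the function names, and the threshold of 0.01 are hypothetical; neither reference is cited here for a specific VAD algorithm. Frames are assumed to be non-empty lists of samples.

```python
def frame_energy(frame):
    """Mean squared amplitude of one non-empty frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def gate_frames(frames, energy_threshold=0.01):
    """Keep only the frames that the energy test classifies as speech,
    so that the downstream network never processes the others."""
    return [f for f in frames if frame_energy(f) > energy_threshold]
```

For example, of the frames `[[0.0, 0.0], [0.5, -0.5], [0.01, 0.0]]` only `[0.5, -0.5]` has enough energy to be forwarded under these assumed values.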

As per claim 14, Bocklet et al. in view of Sun et al. further disclose recording a user query which follows keyword detection and recording it for further decoding wherein start and end times of the keyword are found in the acoustic signal (“The multiple element state score vector for the current time instance may then be evaluated to determine whether a key phrase has been detected. If a single key phrase model is provided, the current state score for the single state rejection model and a final state score for the multiple state key phrase model may be evaluated to determine whether the received audio input is associated with the predetermined key phrase corresponding to the multiple state key phrase model.”; Bocklet et al., paragraph 28).

As per claim 16, Bocklet et al. in view of Sun et al. further disclose a second neural network is used for second stage decoding, comprising one or more of: a bidirectional GRU RNN model to produce a phone posteriorgram; a histogram of acoustic correlations (HAC) to produce a fixed-length vector from the phone posteriorgram; and a fully-connected network to produce keyword probabilities from the fixed-length vector (“feature vectors 212 may be provided to acoustic scoring module 203. Acoustic scoring module 203 may score feature vectors 212 based on acoustic model 208 as received via memory and provide any number of output scores 214 based on feature vectors 212. Output scores 214 may be characterized as scores, probabilities, scores of sub-phonetic units, probability density function scores, or the like.”; Bocklet et al. paragraph 39).

As per claim 17, Bocklet et al. in view of Sun et al. further disclose training data for the neural network is produced by concatenating recordings of commands and user queries at different volume levels and mixing with different types of noise (“Data from a voice conversation may be, for example, speech information, silence periods, background noise, comfort noise, tones and so forth”; Bocklet et al. paragraph 113).

As per claims 19, 40, Bocklet et al. in view of Sun et al. further disclose upon predicting from the neural network the presence of at least one of the plurality of keywords in the acoustic signal in a low power state by a first lower power processing core, the high power state of a second high power processing core is awoken from a sleep state to perform further processing on the acoustic signal (“detect a word or phrase or the like, which may initiate an activity by a device such as waking the device from a low power or sleep mode to an active mode based on detection of the key phrase”; Bocklet et al., paragraph 23).

As per claims 20, 42, Bocklet et al. in view of Sun et al. further disclose that the second processing core verifies the presence of at least one of the plurality of keywords in the acoustic signal before performing further processing of the acoustic signal to determine one or more commands within the acoustic signal (“if any of the key phrases are detected, system wake indicator 216 and/or system command 218 may be provided. Furthermore, system command 218 may be associated with a particular key phrase of the key phrases. For example, a first wake up command (e.g., key phrase) such as “Computer, Play Music” may wake the device (e.g., via system wake indicator 216) and play music (e.g., via a music play command implemented by system command 218) and a second wake up command (e.g., key phrase) such as “Computer, Do I Have Mail?” may wake the device”; Bocklet et al., paragraph 32).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Sainath et al. teach convolutional neural networks in a keyword spotting system. Kim et al. teach a method for performing functions by speech input. Khellah et al. teach detecting keywords in audio using a spiking neural network.



Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD SAINT CYR whose telephone number is (571)272-4247. The examiner can normally be reached Monday-Friday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/LEONARD SAINT CYR/Primary Examiner, Art Unit 2658