Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Acknowledgement  
Acknowledgement is made of applicant’s amendment filed on 02/16/2021. Applicant’s submission has been entered and made of record.
Status of the Claims
Claims 1, 3-14, and 16-20 are pending. 
Response to Applicant’s Argument
In response to “First, Sorin begins with an MFCC vector (having low order coefficients). In contrast, the claims recite outputting auditory features that include MFCCs based on convolved and thresholded input audio samples represented as a data vector. Second, Sorin describes a process of using an MFCC vector having low order coefficients to estimate higher order coefficients and generate a new MFCC vector. (Id.)” and “To the extent Sorin describes thresholding, the thresholding is performed on bins for regularization. (Sorin at col. 10:27-36.) Convolution and thresholding is never performed on input audio to a neural network that is trained to determine MFCCs of the input audio. Accordingly, Steelberg and Sorin, taken in combination, fail to teach or suggest the features of the claims”, the examiner responds as follows.
Steelberg discloses a system implementing the speech transcription / classification neural networks on a server connected to a distributed network (¶115). In particular, the ¶46). 
According to Sorin, in a speech recognition system such as Steelberg’s, where speech is recorded at a client device and processed at an ASR server (Sorin, Col 2, Rows 47-51), the high order cepstral (HOC) ending coordinates of MFCC vectors are truncated in order to conserve bandwidth prior to transmission to the server (Col 2, Rows 43-46, Col 3, Rows 1-4). Therefore, the method of Sorin for restoring the high order mel frequency cepstral coefficients of truncated MFCC vectors would be advantageous when a speech recognition engine / DSR front end receives bandwidth conserving / truncated MFCC vectors (Col 3, Rows 20-23 and Col 7, Rows 9-21).
Therefore, it would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement the first neural network in Steelberg’s server to output or reconstruct MFCC from truncated MFCC (¶46) using steps (a)-(j) of Sorin including step (e) calculating a spectral envelope for each sampled basis function by convolution with a Fourier transform of a windowing function (Col 5, Rows 11-14) and (g) identifying any coordinates of the synthetic (i.e., reconstructed) vector whose value does not exceed a predefined threshold (Col 4, Rows 55-57). This combination would render obvious the requirement of “the first artificial neural network being trained to receive convolved and thresholded input audio samples represented as a data vector and output mel-frequency cepstral coefficients auditory features based on the convolved and thresholded input audio samples”. 
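For illustration only, the operations named in the quoted limitation (convolving input audio samples and thresholding the result into a data vector) can be sketched generically as follows. This is a minimal sketch under stated assumptions: the function names, the kernel, and the strictly-greater-than threshold rule are hypothetical and are not taken from the claims, Steelberg, or Sorin.

```python
def convolve_valid(samples, kernel):
    """Discrete 1-D convolution, keeping only fully overlapping positions."""
    k = len(kernel)
    out = []
    for n in range(len(samples) - k + 1):
        # The kernel is flipped, which distinguishes convolution from correlation.
        out.append(sum(samples[n + j] * kernel[k - 1 - j] for j in range(k)))
    return out


def threshold(values, level):
    """Binarize: 1 where the value strictly exceeds the level, else 0 (assumed rule)."""
    return [1 if v > level else 0 for v in values]


def to_data_vector(samples, kernel, level):
    """Convolve, then threshold, yielding a fixed-format data vector."""
    return threshold(convolve_valid(samples, kernel), level)


if __name__ == "__main__":
    audio = [0, 1, 4, 9]                             # hypothetical input audio samples
    print(to_data_vector(audio, [1, -1], level=2))   # kernel [1, -1] is a first difference
```

The resulting vector is the kind of input that would then be supplied to a feature-extracting network; how such a network is trained is outside the scope of this sketch.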
Claim Rejections - 35 USC § 103
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 103 that form the basis for the rejections under this section made in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 3 and 5-11 are rejected under 35 USC 103(a) as being unpatentable over Steelberg et al. (US 2020/0075019 A1) in view of Sorin (US 8412526 B2) and Esser et al. (“Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”). 
Regarding Claim 1, Steelberg discloses a processor (¶119, processor with software instructions implementing modules to extract audio features and neural networks) comprising: 
a first artificial neural network, the first artificial neural network being trained to output auditory features based on input audio samples (¶46, extracting audio features of each segment using outputs of one or more layers of a neural network trained to perform speech recognition; ¶46, frontend neural network to perform feature engineering for each audio segment), the auditory features comprise mel-frequency cepstral coefficients (¶55-56, identifying domain mel frequency cepstral coefficients using a pre-trained speech recognition neural network); 
a second artificial neural network, the second artificial neural network being operatively coupled to the first artificial neural network and receiving therefrom the auditory features (¶46, outputs of one or more hidden layers of the frontend neural network can be used as inputs of an engine prediction neural network), the second artificial neural network being trained to output a classification of the input audio samples based on the auditory features (¶47-48 in view of ¶37, the engine prediction neural network can associate a certain set of dominant audio features to characteristics of one or more candidate domain specific engines; ¶50, select which engine to transcribe which segments of the audio file based on audio / cepstral features of the segment and the predicted word error rate of the engine associated with the segment). 
Steelberg does not disclose the first artificial neural network being trained to receive convolved and thresholded input audio samples represented as a data vector to output auditory features based on the convolved and thresholded input audio samples.
Sorin discloses a method for estimating high-order coefficients (HOC) of Mel Frequency Cepstral Coefficients to produce an output MFCC vector that improves speech recognition accuracy (Abstract and Col 3, Rows 25-33) by reconstructing bandwidth preserving / truncated MFCC input audio samples received at a server (Col 7, Rows 19-23 in view of Col 3, Rows 3-5 and Rows 20-23) comprising steps a) – h) (Col 4, Rows 1-18): 
step a) converting a truncated L-dimensional MFCC vector of low-order coefficients (LOC) to an N-dimensional binned spectrum, 
b) initializing N-L high-order coefficients (HOC) using predetermined values, 
c) computing an N-dimensional binned spectrum corresponding to the HOC, 
d) calculating a composite binned spectrum from both of the binned spectra using coordinate-wise multiplication, 
e) producing a basis bins matrix and basis function mixing coefficients by estimating at least one harmonic model parameter from the composite binned spectrum and a pitch frequency (Col 5, Rows 9-23) comprising calculating a spectral envelope for each sampled basis function by convolution with a Fourier transform of a windowing function, 
f) synthesizing a new binned spectrum by multiplying the basis bins matrix by the vector of the basis function mixing coefficients, 
g) regularizing the synthesized bins by identifying any coordinates of the synthetic vector whose value does not exceed a predefined threshold (Col 4, Rows 55-61), and 
h) estimating the HOC by converting the regularized synthesized bins to HOC.
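The data flow of steps a) through h) above can be sketched as follows, for illustration only. This is a simplified sketch under stated assumptions and not Sorin's implementation: the orthonormal DCT convention, the exponential/logarithm mapping between cepstrum and binned spectrum, the floor value, and all function names are assumptions. Steps e) and f) (the harmonic model) are omitted, and the regularization in step g) is assumed to replace sub-threshold coordinates with the threshold itself.

```python
import math


def dct2(x):
    """Orthonormal DCT-II (log bins -> cepstral coefficients)."""
    n = len(x)
    return [
        math.sqrt((1.0 if k == 0 else 2.0) / n)
        * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def idct2(c):
    """Inverse of dct2 (cepstral coefficients -> log bins)."""
    n = len(c)
    return [
        sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n)
            * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n)
        )
        for i in range(n)
    ]


def cepstrum_to_bins(ceps):
    """Map an N-dimensional cepstral vector to an N-dimensional binned spectrum."""
    return [math.exp(v) for v in idct2(ceps)]


def restore_hoc(low_order, n_bins, hoc_init, floor=1e-6):
    """Return N-L high-order coefficients; len(hoc_init) must equal n_bins - L."""
    L = len(low_order)
    loc_bins = cepstrum_to_bins(list(low_order) + [0.0] * (n_bins - L))  # step a)
    hoc_bins = cepstrum_to_bins([0.0] * L + list(hoc_init))              # steps b), c)
    composite = [p * q for p, q in zip(loc_bins, hoc_bins)]              # step d)
    synthesized = composite               # steps e), f) omitted in this sketch
    regularized = [max(b, floor) for b in synthesized]                   # step g), assumed rule
    return dct2([math.log(b) for b in regularized])[L:]                  # step h)


if __name__ == "__main__":
    print(restore_hoc([1.0, 0.5, -0.2], 6, [0.1, 0.0, -0.1]))
```

Because the harmonic model is omitted, the round trip simply returns the initial high-order values, which serves only as a consistency check on the transforms; in the method as described, steps e) and f) would alter the synthesized bins before regularization.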
Steelberg teaches that audio segments being processed by the neural networks are very noisy audio segments (¶40) where the neural networks were implemented on a server connected to a network (¶115). 
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement the first neural network of Steelberg to receive convolved and thresholded input audio samples represented as a data vector to produce MFCC output vectors as taught by Sorin in order to output dominant MFCC auditory features, which improves speech recognition accuracy, into the back-end neural network to predict a best candidate transcription / speech recognition engine  (Sorin, Col 3, Rows 25-33; Steelberg, ¶46). 
Steelberg does not disclose a neurosynaptic chip implementing the neural networks. 
Esser discloses implementing deep convolution neural networks on a neurosynaptic chip (Abstract, “Critically, we demonstrate that bringing the above innovations together allows us to create networks that approach state of the art accuracy performing inferences on 8 standard datasets, running on a neuromorphic chip between 1200 and 2600 frames per second, using between 25 and 275 mW.”).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement Steelberg’s deep / convolutional neural networks (Steelberg, ¶46) on Esser’s neurosynaptic chip to achieve deep convolution networks approaching state of the art classification accuracy while preserving the hardware’s underlying energy efficiency and high throughput (Esser, Abstract).
Regarding Claim 3, Steelberg discloses wherein the auditory features comprise linear predictive coding coefficients, perceptual linear predictive coefficients, spectral coefficients, filter bank coefficients, and/or spectro-temporal receptive fields (¶55-56, identifying domain mel frequency cepstral coefficients using a pre-trained speech recognition neural network). 
Regarding Claim 5, Steelberg discloses wherein the classification is of phonemes, words, or speech segments (¶46, using outputs of the neural network to perform speech recognition; ¶50, selecting the best engine to perform speech transcription based on dominant audio features). 
Regarding Claim 6, Steelberg discloses wherein the first neural network is a convolutional neural network (¶46, extract dominant audio features using a speech recognition neural network, which can be a convolutional neural network). 
Regarding Claim 7, Steelberg discloses wherein the second neural network is a convolutional neural network (¶46, outputs from the last hidden layer of a deep neural network can be used as inputs of an engine prediction neural network, which can be a fully-layered convolutional neural network). 
Regarding Claim 8, Steelberg discloses wherein the input audio samples comprise speech (¶46, extracting dominant audio features to perform speech recognition).
Regarding Claim 9, Steelberg as modified by Esser discloses wherein the first neural network is an EEDN network (Steelberg, ¶46, frontend speech recognition neural network being a convolution neural network; Esser, Abstract, implement a deep convolution neural network on neuromorphic chip to preserve hardware processor’s underlying energy efficiency). 
Regarding Claim 10, Steelberg as modified by Esser discloses wherein the second neural network is an EEDN network (Steelberg, ¶46, fully layered convolutional neural network at the backend; Esser, Abstract, implement a deep convolution neural network on neuromorphic chip to preserve hardware processor’s underlying energy efficiency). 
Regarding Claim 11, Steelberg discloses a buffer between the first and second artificial neural networks, the buffer being configured to collect the mel-frequency cepstral coefficients from the first neural network and provide batches of the mel-frequency cepstral coefficients to the second neural network (Fig. 5 and see ¶46, outputs of one or more hidden layers of the deep speech neural network can be used as inputs of an engine prediction neural network). 
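For illustration only, a buffer of the kind recited in Claim 11 (collecting per-frame coefficients from a first stage and releasing them in batches to a second stage) can be sketched as follows. The class name `FeatureBuffer`, its methods, and the fixed batch size are hypothetical; nothing here is taken from Steelberg's Fig. 5 or ¶46.

```python
from collections import deque


class FeatureBuffer:
    """FIFO buffer that collects feature frames and hands them out in batches."""

    def __init__(self, batch_size):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self._frames = deque()

    def push(self, frame):
        """Store one frame of coefficients produced by the first network."""
        self._frames.append(list(frame))

    def pop_batch(self):
        """Return the oldest batch_size frames, or None if too few are buffered."""
        if len(self._frames) < self.batch_size:
            return None
        return [self._frames.popleft() for _ in range(self.batch_size)]

    def __len__(self):
        return len(self._frames)


if __name__ == "__main__":
    buf = FeatureBuffer(batch_size=2)
    for i in range(5):
        buf.push([float(i), i + 0.5])   # stand-in for one frame of coefficients
    print(buf.pop_batch(), buf.pop_batch(), buf.pop_batch(), len(buf))
```

A fixed batch size is one design choice among several; a time-based or variable-length policy would fit the same interface.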
Claim 4 is rejected under 35 USC 103(a) as being unpatentable over Steelberg et al. (US 2020/0075019 A1) in view of Sorin (US 8412526 B2) and Esser et al. (“Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”) as applied to claim 1, in further view of Fayek (“Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What's In-Between”).
Steelberg does not disclose wherein the auditory features comprise a combination of linear predictive coding coefficients, perceptual linear predictive coefficients, spectral coefficients, filter bank coefficients, and/or spectro-temporal receptive fields. 
Fayek discloses wherein the auditory features comprise a combination of linear predictive coding coefficients, perceptual linear predictive coefficients, spectral coefficients, filter bank coefficients, and/or spectro-temporal receptive fields (“a signal goes through a pre-emphasis filter; then gets sliced into (overlapping) frames and a window function is applied to each frame; afterwards, we do a Fourier transform on each frame (or more specifically a Short-Time Fourier Transform) and calculate the power spectrum; and subsequently compute the filter banks. To obtain MFCCs, a Discrete Cosine Transform (DCT) is applied to the filter banks retaining a number of the resulting coefficients while the rest are discarded”).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement Steelberg’s deep / convolutional neural networks to extract MFCC (Steelberg, ¶46) by using a combination of at least spectral coefficients and filter bank coefficients as taught by Fayek in order to extract dominant mel-frequency cepstral coefficients (Steelberg, ¶46).
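For illustration only, the processing chain described in the quoted Fayek passage (pre-emphasis, framing, windowing, Fourier transform and power spectrum, filter banks, then a DCT to obtain MFCCs) can be sketched as follows. This is a minimal sketch under stated assumptions: the Hamming window, the naive DFT, the natural-log compression, the frame sizes, and every function name and default value are assumptions chosen for brevity, not parameters confirmed by the cited references.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum (bins 0..N/2) of one windowed frame."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(acc) ** 2 / n)
    return spec


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    mels = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    edges = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):          # rising slope
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):          # falling slope
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank


def mfcc(signal, sample_rate, frame_len=64, hop=32, n_filters=10, n_ceps=5, pre=0.97):
    """Return one list of n_ceps cepstral coefficients per frame."""
    emph = [signal[0]] + [signal[i] - pre * signal[i - 1] for i in range(1, len(signal))]
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * t / (frame_len - 1)) for t in range(frame_len)]
    out = []
    for start in range(0, len(emph) - frame_len + 1, hop):
        frame = [emph[start + t] * window[t] for t in range(frame_len)]
        spec = power_spectrum(frame)
        # Filter bank energies, floored to avoid log(0), then log-compressed.
        log_e = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10)) for row in bank]
        # DCT-II of the log energies; only the first n_ceps coefficients are retained.
        out.append([
            sum(log_e[i] * math.cos(math.pi * (i + 0.5) * k / n_filters) for i in range(n_filters))
            for k in range(n_ceps)
        ])
    return out


if __name__ == "__main__":
    tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(256)]
    feats = mfcc(tone, 8000)
    print(len(feats), len(feats[0]))
```

In this sketch the filter bank energies are an intermediate result of the MFCC computation, which is the relationship between filter bank coefficients and MFCCs that the quoted passage describes.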
Claim 12 is rejected under 35 USC 103(a) as being unpatentable over Steelberg et al. (US 2020/0075019 A1) in view of Sorin (US 8412526 B2) and Esser et al. (“Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing”) as applied to claim 1, in further view of Misra et al. ("New Entropy Based Combination Rules in HMM/ANN Multi-Stream ASR”).
Steelberg does not disclose wherein the first neural network is further trained to output derivatives of the mel-frequency cepstral coefficients. 
Misra discloses a plurality of artificial neural network based speech recognizers (Fig. 1 and see p. 741), wherein the speech recognizers implement a neural network trained to output derivatives of cepstral coefficients (p. 743, 4. Experimental Setup and Fig. 1, artificial neural network / ANN accepts 12 dimensional raw cepstral coefficients to extract 13 dimensional delta cepstral coefficients (first derivative), and 13 dimensional delta-delta cepstral coefficients (second derivative)).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to modify front end speech recognition neural network of Steelberg (Steelberg, ¶46, frontend speech recognition neural network) to output derivatives of the mel-frequency cepstral coefficients as taught by Misra in order to implement speech recognition neural network with lesser entropy and therefore more reliable classification (Misra, p. 741, 1. Introduction).
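For illustration only, first and second derivatives (delta and delta-delta) of a sequence of cepstral vectors are commonly computed with a regression over neighboring frames, as sketched below. The regression formula, the window width of 2, the edge handling by frame replication, and the function name `deltas` are assumptions; the cited passage of Misra is not relied on for this particular formula.

```python
def deltas(frames, width=2):
    """Per-frame time derivative of each coefficient via a regression window.

    d_t = sum_{n=1..width} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..width} n^2)
    Frames beyond either end are replaced by the nearest edge frame.
    """
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, last)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


if __name__ == "__main__":
    ceps = [[0.0], [1.0], [2.0], [3.0], [4.0]]   # hypothetical one-coefficient frames
    first = deltas(ceps)       # delta (first derivative)
    second = deltas(first)     # delta-delta (second derivative)
    print(first[2], second[2])
```

Applying the same function twice yields the delta-delta coefficients; for a linearly increasing coefficient the interior delta equals the slope and the interior delta-delta is zero.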
Claims 13-14 and 16-20 are rejected under 35 USC 103(a) as being unpatentable over Steelberg et al. (US 2020/0075019 A1) in view of Sorin (US 8412526 B2). 
Regarding Claim 13, Steelberg discloses a method comprising: 
training a first neural network to output auditory features based on input audio samples (¶39 and ¶46, speech recognition neural network trained to perform speech to text classification and using the speech recognition neural network to extract relevant dominant features of the audio file), wherein the output auditory features comprise mel-frequency cepstral coefficients (¶55-56); 
training a second neural network to output a classification based on input auditory features (¶46, backend fully layered CNN trained to predict a best candidate transcription engine given a set of outputs of one or more layers of the frontend neural network). 
Steelberg does not disclose the first artificial neural network being trained to output auditory features based on convolved and thresholded input audio samples represented as a data vector.
Sorin discloses a method for estimating high-order coefficients (HOC) of Mel Frequency Cepstral Coefficients to produce an output MFCC vector that improves speech recognition accuracy (Abstract and Col 3, Rows 25-33) comprising steps a) – h) (Col 4, Rows 1-18): 
step a) converting a truncated L-dimensional MFCC vector of low-order coefficients (LOC) to an N-dimensional binned spectrum, 
b) initializing N-L high-order coefficients (HOC) using predetermined values, 
c) computing an N-dimensional binned spectrum corresponding to the HOC, 
d) calculating a composite binned spectrum from both of the binned spectra using coordinate-wise multiplication, 
e) producing a basis bins matrix and basis function mixing coefficients by estimating at least one harmonic model parameter from the composite binned spectrum and a pitch frequency (Col 5, Rows 9-23) comprising calculating a spectral envelope for each sampled basis function by convolution with a Fourier transform of a windowing function, 
f) synthesizing a new binned spectrum by multiplying the basis bins matrix by the vector of the basis function mixing coefficients, 
g) regularizing the synthesized bins by identifying any coordinates of the synthetic vector whose value does not exceed a predefined threshold (Col 4, Rows 55-61), and 
h) estimating the HOC by converting the regularized synthesized bins to HOC.
Steelberg teaches that audio segments being processed by the neural networks are very noisy audio segments (¶40) where the neural networks were implemented on a server connected to a network (¶115). 
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement the first neural network of Steelberg to receive convolved and thresholded input audio samples represented as a data vector to produce MFCC output vectors as taught by Sorin in order to output dominant MFCC auditory features, which improves speech recognition accuracy, into the back-end neural network to predict a best candidate transcription / speech recognition engine  (Sorin, Col 3, Rows 25-33; Steelberg, ¶46). 
Regarding Claim 14, Steelberg discloses providing an input audio sample to the first neural network (¶46, feed audio segments to speech recognition neural network (frontend neural network)); 
receiving from the first neural network auditory features (¶46, speech recognition neural network extracts dominant features of the audio file segment; i.e., dominant mel-frequency cepstral coefficients); 
providing the auditory features to the second neural network (¶46, outputs of one or more hidden layers of the deep speech neural network can be used as inputs of an engine prediction neural network); 
receiving from the second neural network a classification of the input audio sample (¶47-48 in view of ¶37, the engine prediction neural network can associate a certain set of dominant audio features to characteristics of one or more candidate domain specific engines; ¶50, select which engine to transcribe which segments of the audio file based on audio / cepstral features of the segment and the predicted word error rate of the engine associated with the segment). 
Regarding Claim 16, Steelberg discloses wherein the auditory features comprise linear predictive coding coefficients, perceptual linear predictive coefficients, spectral coefficients, filter bank coefficients, and/or spectro-temporal receptive fields (¶55-56, identifying domain mel frequency cepstral coefficients using a pre-trained speech recognition neural network). 
Regarding Claim 17, Steelberg discloses wherein the classification is of phonemes, words, or speech segments (¶39 and ¶46, using outputs of the neural network to perform speech recognition / speech to text classification; ¶50, selecting the best engine to perform speech transcription based on dominant audio features). 
Regarding Claim 18, Steelberg discloses wherein the first and/or second neural network is a convolutional neural network (¶46, extract dominant audio features using a speech recognition neural network, which can be a convolutional neural network; ¶46, outputs from the last hidden layer of a deep neural network can be used as inputs of an engine prediction neural network, which can be a fully-layered convolutional neural network). 
Regarding Claim 19, Steelberg discloses wherein the input audio samples comprise speech (¶46, extracting dominant audio features to perform speech recognition).
Regarding Claim 20, Steelberg discloses a method comprising: 
providing an input audio sample to a first neural network (¶46, feed audio segments to speech recognition neural network (frontend neural network)); 
receiving from the first neural network mel-frequency cepstral coefficients (¶46, speech recognition neural network extracts dominant features of the audio file segment; i.e., dominant mel-frequency cepstral coefficients); 
providing the mel-frequency cepstral coefficients to a second neural network (¶46, outputs of one or more hidden layers of the deep speech neural network can be used as inputs of an engine prediction neural network); 
receiving from the second neural network a classification of the input audio sample (¶47-48 in view of ¶37, the engine prediction neural network can associate a certain set of dominant audio features to characteristics of one or more candidate domain specific engines; ¶50, select which engine to transcribe which segments of the audio file based on audio / cepstral features of the segment and the predicted word error rate of the engine associated with the segment).
Steelberg does not disclose convolving and thresholding input audio samples to represent the input audio samples as a data vector and providing the convolved and thresholded input audio sample to the first neural network.
Sorin discloses a method for estimating high-order coefficients (HOC) of Mel Frequency Cepstral Coefficients to produce an output MFCC vector that improves speech recognition accuracy (Abstract and Col 3, Rows 25-33) comprising steps a) – h) (Col 4, Rows 1-18): 
step a) converting a truncated L-dimensional MFCC vector of low-order coefficients (LOC) to an N-dimensional binned spectrum, 
b) initializing N-L high-order coefficients (HOC) using predetermined values, 
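c) computing an N-dimensional binned spectrum corresponding to the HOC, 
d) calculating a composite binned spectrum from both of the binned spectra using coordinate-wise multiplication, 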
e) producing a basis bins matrix and basis function mixing coefficients by estimating at least one harmonic model parameter from the composite binned spectrum and a pitch frequency (Col 5, Rows 9-23) comprising calculating a spectral envelope for each sampled basis function by convolution with a Fourier transform of a windowing function, 
f) synthesizing a new binned spectrum by multiplying the basis bins matrix by the vector of the basis function mixing coefficients, 
g) regularizing the synthesized bins by identifying any coordinates of the synthetic vector whose value does not exceed a predefined threshold (Col 4, Rows 55-61), and 
h) estimating the HOC by converting the regularized synthesized bins to HOC.
Steelberg teaches that audio segments being processed by the neural networks are very noisy audio segments (¶40) where the neural networks were implemented on a server connected to a network (¶115). 
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement the first neural network of Steelberg to receive convolved and thresholded input audio samples represented as a data vector to produce MFCC output vectors as taught by Sorin in order to output dominant MFCC auditory features, which improves speech recognition accuracy, into the back-end neural network to predict a best candidate transcription / speech recognition engine  (Sorin, Col 3, Rows 25-33; Steelberg, ¶46). 

Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to examiner Richard Z. Zhu whose telephone number is 571-270-1587 or examiner’s supervisor King Y. Poon whose telephone number is 571-272-7440. Examiner Richard Zhu can normally be reached M-Th, 0730-1700.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or 
/RICHARD Z ZHU/
Primary Examiner, Art Unit 2675
05/12/2021