DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to the remarks filed April 29, 2021.  No claims are amended, added, or cancelled.  Claims 1, 3-10, and 12-19 are pending.

Claim Rejections - 35 USC § 103
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1, 5-10, and 14-19 are rejected under 35 U.S.C. 103 as being unpatentable over Bacchiani et al (US Patent Application Publication No. 2015/0127327) in view of Chung et al (US Patent Application Publication No. 2016/0078863) further in view of Lee et al (US Patent Application Publication No. 2015/0163310) and further in view of Garimella et al (US Patent No. 9,378,735).
Bacchiani teaches context-dependent state tying using a first and second neural network.  Regarding claims 1, 10, and 19, Bacchiani teaches a speech recognition method (para [0004], [0016], [0020], [0024]) (A non-transitory computer-readable medium storing a set of instructions that is executable by one or more processors of an apparatus to cause the apparatus to perform a method for speech recognition (para [0004], [0016], [0020], [0024], [0050]), the method comprising: A speech recognition apparatus (para [0004], [0016], [0020], [0024])), comprising: extracting, via a first neural network, a vector containing recognition features [para 0020-0024 -- computing device 120 receives the audio signal 112 and obtains information about acoustic features of the audio signal 112] from speech data (e.g., an initial neural network can be trained using a set of acoustic observation vectors to extract features used in the state-tying process, para [0016]; e.g., receives data about an audio signal 210 that includes speech to be recognized... the feature extraction may be performed using a neural network such as the neural network 130, para [0039]; e.g., extracts features that are represented as an acoustic feature vector 240 for the window 220, para [0040]; para [0024]-[0025]), the first neural network receives the speech data as input [Figure 3, element 304; para 0041 -- The neural network 140 can be configured to receive one or more of the feature vectors as inputs, where an acoustic feature vector represents characteristics of the portion of the audio signal 210 within its corresponding window 220 -- where acoustic features representing characteristics of the portion of the audio signal is a form of speech data]; implementing bias in a second neural network in accordance with the vector containing the speaker recognition features (e.g., In the training only the weights and biases of the softmax layer were optimized, para [0073]; e.g., the weight vector associated with the output neuron of the neural network 130 for the n-th class, b_n denotes the bias of that neuron, para [0025] -- the neural network 130 includes an input layer 131 to which the acoustic observation vectors 122 are presented. The input layer 131 can be connected with trainable weights to a stack of hidden layers 133 and an output layer 135 (which may also be referred to as a softmax layer). The hidden layers 133 can be configured to compute weighted sums of the activation vectors received from the connections and a bias, and output activation vectors based on a non-linear function applied to the sums); and the second neural network (e.g., second neural network, para [0047]).
Bacchiani teaches recognizing speech, via an acoustic model based on the second neural network, in the speech data [para 0024 -- the computing device 120 may generate a set of feature vectors, where each feature vector indicates audio characteristics during a different portion or window of the audio signal 112. Each feature vector may indicate acoustic properties of, for example, a 10 millisecond (ms), 25 ms, or 50 ms portion of the audio signal 112. During run time, for example, when performing automatic speech recognition (ASR), the computing device 120 can process the audio signal 112 to obtain transcription data 160 for the same, using one or more neural networks].  Bacchiani fails to specifically teach the recognition features are received from the first neural network.
In a similar field of endeavor, Chung teaches a signal processing integrated deep neural network (DNN) based speech recognition apparatus, in which the signal processing DNN extracts feature parameters and a classification DNN performs speech recognition using the extracted feature parameters [para 0018 -- processor may input the feature parameter output from the signal processing DNN to the classification DNN; 0031; 0048; 0056], such that the signal processing DNN (“first neural network”) is fused with the classification DNN (“second neural network”).  Chung teaches the signal processing DNN/classification DNN fusing is beneficial in maximizing speech recognition performance [para 0009].  Therefore, one having ordinary skill in the art at the time of the invention would have recognized the advantages of specifically providing for the recognition features to be received from the first neural network, as suggested by Chung, for the purpose of maximizing recognition performance.
Bacchiani fails to teach the extracted speech features are a vector containing speaker dependent features from the speech data and fails to teach wherein determining a bias term in the second neural network in accordance with the vector containing the speaker recognition features includes: multiplying the vector containing the speaker recognition features by a weight matrix to be a bias term of the second neural network.  However, Garimella, in an analogous art, discloses estimating speaker-specific affine transforms for neural network based speech recognition, where the speaker-computing bias portion of the affine transform; col. 9, line 15 to col. 10, line 57).
Lee, in an analogous art, teaches compensating bias in the second neural network in accordance with the vector containing the speaker recognition features includes: multiplying the vector containing the speaker recognition features by a weight matrix to be a bias term of the second neural network (para [0072]-[0077], [0149], [0156]).  It would have been obvious to one of ordinary skill in the art at the time of the invention to have modified the system of Bacchiani/Chung by including compensating bias in the second neural network in accordance with the vector containing the speaker-dependent recognition features includes: multiplying the vector containing the speaker recognition features by a weight matrix to be the bias term of the second neural network as taught by Garimella and Lee, for the purpose of improving the accuracy of speech recognition (Lee at para 0007 and Garimella, col. 2, lines 31-33).
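For illustration only, the operation recited in the claim limitation discussed above (multiplying the vector containing the speaker recognition features by a weight matrix to obtain a bias term of the second neural network) may be sketched as follows. This sketch is not taken from Bacchiani, Chung, Garimella, or Lee; the function names, the toy dimensions, and the choice of a sigmoid activation are assumptions made solely for the example.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def speaker_bias(weight_matrix, speaker_vector):
    """Hypothetical helper: bias term = weight matrix times speaker feature vector."""
    return matvec(weight_matrix, speaker_vector)

def layer_forward(layer_weights, inputs, bias):
    """One layer of the second network: sigmoid(W x + b). The sigmoid is an assumption."""
    pre_activation = [z + b for z, b in zip(matvec(layer_weights, inputs), bias)]
    return [1.0 / (1.0 + math.exp(-z)) for z in pre_activation]

# Toy values (assumed): a 2-dimensional speaker vector and a 3x2 weight matrix.
speaker_vector = [1.0, 2.0]
bias_weight_matrix = [[0.5, 0.0], [0.0, -0.5], [1.0, 1.0]]
bias = speaker_bias(bias_weight_matrix, speaker_vector)  # [0.5, -1.0, 3.0]

# Zero layer weights isolate the effect of the speaker-derived bias.
layer_weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
acoustic_input = [0.3, 0.7]
output = layer_forward(layer_weights, acoustic_input, bias)
```

With the layer weights set to zero, each output equals the sigmoid of the corresponding bias entry, which shows that the layer output in this sketch depends on the speaker vector only through the bias term.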
Regarding claims 3 and 12, the combination of Bacchiani, Chung, Garimella and Lee teaches wherein the first neural network, the second neural network, and the weight matrix are trained through: training the first neural network and the second neural network respectively (Lee at para [0072]-[0077], [0083]-[0084], [0104]-[0107]); and collectively training the trained first neural network, the weight matrix, and the trained second neural network (Lee at para [0072]-[0077], [0083]-[0084], [0104]-[0107]).
Regarding claims 4 and 13, the combination of Bacchiani, Chung, Garimella and Lee teaches  initializing the first neural network, the second neural network, and the weight matrix (Lee at para [0072]-[0077], [0149], [0156]); updating the weight matrix using a back propagation 
Regarding claims 5 and 14, the combination of Bacchiani, Chung, Garimella and Lee teaches wherein the speaker recognition features include at least speaker voiceprint information (Garimella at col. 2, lines 37-47 -- audio data is received and converted into a sequence of frames. Each frame is then processed to create a speaker-specific feature vector (or transformed feature vector) using a sequence of steps; col. 4, lines 18-28).
Regarding claims 6 and 15, the combination of Bacchiani, Chung, Garimella and Lee teaches wherein determining the bias term in the second neural network in accordance with the vector containing the speaker recognition features includes: determining the bias term at all or a part of layers, except for an input layer (e.g., In the training only the weights and biases of the softmax layer were optimized, Bacchiani at para [0073]; e.g., the weight vector associated with the output neuron of the neural network 130 for the n-th class, b_n denotes the bias of that neuron, para [0025] -- the neural network 130 includes an input layer 131 to which the acoustic observation vectors 122 are presented. The input layer 131 can be connected with trainable weights to a stack of hidden layers 133 and an output layer 135 (which may also be referred to as a softmax layer). The hidden layers 133 can be configured to compute weighted sums of the activation vectors received from the connections and a bias, and output activation vectors based on a non-linear function applied to the sums; where Garimella provides for the bias term); and the second neural network (e.g., second neural network, para [0047]), in the second neural network in accordance with the vector containing the speaker recognition features, wherein the vector containing the
Regarding claims 7 and 16, the combination of Bacchiani, Chung, Garimella and Lee teaches wherein determining bias at all or a part of layers, except for an input layer, in the second neural network in accordance with the vector containing the speaker recognition features includes: transmitting the vector containing the speaker recognition features, output by nodes at the last hidden layer of the first neural network, to bias nodes corresponding to the all or the part of layers, except for the input layer, in the second neural network (Bacchiani at para [0025]-[0026] -- the neural network 130 includes an input layer 131 to which the acoustic observation vectors 122 are presented. The input layer 131 can be connected with trainable weights to a stack of hidden layers 133 and an output layer 135 (which may also be referred to as a softmax layer). The hidden layers 133 can be configured to compute weighted sums of the activation vectors received from the connections and a bias, and output activation vectors based on a non-linear function applied to the sums, [0036], [0041], [0047], [0073]).
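As a rough sketch of the arrangement recited in claims 7 and 16, the following shows the output of the last hidden layer of a first network being passed to the bias of some, but not all, non-input layers of a second network. It is a reading of the claim language only and is not drawn from any cited reference; the network sizes, the sigmoid activation, the per-layer bias matrices, and all names are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sigmoid_layer(weights, inputs, bias):
    """Compute sigmoid(W x + b) for one layer (activation choice is an assumption)."""
    return [1.0 / (1.0 + math.exp(-(z + b)))
            for z, b in zip(matvec(weights, inputs), bias)]

def first_network_last_hidden(speech_frame, hidden_weights):
    """Hypothetical first network with one hidden layer; returns that layer's output."""
    return sigmoid_layer(hidden_weights, speech_frame, [0.0] * len(hidden_weights))

def second_network_forward(speech_frame, layers, speaker_vector):
    """Run the second network.

    `layers` lists every non-input layer as (weights, bias_matrix). A bias_matrix
    of None marks a layer that receives no speaker-derived bias.
    """
    activation = speech_frame
    for weights, bias_matrix in layers:
        if bias_matrix is None:
            bias = [0.0] * len(weights)
        else:
            bias = matvec(bias_matrix, speaker_vector)
        activation = sigmoid_layer(weights, activation, bias)
    return activation

# Toy values (assumed): a 2-dimensional speech frame.
speech_frame = [0.2, -0.1]
speaker_vector = first_network_last_hidden(speech_frame, [[1.0, 0.0], [0.0, 1.0]])

layers = [
    # Hidden layer: receives a bias computed from the speaker vector.
    ([[0.5, 0.5], [0.1, -0.1], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    # Output layer: receives no speaker-derived bias in this example.
    ([[1.0, 1.0, 1.0]], None),
]
output = second_network_forward(speech_frame, layers, speaker_vector)
```

Marking a layer with `None` is one simple way to represent "a part of layers"; supplying a bias matrix for every entry would correspond to "all" layers other than the input layer.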
Regarding claims 8 and 17, the combination of Bacchiani, Chung, Garimella and Lee teaches wherein the speech data is collected original speech data or speech features extracted from the collected original speech data (Bacchiani at para [0022], [0039], [0066]).

Claims 9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Bacchiani in view of Chung, Garimella and Lee as applied to claims 1 and 10 above, and further in view of Yu et al (US Patent Application Publication No. 2015/0269933).
Regarding claims 9 and 18, Bacchiani fails to teach wherein the speaker recognition features correspond to different users, or correspond to clusters of different users.  However, Yu, .


Response to Arguments
Applicant's arguments filed April 29, 2021 have been fully considered but they are not persuasive. 
Applicant argues Garimella’s estimating the bias term of the affine transform function {Af, bf} cannot correspond to the claimed “determining a bias term of a neural network based acoustic model based on the speaker specific feature vectors.”   Applicant argues, Garimella even fails to disclose “modifying the general acoustic model” or “creating a speaker specific acoustic model” in accordance with the speaker specific features. Therefore, Garimella fails to teach or suggest at least “determining a bias term of a second neural network [on which an acoustic model is based] in accordance with the vector containing the speaker dependent features” and “multiplying the vector containing the speaker dependent features by a weight matrix to generate the bias term of the second neural network,” as recited in amended claim 1 (emphasis added).  Applicant also argues Lee does not even mention a term “bias” nor “speaker recognition features (or vectors).” Nowhere, including the cited portions, does Lee disclose “determining a bias term” of any neural network including a neural network based acoustic model in accordance with the vector containing the speaker recognition features.  In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).  
The Examiner notes that Bacchiani teaches the acoustic model based on the second neural network.  Garimella was cited for teaching speaker specific transforms based on speaker training data stored as feature vectors (speaker specific feature vectors), where the affine transform may be estimated by minimizing the least squares error between corresponding linear and bias transform parts for the resultant neural network feature vector and the speaker-specific feature vector obtained for a GMM-based acoustic model using constrained Maximum Likelihood Linear Regression techniques (bias term for speaker dependent features).  Lee was cited for teaching the concept of multiplying a vector of speech recognition features by a weight matrix (Lee’s acoustic embedding matrix) to determine a bias term.  One having ordinary skill at the time of the invention would have recognized the advantages of modifying the system of Bacchiani/Chung by including compensating bias in the second neural network in accordance with the vector containing the speaker-dependent recognition features includes: multiplying the vector containing the speaker recognition features by a weight matrix to be the bias term of the second neural network as suggested by Garimella and Lee, for the purpose of improving the accuracy of speech recognition (Lee at para 0007 and Garimella, col. 2, lines 31-33).
In response to applicant's argument that the references fail to show certain features of applicant's invention, it is noted that the features upon which applicant relies (i.e., “modifying the general acoustic model” or “creating a speaker specific acoustic model”) are not recited in the rejected claim(s).  Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims.  See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993).  The Examiner notes that claims 1, 10, and 19 recite speaker dependent features from speech data, and thus, Garimella’s speaker specific transforms based on speaker training data stored as feature vectors provide adequate support for the broadly claimed speaker dependent features from speech data.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
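The timing rule stated in the preceding paragraph can be sketched as a small date calculation. This is only an illustration of the rule as worded above: the function names and the example dates are hypothetical, the treatment of month ends (clamping to the last day of the month) is an assumption, and weekends, holidays, and extension-of-time fees are not modeled.

```python
import calendar
from datetime import date

def add_months(start, months):
    """Return the same day-of-month `months` later, clamped to the month's last day."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def shortened_period_expiry(mailing_date, first_reply_date=None, advisory_mailed=None):
    """Expiry of the shortened statutory period under the rule quoted above."""
    two_month = add_months(mailing_date, 2)
    three_month = add_months(mailing_date, 3)
    six_month = add_months(mailing_date, 6)

    expiry = three_month
    # A first reply within two months, followed by an advisory action mailed after
    # the three-month date, moves the expiry to the advisory action's mailing date.
    if (first_reply_date is not None and advisory_mailed is not None
            and first_reply_date <= two_month and advisory_mailed > three_month):
        expiry = advisory_mailed
    # The period never runs past six months from the mailing date.
    return min(expiry, six_month)

# Hypothetical dates; the mailing date of this action is not stated in the text above.
mailed = date(2021, 6, 1)
default_expiry = shortened_period_expiry(mailed)
late_advisory = shortened_period_expiry(mailed, date(2021, 7, 15), date(2021, 9, 20))
capped = shortened_period_expiry(mailed, date(2021, 7, 15), date(2022, 1, 10))
```

In this sketch, `default_expiry` falls three months after the hypothetical mailing date, `late_advisory` falls on the advisory action's mailing date, and `capped` is limited to the six-month date.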

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANGELA A ARMSTRONG whose telephone number is (571)272-7598.  The examiner can normally be reached on M,T,TH,F 11:30-8:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


ANGELA A. ARMSTRONG
Primary Examiner
Art Unit 2659



/ANGELA A ARMSTRONG/Primary Examiner, Art Unit 2659