DETAILED ACTION
Introduction
This office action is in response to Applicant’s submission filed on 08/18/2022. Claims 1-4, 9-13, 15 and 17-23 are pending in the application and have been examined.
	
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
The response filed on 08/18/2022 has been correspondingly accepted and considered in this Office Action. Claims 1-4, 9-13, 15 and 17-23 have been examined. Claim 16 has been cancelled. Applicant’s amendments to claims 1, 13, 17, 18, 22 and 23 have been noted.

Response to Arguments
Applicant's arguments filed 08/18/2022 have been fully considered as follows:
Applicant’s arguments with respect to claims 1, 22 and 23 on pg. 9 state that
“While Qian may disclose a gradient descent that measures a discrepancy between a given input and a target output, Qian does not appear to disclose or suggest also performing the gradient descent on the extracted features themselves and this discrepancy to generate enhanced features as recited in the independent claims.”
	
Applicant’s arguments with respect to claim(s) 1, 22 and 23 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
With respect to the remaining dependent claims rejected under 35 U.S.C. 103, to the extent that those claims are argued on the same rationale presented in the Remarks filed 08/18/2022, Examiner respectfully directs Applicant to the response given above for independent claims 1, 22 and 23. For at least the same reasons, Applicant's arguments regarding the dependent claims have been fully considered but are not persuasive.
Claim Rejections - 35 USC § 103
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1, 2, 9-13, 15, 17, 20-23 are rejected under 35 U.S.C. 103 as being unpatentable over Lei, Y., Scheffer, N., Ferrer, L., & McLaren, M. (2014, May). A novel scheme for speaker recognition using a phonetically-aware deep neural network. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 1695-1699), in view of Y. Qian, Y. Fan, W. Hu and F. K. Soong, "On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3829-3833 further in view of Menezes (WO Publication 2017142775).
Regarding claim 1, Lei teaches window an audio input signal to extract features from the audio input signal, generate a first sequence of acoustic unit observations by propagating the extracted features through a neural network, decoding the first sequence of acoustic unit observations to obtain a transcript of the audio input signal (see Lei, pg. 1696, sect. 2, The i-vector used to represent the speech signal is the maximum a posteriori (MAP) point estimate of the latent vector ω(i). It is noted that the alignments can be replaced by the prior (e.g., weights of the UBM) in equation (1); see Lei, pg. 1696, sect. 4, Figure 1 presents a flow diagram for training a DNN for ASR. A pre-trained hidden Markov model (HMM) ASR system with GMM states is needed to generate alignments for the subsequent DNN training; the prior is interpreted as the first sequence of acoustic units and the i-vector is interpreted as the transcript), perform forced alignment on the transcript to obtain a second sequence of acoustic unit observations (see Lei, pg. 1696, sect. 4, Once the set of senones is defined, a Viterbi decoder is used to align the training data into the corresponding senones. These alignments are used to estimate the observation probability distribution p(x|q), where x is an observation vector in the training data and q is the senone); perform gradient descent based on the extracted features and differences between the first and second sequences of acoustic unit observations to generate enhanced features (see Lei, pg. 1697, sect. 4, To solve this problem we propose to directly use the posteriors from the DNN in the ASR system as the γs in eq 3. In acoustic modeling, DNNs have been shown to outperform GMM-based models by a significant margin, due to the fact that they use longer context windows and are discriminatively trained. As a result, a DNN model gives a much better estimate of the senone posterior than the supervised UBM.
Note that an important characteristic of our approach is that one does not have to compromise by designing a feature that works well for both ASR and SID. Indeed, the DNN system can use completely different features from the features used for speaker recognition, as long as it improves the estimate of the posterior probability; the senone posterior is interpreted as the second sequence of acoustic unit observations). However, Lei fails to teach perform vocoding based on the enhanced features to produce an enhanced audio signal.
However, Qian teaches perform vocoding based on the enhanced features to produce an enhanced audio signal (see Qian, Fig. 1 and pg. 3831, col. 1, lines 2-11, which teaches
[Figure: Qian, Fig. 1; image not reproduced]
in synthesis, the input text is converted first into input feature vector through the text analysis, then input feature vectors are mapped to output vectors by a trained DNN using forward propagation. By setting the predicted output features from the DNN as mean vectors and pre-computed (global) variances of output features from all training data, the speech feature generation module can generate smooth trajectories of speech parameter features which satisfy the statistics of static and dynamic features. Finally, speech waveform is synthesized with the generated speech parameters).
Lei and Qian are both considered to be analogous to the claimed invention because both relate to automated speech processing using DNN. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Lei on the DNN processing for ASR with the parametric synthesis teachings of Qian to improve the performance of vocoder based speech synthesis using DNNs (see Qian, pg. 3829 col. 2 lines 9-12).
However, Lei in view of Qian does not teach an electronic device comprising circuitry.
However, Menezes teaches an electronic device, comprising circuitry configured to obtain a transcript of an audio input signal including extracted features (see Menezes, [00028-0029] As shown in FIG. 2, this assistive hearing device 200 has an assistive hearing module 202 that is implemented on a computing device 800 such as is described in greater detail with respect to FIG. 8. A speech recognition module 224 on the assistive hearing device 200 converts the received audio 206 to text 228. For example, in some implementations the speech recognition module 224 extracts features from the speech in the audio 206 signals and uses speech models to determine what is being said in order to transcribe the speech to text and thereby generate a transcript 228 of the speech).
Lei, Qian and Menezes are considered to be analogous to the claimed invention because they relate to automated speech processing and speech synthesis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the ASR DNN and parametric speech synthesis teachings of Lei and Qian with the enhanced speech techniques of Menezes to improve the audibility of speech (see Menezes, [0003]).
Regarding claim 2, Lei in view of Qian further in view of Menezes teach the electronic device of claim 1. Menezes further teaches wherein the circuitry is further configured to obtain the transcript of the audio input signal by Speech Recognition Decoding (see Menezes, [00029], a speech recognition module 224 on the assistive hearing device 200 converts the received audio 206 to text 228).
Regarding claim 9, Lei in view of Qian further in view of Menezes teach the electronic device of claim 7. Qian further teaches wherein the enhanced features are enhanced Mel-scale cepstral coefficients (see Qian pg. 3830, col2, lines 38-39, the output features are acoustic features like spectral envelope and fundamental frequency; this is interpreted as Mel scale coefficients).
	Regarding claim 10, Lei in view of Qian further in view of Menezes teach the electronic device of claim 1.  Qian further teaches wherein performing vocoding comprises resynthesizing the enhanced audio signal from the enhanced features (see Qian, Fig. 1 vocoder).
	Regarding claim 11, Lei in view of Qian further in view of Menezes teach the electronic device of claim 1. Qian further teaches wherein performing forced alignment comprises performing an inverse Speech Recognition Decoding (see Qian pg. 3830, col2, lines 39-41, input features and output features are time-aligned frame-by-frame by well-trained HMM models).
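For illustration only, the frame-by-frame time alignment cited above can be sketched as a minimal Viterbi forced alignment in Python. The sketch assumes one state per acoustic unit, no transition probabilities, and a hypothetical function name `forced_align`; it is not the procedure of Lei or Qian, only a simplified instance of aligning frames to a known unit sequence.

```python
import numpy as np

def forced_align(log_probs, unit_seq):
    """Align T frames to a known left-to-right sequence of acoustic units.

    log_probs: array of shape (T, U), per-frame log-likelihood of each unit
               (assumed layout for this sketch).
    unit_seq:  list of unit indices in transcript order.
    Returns a list of length T giving the unit assigned to each frame.
    """
    T, N = log_probs.shape[0], len(unit_seq)
    if T < N:
        raise ValueError("need at least one frame per unit")
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    score[0, 0] = log_probs[0, unit_seq[0]]
    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1, j]
            move = score[t - 1, j - 1] if j > 0 else -np.inf
            # Either remain in the same unit or advance to the next one.
            if move > stay:
                best, back[t, j] = move, j - 1
            else:
                best, back[t, j] = stay, j
            score[t, j] = best + log_probs[t, unit_seq[j]]
    # Trace back from the last unit at the last frame.
    j = N - 1
    path = [j]
    for t in range(T - 1, 0, -1):
        j = back[t, j]
        path.append(j)
    path.reverse()
    return [unit_seq[j] for j in path]
```

Given frame log-likelihoods and the unit sequence derived from a transcript, the function returns one unit label per frame, which corresponds to the notion of a sequence of acoustic unit observations obtained by alignment.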
	Regarding claim 12, Lei in view of Qian further in view of Menezes teach the electronic device of claim 2. Menezes further teaches wherein the enhanced audio signal is an enhanced version of the audio input signal (see Menezes, [00030] the transcript 228 is input to a text-to-speech converter 230 (e.g., a voice synthesizer). The text-to-speech converter 230 then converts the transcript (text) 228 to enhanced speech signals 232).

Regarding claim 13, Lei in view of Qian further in view of Menezes teach the electronic device of claim 7. Qian further teaches wherein differences between the ideal sequence of acoustic unit observations and the predicted sequence of acoustic unit observations are determined by computing a gradient of a loss function (see Qian, pg. 3830, section 2.1, given the training set, the cost function to be minimized is defined by equation 5, $C = \frac{1}{2T} \sum_{t=1}^{T} \| f(x_t) - y_t \|_2^2$. Equation 5 teaches that the DNN is trained by optimizing a cost function which measures the discrepancy between target vectors and the predicted output with a Back-Propagation (BP) procedure; the DNN is trained by using batch gradient descent. It is optimized by a “mini-batch” based stochastic gradient descent algorithm, where ε is a preset learning rate; the cost function is interpreted as the loss function).
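As an illustration only, a squared-error cost of the kind cited as equation 5, together with its gradient with respect to the predicted outputs, can be sketched in Python with NumPy. The function name `mse_cost_and_grad` and the (T, D) array layout are assumptions made for this sketch and do not appear in Qian.

```python
import numpy as np

def mse_cost_and_grad(pred, target):
    """Return C = 1/(2T) * sum_t ||pred_t - target_t||^2 and dC/dpred.

    pred, target: arrays of shape (T, D), one row per frame (assumed layout).
    """
    T = pred.shape[0]
    diff = pred - target                   # discrepancy per frame
    cost = np.sum(diff ** 2) / (2.0 * T)   # scalar cost over all frames
    grad = diff / T                        # derivative of the cost w.r.t. pred
    return cost, grad
```

The gradient is simply the scaled difference between predicted and target vectors, which is why computing the gradient of such a cost is one way of determining the differences between two sequences.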
Regarding claim 15, Lei in view of Qian further in view of Menezes teach the electronic device of claim 13. Qian further teaches wherein performing gradient descent comprises multiplying the gradient of the loss function with a predefined multiplication factor (see Qian, pg. 3830, col. 1, lines 19-22, and equation 6, which teaches that the DNN is trained by using batch gradient descent. It is optimized by a “mini-batch” based stochastic gradient descent algorithm,
[Equation: Qian, equation 6; image not reproduced]
where ε is a preset learning rate).
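For illustration, a gradient descent step that multiplies the gradient by a preset learning rate ε can be sketched as follows. The sketch uses the standard update θ ← θ − ε·∇C applied to a small linear least-squares problem with mini-batches; this is an assumption for illustration and not a reproduction of Qian's equation 6, and the names `sgd_step`, `train_linear` and `lr` are hypothetical.

```python
import numpy as np

def sgd_step(params, grad, lr):
    """One update: scale the gradient by the learning rate and subtract it."""
    return params - lr * grad

def train_linear(x, y, lr=0.1, batch_size=2, epochs=500, seed=0):
    """Fit y ~ x @ w with mini-batch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    w = np.zeros(x.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(x))          # reshuffle every epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            err = x[idx] @ w - y[idx]            # prediction error on the batch
            grad = x[idx].T @ err / len(idx)     # gradient of the squared-error cost
            w = sgd_step(w, grad, lr)
    return w
```

The learning rate here plays the role of the predefined multiplication factor: it is fixed before training and scales every gradient before the parameters are changed.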
Regarding claim 17, Lei in view of Qian further in view of Menezes teach the electronic device of claim 1. Lei further teaches wherein the neural network is a Deep Neural Network (see Lei, pg. 1696, sect. 3, In this work, we propose to replace the UBM-GMM by a deep neural network (DNN) trained for ASR).
Claim 22 is directed to a method claim corresponding to the system claim presented in claim 1 and is rejected under the same grounds stated above regarding claim 1.
Claim 23 is directed to a non-transitory computer readable medium claim corresponding to the system claim presented in claim 1 and is rejected under the same grounds stated above regarding claim 1.
Claims 3 and 4 are rejected under 35 U.S.C. 103 as being unpatentable over Lei, Y., Scheffer, N., Ferrer, L., & McLaren, M. (2014, May). A novel scheme for speaker recognition using a phonetically-aware deep neural network. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 1695-1699), in view of Y. Qian, Y. Fan, W. Hu and F. K. Soong, "On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3829-3833, further in view of Menezes (WO Publication 2017142775), further in view of Pollet (U.S. Patent Application Publication 2018/0096677)
Regarding claim 3, Lei in view of Qian further in view of Menezes teach the electronic device of claim 1. Qian further teaches wherein the transcript is obtained by direct transcription of a transcript provided from a user (see Qian, Fig. 1, TEXT). Pollet further teaches that the transcript is obtained by direct transcription of a transcript provided from a user (see Pollet, [0054], which teaches how the text input may be obtained directly from the user, which is considered as the direct transcription from the user).
Lei, Qian, Menezes and Pollet are considered to be analogous to the claimed invention because they are in the same field of speech processing and speech synthesis.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the transcript generation as taught by Lei, Qian and Menezes with the transcript provided from a user as taught by Pollet to improve the quality of the synthesized speech (see Pollet [0030]).
Regarding claim 4, Lei in view of Qian further in view of Menezes further in view of Pollet teach the electronic device of claim 3. Pollet further teaches wherein the transcript is a sequence of written words provided by the user in a computer readable format (see Pollet [0054] teaches the user may enter the text using a keyboard or an application and the text input may include words, phrases, sentences).
Claims 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Lei, Y., Scheffer, N., Ferrer, L., & McLaren, M. (2014, May). A novel scheme for speaker recognition using a phonetically-aware deep neural network. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 1695-1699), in view of Y. Qian, Y. Fan, W. Hu and F. K. Soong, "On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3829-3833, further in view of Menezes (WO Publication 2017142775), further in view of Le, D., Licata, K., & Provost, E. M. (2017) “Automatic Paraphasia Detection from Aphasic Speech: A Preliminary Study” In Interspeech (pp. 294-298), hereinafter referred to as Le (2017).
Regarding claim 18, Lei in view of Qian further in view of Menezes teach the electronic device of claim 17. However, Lei in view of Qian further in view of Menezes fails to teach wherein the circuitry is further configured to perform Speech Recognition Decoding on the first sequence of acoustic unit observations based on a language model to obtain the transcript of the audio input signal.
However, Le (2017) further teaches wherein the circuitry is further configured to perform Speech Recognition Decoding on the first sequence of acoustic unit observations based on a language model to obtain the transcript of the audio input signal (see Le (2017), pg. 296, sect. 5.3 Automatic transcription of test utterances can be performed by combining our DBLSTM-RNN acoustic model with a language model (LM) for decoding).
Lei, Qian, Menezes and Le (2017) are considered to be analogous to the claimed invention because they relate to automated speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the ASR DNN modeling and parametric synthesis teachings of Lei, Qian and Menezes with the acoustic modeling teachings of Le (2017) to improve the automation of generating the transcript (see Le (2017), pg. 1, section 1, col 2).
Regarding claim 19, Lei in view of Qian further in view of Menezes further in view of Le (2017) teach the electronic device of claim 18. Le (2017) further teaches wherein performing Speech Recognition Decoding comprises performing maximum likelihood wherein performing maximum likelihood includes performing optimization by combining acoustic likelihoods and language model probabilities (see Le (2017), pg. 296, sect. 5.2 GOP involves calculating the difference between the average acoustic log-likelihood of a force-aligned word-level segment and that of an unconstrained phone loop and sect 5.3 teaches automatic transcription of test utterances can be performed by combining our DBLSTM-RNN acoustic model with a language model (LM) for decoding).
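The combination of acoustic likelihoods and language model probabilities discussed for claim 19 is commonly expressed as choosing the word sequence W that maximizes log p(X|W) + λ·log P(W). The sketch below illustrates that scoring rule over a short list of candidate transcripts; the function name `best_hypothesis`, the `lm_weight` parameter and the example scores are hypothetical and are not taken from Le (2017).

```python
def best_hypothesis(candidates, lm_weight=1.0):
    """Pick the transcript with the highest combined log score.

    candidates: iterable of (text, acoustic_logprob, lm_logprob) tuples.
    lm_weight:  scaling applied to the language model log-probability.
    """
    return max(candidates, key=lambda c: c[1] + lm_weight * c[2])[0]

# Hypothetical scores: the second candidate fits the audio slightly better,
# but the language model strongly prefers the first.
hyps = [
    ("recognize speech", -10.0, -2.0),
    ("wreck a nice beach", -9.5, -6.0),
]
print(best_hypothesis(hyps))
```

With the language model weight set to zero the choice would depend on the acoustic score alone, which shows how the language model term changes the decoded transcript.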
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Li (US Patent Application Publication 2018/0197534) teaches generating a single combined channel of audio data using neural networks and providing a transcription for the utterance (see Li, [0011]).
Meng, Z., Li, J., & Gong, Y. (2018). Adversarial feature-mapping for speech enhancement. arXiv preprint arXiv:1809.02251 teaches the feature-mapping approach with adversarial learning to further diminish the discrepancy between the distributions of the clean features and the enhanced features generated by the feature-mapping network given nonstationary and auto-correlated noise at the input.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NANDINI SUBRAMANI whose telephone number is (571)272-3916. The examiner can normally be reached Monday - Friday, 12:00 pm - 5:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh M Mehta, can be reached on (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/NANDINI SUBRAMANI/Examiner, Art Unit 2656                                                                                                                                                                                                        

/BHAVESH M MEHTA/Supervisory Patent Examiner, Art Unit 2656