DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 5/5/2022  has been entered.
Response to Amendment
The response filed on 5/5/2022 has been correspondingly accepted and considered in this Office Action. Claims 1-4, 9-13 and 15-23 have been examined. Claims 5-8, 14 and 24-25 have been cancelled. Applicant’s amendments to claims 1, 9, 11, 13, 15, 17, 20, 22 and 23 have been noted.
Response to Arguments
Applicant's arguments filed 5/5/2022  have been fully considered as follows:
Applicant’s arguments with respect to claim 1 state that
“Qian fails to suggest, much less disclose, performing gradient descent on extracted features and differences between an ideal sequence obtained from forced alignment of the extracted features and a predicted sequence obtained by propagating the extracted features through a neural network, as now clarified in the independent claims....”
	
The examiner respectfully disagrees. Qian teaches “The weights of DNN are trained by using pairs of input and output features extracted from training data to minimize the errors between the mapped output from a given input and the target output. In synthesis, the input text is converted first into input feature vector through the text analysis, then input feature vectors are mapped to output vectors by a trained DNN using forward propagation. Given the training set, the cost function to be minimized is defined by equation 5, C = \frac{1}{2T} \sum_{t=1}^{T} \lVert f(x_t) - y_t \rVert_2^2.” Per equation 5, the DNN is trained by optimizing a cost function which measures the discrepancy between target vectors and the predicted output with a Back-Propagation (BP) procedure, and “the DNN is trained by using batch gradient descent. It is optimized by a “mini-batch” based stochastic gradient descent algorithm, where ε is a preset learning rate” (Qian, pg. 3830, sections 2.1 and 2.3). The DNN is thus trained with gradient descent methods to minimize the errors between the mapped output and the target output. Therefore, Qian teaches perform gradient descent based on the extracted features and differences between the ideal sequence of acoustic unit observations and the predicted sequence of acoustic unit observations to generate enhanced features from the extracted features from the transcript of the audio input signal. Accordingly, the rejections of claims 1, 22 and 23 under 35 U.S.C. 103 are sustained and further updated.
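For illustration only, the cost function cited from Qian's equation 5 can be evaluated as in the following minimal sketch. It assumes the cost is the squared error between predicted and target vectors, averaged over T frames and halved. The function name `mse_cost` and the toy vectors are hypothetical and are not taken from Qian or from the application.

```python
def mse_cost(predicted, target):
    """C = 1/(2T) * sum_t ||f(x_t) - y_t||_2^2 over T frames.

    predicted, target: lists of T equal-length feature vectors.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("need the same non-zero number of frames")
    total = 0.0
    for f_t, y_t in zip(predicted, target):
        # Squared Euclidean distance between prediction and target
        total += sum((f - y) ** 2 for f, y in zip(f_t, y_t))
    return total / (2 * len(predicted))


# Toy example: two frames of two-dimensional features (made-up numbers)
predicted = [[1.0, 2.0], [0.0, 0.0]]
target = [[1.0, 0.0], [0.0, 2.0]]
cost = mse_cost(predicted, target)
```

The cost is zero only when every predicted vector equals its target, which is the discrepancy that the cited training procedure minimizes.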
With respect to the rejections of the remaining dependent claims under 35 U.S.C. 103, to the extent those claims are argued for at least the same rationale presented in the Remarks filed 5/5/2022, the Examiner respectfully notes as follows. Should those claims be traversed for reasons similar to those given for independent claims 1, 22 and 23, the Examiner respectfully directs Applicant to the reasons provided above in the response directed to claims 1, 22 and 23. For at least those same reasons, the Examiner likewise respectfully disagrees; Applicant's arguments have been fully considered but are not persuasive.
Claim Rejections - 35 USC § 103
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1, 2, 9-13, 15, 20-23 are rejected under 35 U.S.C. 103 as being unpatentable over Le, D., Licata, K., & Provost, E. M. (2018) “Automatic quantitative analysis of spontaneous aphasic speech”, Speech Communication, 100, 1-12, in view of Y. Qian, Y. Fan, W. Hu and F. K. Soong, "On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3829-3833 further in view of Menezes (WO Publication 2017142775).
Regarding claim 1, Le teaches obtain a transcript of an audio input signal including extracted features (see Le, pg. 4, sect. 4, which teaches that the first step of spontaneous aphasic speech analysis is to obtain a detailed transcript for each utterance, including precise alignments of words and phones; the detailed transcript is interpreted as a transcript including extracted features), perform forced alignment on the extracted features to obtain an ideal sequence of acoustic unit observations (see Le, pg. 4, sect. 4.1.3, We obtain senone and monophone labels for each frame through forced alignment using a bootstrap context-dependent tied-state triphone HMM-GMM system trained with Maximum Likelihood), generate a predicted sequence of acoustic unit observations by propagating the extracted features through a neural network (see Le, pg. 4, sect. 4.1, Our work on automatic paraphasia detection made use of a deep multi-task BLSTM-RNN architecture trained on MFBs that predicts the senone and monophone labels simultaneously; the senone/monophone labels are interpreted as the predicted sequence of acoustic unit observations). However, Le fails to teach perform gradient descent based on the extracted features and differences between the ideal sequence of acoustic unit observations and the predicted sequence of acoustic unit observations to generate enhanced features from the extracted features from the transcript of the audio input signal, and perform vocoding based on the enhanced features to produce an enhanced audio signal.
However, Qian teaches perform gradient descent based on the extracted features and differences between the ideal sequence of acoustic unit observations and the predicted sequence of acoustic unit observations to generate enhanced features from the extracted features from the transcript of the audio input signal (see Qian, pg. 3830, sect. 2.3, The weights of DNN are trained by using pairs of input and output features extracted from training data to minimize the errors between the mapped output from a given input and the target output. In synthesis, the input text is converted first into input feature vector through the text analysis, then input feature vectors are mapped to output vectors by a trained DNN using forward propagation; see also Qian, pg. 3830, sect. 2.1 and equation 5, which teach that the DNN is trained by optimizing a cost function which measures the discrepancy between target vectors and the predicted output with a Back-Propagation (BP) procedure, the DNN is trained by using batch gradient descent. It is optimized by a “mini-batch” based stochastic gradient descent algorithm), and perform vocoding based on the enhanced features to produce an enhanced audio signal (see Qian, Fig. 1 and pg. 3831, col. 1, lines 2-11, which teach that in synthesis, the input text is converted first into input feature vector through the text analysis, then input feature vectors are mapped to output vectors by a trained DNN using forward propagation. By setting the predicted output features from the DNN as mean vectors and pre-computed (global) variances of output features from all training data, the speech feature generation module can generate smooth trajectories of speech parameter features which satisfy the statistics of static and dynamic features. Finally, speech waveform is synthesized with the generated speech parameters).
Le and Qian are both considered to be analogous to the claimed invention because both relate to automated speech processing using DNN. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Le on the acoustic modeling teachings for transcript generation with the parametric synthesis teachings of Qian to improve the performance of vocoder based speech synthesis using DNNs (see Qian, pg. 3829 col. 2 lines 9-12).
However, Le and Qian do not teach an electronic device comprising circuitry.
However, Menezes teaches an electronic device, comprising circuitry configured to obtain a transcript of an audio input signal including extracted features (see Menezes, [00028-0029] As shown in FIG. 2, this assistive hearing device 200 has an assistive hearing module 202 that is implemented on a computing device 800 such as is described in greater detail with respect to FIG. 8. A speech recognition module 224 on the assistive hearing device 200 converts the received audio 206 to text 228. For example, in some implementations the speech recognition module 224 extracts features from the speech in the audio 206 signals and uses speech models to determine what is being said in order to transcribe the speech to text and thereby generate a transcript 228 of the speech).
Le, Qian and Menezes are considered to be analogous to the claimed invention because they relate to automated speech processing and speech synthesis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Le and Qian on the acoustic modeling teachings for transcript generation and the parametric synthesis teachings with the enhanced speech techniques of Menezes to improve the audibility of speech (see Menezes, [0003]).
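For illustration only, the forced alignment step relied upon in the rejection of claim 1 can be pictured with the following minimal sketch. It assumes per-frame log-likelihoods are already available for each acoustic unit and that the unit sequence is known from the transcript. It omits the HMM transition probabilities and state topology of an HMM-GMM system such as the one described in Le. The function name `forced_align` and the toy scores are hypothetical.

```python
def forced_align(loglik, units):
    """Align T frames to a known unit sequence, left to right.

    loglik: list of T dicts mapping unit label -> log-likelihood of the frame.
    units:  the known label sequence from the transcript (length N <= T).
    Returns a list of T labels; each unit covers at least one frame.
    """
    T, N = len(loglik), len(units)
    if N == 0 or N > T:
        raise ValueError("need 1 <= len(units) <= number of frames")
    NEG = float("-inf")
    # best[t][n]: best total score with frame t assigned to unit position n
    best = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    best[0][0] = loglik[0][units[0]]
    for t in range(1, T):
        for n in range(N):
            stay = best[t - 1][n]
            move = best[t - 1][n - 1] if n > 0 else NEG
            if move > stay:
                prev, score = n - 1, move
            else:
                prev, score = n, stay
            if score > NEG:
                best[t][n] = score + loglik[t][units[n]]
                back[t][n] = prev
    # Trace back from the last unit at the last frame
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return [units[n] for n in path]


# Made-up log-likelihoods for four frames and the unit sequence "a", "b"
loglik = [
    {"a": -0.1, "b": -3.0},
    {"a": -0.2, "b": -2.0},
    {"a": -2.5, "b": -0.1},
    {"a": -3.0, "b": -0.2},
]
alignment = forced_align(loglik, ["a", "b"])
```

The returned per-frame labels play the role of the target sequence against which a network's per-frame predictions can be compared.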
Regarding claim 2, Le, Qian and Menezes teach the electronic device of claim 1. Menezes further teaches wherein the circuitry is further configured to obtain the transcript of the audio input signal by Speech Recognition Decoding (see Menezes, [00029], a speech recognition module 224 on the assistive hearing device 200 converts the received audio 206 to text 228).
Regarding claim 9, Le, Qian and Menezes teach the electronic device of claim 7. Qian further teaches wherein the enhanced features are enhanced Mel-scale cepstral coefficients (see Qian pg. 3830, col2, lines 38-39, the output features are acoustic features like spectral envelope and fundamental frequency; this is interpreted as Mel scale coefficients).
	Regarding claim 10, Le, Qian and Menezes teach the electronic device of claim 1.  Qian further teaches wherein performing vocoding comprises resynthesizing the enhanced audio signal from the enhanced features (see Qian, Fig. 1 vocoder).
	Regarding claim 11, Le, Qian and Menezes teach the electronic device of claim 1. Qian further teaches wherein performing forced alignment comprises performing an inverse Speech Recognition Decoding (see Qian pg. 3830, col2, lines 39-41, input features and output features are time-aligned frame-by-frame by well-trained HMM models).
	Regarding claim 12, Le, Qian and Menezes teach the electronic device of claim 2. Menezes further teaches wherein the enhanced audio signal is an enhanced version of the audio input signal (see Menezes, [00030] the transcript 228 is input to a text-to-speech converter 230 (e.g., a voice synthesizer). The text-to-speech converter 230 then converts the transcript (text) 228 to enhanced speech signals 232).

Regarding claim 13, Le, Qian and Menezes teach the electronic device of claim 7. Qian further teaches wherein differences between the ideal sequence of acoustic unit observations and the predicted sequence of acoustic unit observations are determined by computing a gradient of a loss function (see Qian, pg. 3830, section 2.1, given the training set, the cost function to be minimized is defined by equation 5, C = \frac{1}{2T} \sum_{t=1}^{T} \lVert f(x_t) - y_t \rVert_2^2. Equation 5 teaches that the DNN is trained by optimizing a cost function which measures the discrepancy between target vectors and the predicted output with a Back-Propagation (BP) procedure; the DNN is trained by using batch gradient descent. It is optimized by a “mini-batch” based stochastic gradient descent algorithm, where ε is a preset learning rate; the cost function is interpreted as the loss function).
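As a worked illustration that is not taken from Qian, and assuming the cost of equation 5 is the squared error averaged over T frames, the gradient of the cost with respect to the predicted output at frame t is proportional to the difference between the prediction and the target:

```latex
C = \frac{1}{2T} \sum_{t=1}^{T} \lVert f(x_t) - y_t \rVert_2^2
\quad\Longrightarrow\quad
\frac{\partial C}{\partial f(x_t)} = \frac{1}{T}\,\bigl(f(x_t) - y_t\bigr)
```

Under this assumption, computing the gradient of the loss at each frame amounts to computing a scaled difference between the predicted and target vectors.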
Regarding claim 15, Le, Qian and Menezes teach the electronic device of claim 13. Qian further teaches wherein performing gradient descent comprises multiplying the gradient of the loss function with a predefined multiplication factor (see Qian, pg. 3830, col. 1, lines 19-22, and equation 6, which teach that the DNN is trained by using batch gradient descent. It is optimized by a “mini-batch” based stochastic gradient descent algorithm [images: media_image3.png, media_image1.png], where ε is a preset learning rate).
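For illustration only, a mini-batch stochastic gradient descent update with a preset learning rate can be sketched as follows. The sketch assumes the standard update in which each parameter is decreased by the learning rate times the gradient of the mini-batch cost. Because equation 6 of Qian is present here only as an image, this form is an assumption and not a reproduction of that equation. The scalar model f(x) = w * x + b, the function name `sgd_fit` and the toy data are hypothetical.

```python
import random


def sgd_fit(xs, ys, epsilon=0.05, batch_size=2, epochs=500, seed=0):
    """Fit f(x) = w * x + b by mini-batch stochastic gradient descent.

    Each step multiplies the gradient of the mini-batch cost
    C = 1/(2T) * sum_t (f(x_t) - y_t)^2 by the preset learning rate epsilon.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            T = len(batch)
            # Gradients of C with respect to w and b on this mini-batch
            grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / T
            grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / T
            # Update: parameter minus (learning rate times gradient)
            w -= epsilon * grad_w
            b -= epsilon * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (made-up numbers)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = sgd_fit(xs, ys)
```

In this sketch the learning rate `epsilon` is the fixed factor by which the gradient is multiplied before it is applied to the parameters.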
Regarding claim 22, the claim is directed to a method corresponding to the system claim presented in claim 1 and is rejected under the same grounds stated above regarding claim 1.
Regarding claim 23, the claim is directed to a non-transitory computer readable medium corresponding to the system claim presented in claim 1 and is rejected under the same grounds stated above regarding claim 1.
Claims 3 and 4 are rejected under 35 U.S.C. 103 as being unpatentable over Le, D., Licata, K., & Provost, E. M. (2018) “Automatic quantitative analysis of spontaneous aphasic speech”, Speech Communication, 100, 1-12, in view of Y. Qian, Y. Fan, W. Hu and F. K. Soong, "On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3829-3833, further in view of Menezes (WO Publication 2017142775), further in view of Pollet (U.S. Patent Application Publication 2018/0096677)
Regarding claim 3, Le, Qian and Menezes teach the electronic device of claim 1. Qian further teaches wherein the transcript is obtained by direct transcription of a transcript provided from a user (see Qian, Fig. 1, TEXT). In addition, Pollet teaches that the transcript is obtained by direct transcription of a transcript provided from a user (see Pollet, [0054], which teaches how the text input may be obtained directly from the user; this is considered the direct transcription from the user).
Le, Qian, Menezes and Pollet are considered to be analogous to the claimed invention because they are in the same field of speech processing and speech synthesis.  Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the transcript generation as taught by Le, Qian and Menezes with the transcript provided from a user as taught by Pollet to improve the quality of the synthesized speech (see Pollet [0030]).
Regarding claim 4, Le, Qian and Menezes teach the electronic device of claim 3, however Le, Qian and Menezes fail to teach wherein the transcript is a sequence of written words provided by the user in a computer readable format.  However, Pollet teaches wherein the transcript is a sequence of written words provided by the user in a computer readable format (see Pollet [0054] teaches the user may enter the text using a keyboard or an application and the text input may include words, phrases, sentences).
Claims  16-19 are rejected under 35 U.S.C. 103 as being unpatentable over Le, D., Licata, K., & Provost, E. M. (2018) “Automatic quantitative analysis of spontaneous aphasic speech”, Speech Communication, 100, 1-12, in view of Y. Qian, Y. Fan, W. Hu and F. K. Soong, "On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 3829-3833, further in view of Menezes (WO Publication 2017142775), further in view of Le, D., Licata, K., & Provost, E. M. (2017) “Automatic Paraphasia Detection from Aphasic Speech: A Preliminary Study” In Interspeech (pp. 294-298) referred as Le (2017).
Regarding claim 16, Le, Qian and Menezes teach the electronic device of claim 2. Le, Qian and Menezes fail to teach wherein the circuitry is further configured to perform feature extraction on the audio input signal to obtain coefficients. However, Le (2017) teaches wherein the circuitry is further configured to perform feature extraction on the audio input signal to obtain coefficients (see Le (2017), pg. 296, sect. 5.1, We utilize a multi-task deep bidirectional long-short term memory recurrent neural network (DBLSTM-RNN) to predict both the correct senone and monophone labels for each frame.  Input Features: we use Kaldi [29] to extract 40- dimensional log Mel filterbank coefficients).
Le, Qian, Menezes and Le (2017) are considered to be analogous to the claimed invention because they relate to automated speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Le, Qian and Menezes on the acoustic modeling teachings for transcript generation and parametric synthesis teachings with the acoustic modeling teachings of Le (2017) to improve the automation of generating the transcript (see Le (2017), pg. 1, section 1, col 2).
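For illustration only, the extraction of log Mel filterbank coefficients discussed for claim 16 can be sketched as follows. The sketch assumes a power spectrum has already been computed for one frame and uses the common Mel formula 2595 * log10(1 + f / 700). The 40 filters match the dimensionality cited from Le (2017), while the 16 kHz sample rate, the 257 spectrum bins and all function names are hypothetical. It is not a reproduction of the Kaldi implementation.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters spaced evenly on the Mel scale.

    num_bins is the number of power-spectrum bins covering 0..sample_rate/2.
    Returns num_filters lists of num_bins weights.
    """
    top = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge frequencies, evenly spaced in Mel
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bin_hz = [(sample_rate / 2.0) * k / (num_bins - 1)
              for k in range(num_bins)]
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def log_mel(power_spectrum, bank, floor=1e-10):
    """Log filterbank energies for one frame's power spectrum."""
    return [
        math.log(max(floor, sum(w * p for w, p in zip(row, power_spectrum))))
        for row in bank
    ]


# One frame with a flat (made-up) power spectrum of 257 bins at 16 kHz
bank = mel_filterbank(num_filters=40, num_bins=257, sample_rate=16000)
coeffs = log_mel([1.0] * 257, bank)
```

Each frame of audio is thereby reduced to one coefficient per filter, which is the kind of per-frame input feature a neural acoustic model consumes.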
Regarding claim 17, Le, Qian, Menezes and Le (2017) teach the electronic device of claim 16. Le (2017) further teaches wherein the circuitry is further configured to perform Deep Neural Network processing based on the coefficients to obtain a sequence of acoustic unit observations (see Le (2017), pg. 296, sect. 5.2, which teaches the ID acoustic model obtained from the previous step can be used to detect word and phone boundaries via forced alignment with the target transcripts. In addition, the phoneme posteriorgrams produced by the model provide a compact representation of word and phone segments. Given this information, our objective is to extract features for each word that can help separate phonemic/neologistic paraphasias from correct words).
Regarding claim 18, Le, Qian, Menezes and Le (2017) teach the electronic device of claim 17. Le (2017) further teaches wherein the circuitry is further configured to perform Speech Recognition Decoding on the sequence of acoustic unit observations based on a language model to obtain the transcript of the audio input signal (see Le (2017), pg. 296, sect. 5.3 Automatic transcription of test utterances can be performed by combining our DBLSTM-RNN acoustic model with a language model (LM) for decoding).
Regarding claim 19, Le, Qian, Menezes and Le (2017) teach the electronic device of claim 18. Le (2017) further teaches wherein performing Speech Recognition Decoding comprises performing maximum likelihood wherein performing maximum likelihood includes performing optimization by combining acoustic likelihoods and language model probabilities (see Le (2017), pg. 296, sect. 5.2 GOP involves calculating the difference between the average acoustic log-likelihood of a force-aligned word-level segment and that of an unconstrained phone loop and sect 5.3 teaches automatic transcription of test utterances can be performed by combining our DBLSTM-RNN acoustic model with a language model (LM) for decoding).
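For illustration only, decoding that combines acoustic likelihoods with language model probabilities can be sketched as follows. The sketch assumes a small list of candidate transcriptions, each with an acoustic log-likelihood and a language model probability, and selects the candidate with the highest combined log score. The function name `decode`, the `lm_weight` parameter and all scores are hypothetical and are not taken from Le (2017).

```python
import math


def decode(hypotheses, lm_weight=1.0):
    """Return the word sequence with the highest combined log score.

    hypotheses: list of (words, acoustic_loglik, lm_prob) tuples, where
    acoustic_loglik is log p(audio | words) and lm_prob is P(words).
    """
    def score(h):
        _, acoustic_loglik, lm_prob = h
        return acoustic_loglik + lm_weight * math.log(lm_prob)

    return max(hypotheses, key=score)[0]


# Made-up scores for two competing transcriptions of the same audio
hypotheses = [
    ("recognize speech", -12.0, 0.010),
    ("wreck a nice beach", -11.5, 0.0001),
]
best = decode(hypotheses)
```

With these made-up numbers the second candidate has the slightly better acoustic score, but the language model term makes the first candidate the overall maximum.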
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Li, (US Patent Application Publication 2018/0197534) teaches generating a single combined channel of audio data using neural networks and providing a transcription for the utterance (see Li [0011]).
Meng, Z., Li, J., & Gong, Y. (2018). Adversarial feature-mapping for speech enhancement. arXiv preprint arXiv:1809.02251 teaches the feature-mapping approach with adversarial learning to further diminish the discrepancy between the distributions of the clean features and the enhanced features generated by the feature-mapping network given nonstationary and auto-correlated noise at the input.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NANDINI SUBRAMANI whose telephone number is (571)272-3916. The examiner can normally be reached Monday - Friday 2:00pm - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh M Mehta can be reached on (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/NANDINI SUBRAMANI/Examiner, Art Unit 2656                                                                                                                                                                                                        
/BHAVESH M MEHTA/Supervisory Patent Examiner, Art Unit 2656