DETAILED ACTION
Introduction
This Office action is in response to Applicant's submission filed on 4/8/2021. Claims 1-16 are pending in the application. As such, Claims 1-16 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification
Applicant is reminded of the proper language and format for an abstract of the disclosure.
The abstract should be in narrative form and generally limited to a single paragraph on a separate sheet within the range of 50 to 150 words in length. The abstract should describe the disclosure sufficiently to assist readers in deciding whether there is a need for consulting the full patent text for details.
The language should be clear and concise and should not repeat information given in the title. It should avoid using phrases which can be implied, such as, “The disclosure concerns,” “The disclosure defined by this invention,” “The disclosure describes,” etc.  In addition, the form and legal phraseology often used in patent claims, such as “means” and “said,” should be avoided.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-6, 8-14, and 16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Ward et al. (US 10210860 B1) (further referred to as “Ward”).

Regarding Claim 1, Ward teaches a method, performed by a computer-based pronunciation analysis system, of detecting phoneme mispronunciation and facilitating phonological feature feedback based on a speech representation of a sampled speech waveform and expected linguistic content, the expected linguistic content including an expected phoneme sequence, the method comprising: mapping, with a first trained universal function approximator, the speech representation to predicted phonological feature and phoneme class probabilities, thereby establishing predicted phonological feature probabilities and predicted phoneme class probabilities (Ward Column 6 Lines 43-56 -  CNN stack 202 receive the representation of the audio input from front-end module 201. CNN stack 202 processes the audio features to determine a first set of features. Specifically, CNN stack 202 generates a number of feature maps corresponding to a number of convolutional filters, where each convolutional filter represents some characteristic or feature of the audio input. This step may be regarded as roughly analogous to determining a phoneme representation of input audio, however CNN stack 202 does not discretize the output to a set number of acoustic representations. The features determined by CNN stack 202 are not limited to a predetermined set of phonemes. Because it is not so limited, CNN stack 202 can encode a wide range of information.);
determining expected phonological feature values based on an automatic phonetic segmentation using the expected phoneme sequence and the predicted phoneme class probabilities (Ward Column 14 Lines 53-61 - One difference between end-to-end speech recognition system 200 and end-to-end phoneme recognition system 800 is the output neural network stack 806. The output neural network stack 806 of end-to-end phoneme recognition system 800 contains phonemes rather than words in a vocabulary. In an embodiment, one output node may be provided in the output layer 806 per phoneme, where the value of each output node is the probability that the audio input corresponds to the associated phoneme.);
and classifying, with a second trained universal function approximator different from the first trained universal function approximator, a combination of the predicted phonological feature probabilities and the expected phonological feature values to thereby detect a mispronunciation present in the sampled speech waveform and facilitate phonological feature feedback associated with the mispronunciation (Ward Column 13 Lines 49-65 -  Alternatively, a single output node may be used for the binary classification of male or female. The output of this example would be to classify spoken word as either male or female. Any number of classifications may be used to classify speech by output neural network stack 706. For multi-class classification, such as semantic topic, emotion or sentiment, speaker identification, speaker age, or speaker stress or strain, a single output node may be provided in output layer 706 for each potential classification, where the value of each output node is the probability that the spoken word or words corresponds to the associated classification. While not illustrated, there may be a customization layer that modifies the output of output neural network stack 706 similar to customization layer 207 discussed in connection with FIG. 2. A customization layer may alter predicted outputs based on some external guidance, similar to customization layer 207.).

Regarding Claim 2, Ward teaches all of the limitations of claim 1. Ward also teaches that the speech representation is a time-varying speech waveform, the method further comprising processing the time-varying speech waveform with a filterbank to generate the speech representation in a form of speech features (Ward Column 5 Lines 59-67 and Column 6 Lines 1-4 - The spectrograms for each frame may then be arranged sequentially, producing a two-dimensional representation of the input audio that reflects the frequency content over time. In this way, the front-end module may generate a visual, two-dimensional representation of the input audio for the following neural networks. In some embodiments, front-end module 201 generates other features of the input audio frames. Examples of feature representations include: log-mel filterbanks, Mel-Frequency Cepstral Coefficients (MFCC), and perceptual linear prediction coefficients, among other similar acoustic feature representations.).

Regarding Claim 3, Ward teaches all of the limitations of claim 2. Ward also teaches that the processing of the time-varying speech waveform includes analyzing it with a mel-scale log filterbank (Ward Column 5 Lines 59-67 and Column 6 Lines 1-4 - The spectrograms for each frame may then be arranged sequentially, producing a two-dimensional representation of the input audio that reflects the frequency content over time. In this way, the front-end module may generate a visual, two-dimensional representation of the input audio for the following neural networks. In some embodiments, front-end module 201 generates other features of the input audio frames. Examples of feature representations include: log-mel filterbanks, Mel-Frequency Cepstral Coefficients (MFCC), and perceptual linear prediction coefficients, among other similar acoustic feature representations.).
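As an illustrative sketch only, the mel-scale log filterbank analysis recited in Claims 2 and 3 can be pictured as follows. The sketch is not taken from Ward or from the application; the function names, the mel formula variant (2595 * log10(1 + f/700)), the number of filters, and the spectrum size are all assumptions chosen for illustration.

```python
import math

def hz_to_mel(hz):
    # One common mel-scale formula (assumed here; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters spaced evenly on the mel scale.

    num_bins is the number of power-spectrum bins (n_fft // 2 + 1).
    Returns num_filters rows, each holding num_bins weights in [0, 1].
    """
    nyquist = sample_rate / 2.0
    low, high = hz_to_mel(0.0), hz_to_mel(nyquist)
    # num_filters + 2 edge points, evenly spaced in mel.
    mel_points = [low + i * (high - low) / (num_filters + 1)
                  for i in range(num_filters + 2)]
    # Edge points expressed as (fractional) spectrum bin positions.
    bin_points = [mel_to_hz(m) / nyquist * (num_bins - 1) for m in mel_points]
    filters = []
    for f in range(num_filters):
        left, center, right = bin_points[f], bin_points[f + 1], bin_points[f + 2]
        row = []
        for k in range(num_bins):
            if left < k <= center:
                row.append((k - left) / (center - left))    # rising slope
            elif center < k < right:
                row.append((right - k) / (right - center))  # falling slope
            else:
                row.append(0.0)
        filters.append(row)
    return filters

def log_mel_features(power_spectrum, filters, floor=1e-10):
    # Weighted band energies, floored to avoid log(0), then log-compressed.
    return [math.log(max(sum(w * p for w, p in zip(row, power_spectrum)), floor))
            for row in filters]

# Hypothetical usage: one frame with a flat power spectrum.
filters = mel_filterbank(num_filters=10, num_bins=129, sample_rate=16000)
features = log_mel_features([1.0] * 129, filters)
```

Each frame of the waveform would yield one such feature vector; stacking the vectors over time gives a two-dimensional representation of the kind the cited passage describes.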

Regarding Claim 4, Ward teaches all of the limitations of claim 1. Ward also teaches that the speech representation includes multiple frames and the predicted phonological feature probabilities include, for each frame, a set of probability values for each predicted phonological feature (Ward Column 12 Lines 2-11 - For each frame of input audio data, output stack 206 produces a probability distribution over its output nodes for a word transcription or a null output. For each spoken word in the input audio, one frame of the output sequence will be desired to have a high probability prediction for a word of the vocabulary. All other frames of audio data that correspond to the word will be desired to contain the null or blank output. The alignment of a word prediction with the audio of the word is dependent on the hyperparameters of the various stacks and the data used for training.).

Regarding Claim 5, Ward teaches all of the limitations of claim 1. Ward also teaches that the speech representation includes multiple frames and the predicted phoneme class probabilities include, for each frame, a set of probability values for each predicted phoneme class (Ward Column 12 Lines 2-11 - For each frame of input audio data, output stack 206 produces a probability distribution over its output nodes for a word transcription or a null output. For each spoken word in the input audio, one frame of the output sequence will be desired to have a high probability prediction for a word of the vocabulary. All other frames of audio data that correspond to the word will be desired to contain the null or blank output. The alignment of a word prediction with the audio of the word is dependent on the hyperparameters of the various stacks and the data used for training.).
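For illustration, the per-frame probability values discussed for Claims 4 and 5 can be sketched as a softmax applied independently to each frame's raw network outputs. This is a minimal sketch under stated assumptions: the class labels and the logit values below are hypothetical and do not come from Ward or from the application.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def frame_posteriors(frame_logits):
    """Return one probability distribution over classes for each frame."""
    return [softmax(frame) for frame in frame_logits]

# Hypothetical phoneme classes and raw outputs for two frames.
PHONEME_CLASSES = ["AA", "B", "sil"]
logits = [[2.0, 0.5, -1.0],
          [0.1, 3.0, 0.0]]
posteriors = frame_posteriors(logits)
```

Each row of `posteriors` sums to one, so every frame carries a full set of probability values across the classes.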

Regarding Claim 6, Ward teaches all of the limitations of claim 1. Ward also teaches that the determining comprises generating the automatic phonetic segmentation by temporally locating each phoneme of the expected phoneme sequence based on the predicted phoneme class probabilities (Ward Column 13 Lines 66-67 and Column 14 Lines 1-11 - FIG. 8 illustrates an end-to-end phoneme recognition system 800 according to an embodiment. The example end-to-end phoneme recognition system 800 illustrated in FIG. 8 is configured to generate a set of phonemes from audio rather than generate a transcription. For example, end-to-end phoneme recognition system 800 may generate a sequence of phonemes corresponding to spoken words rather than a transcription of the words. A useful application of the end-to-end phoneme recognition system 800 is for addressing the text alignment problem, in other words, aligning an audio file with a set of text that is known to correspond to the audio.).
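As a hedged illustration of what temporally locating each phoneme of an expected sequence can involve, the sketch below uses a simple Viterbi-style dynamic program over per-frame phoneme class probabilities. It is an assumption made for illustration, not the alignment method of Ward (which describes a beam search over candidate alignments) and not the finite state transducer recited in Claim 7. The function name and the toy probabilities are hypothetical.

```python
import math

def forced_align(frame_probs, expected):
    """Monotonic alignment of an expected phoneme sequence to frames.

    frame_probs[t][p] is the probability of phoneme class p at frame t.
    expected is the list of phoneme class indices in expected order.
    Returns (start_frame, end_frame_exclusive) for each expected phoneme.
    """
    T, N = len(frame_probs), len(expected)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per expected phoneme")
    NEG = float("-inf")

    def lp(t, i):
        # Log-probability of the i-th expected phoneme at frame t (floored).
        return math.log(max(frame_probs[t][expected[i]], 1e-12))

    score = [[NEG] * N for _ in range(T)]
    stay = [[True] * N for _ in range(T)]  # True: same phoneme as previous frame
    score[0][0] = lp(0, 0)
    for t in range(1, T):
        for i in range(min(t + 1, N)):
            s_stay = score[t - 1][i]
            s_adv = score[t - 1][i - 1] if i > 0 else NEG
            if s_adv > s_stay:
                score[t][i] = s_adv + lp(t, i)
                stay[t][i] = False
            else:
                score[t][i] = s_stay + lp(t, i)
                stay[t][i] = True

    # Backtrace from the last frame and last phoneme to recover boundaries.
    bounds = [None] * N
    i, end = N - 1, T
    for t in range(T - 1, 0, -1):
        if not stay[t][i]:
            bounds[i] = (t, end)
            end = t
            i -= 1
    bounds[0] = (0, end)
    return bounds

# Hypothetical input: frames 0-1 favor class 0, frames 2-4 favor class 1.
probs = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.1, 0.9]]
segments = forced_align(probs, expected=[0, 1])
```

In this toy input the first expected phoneme is placed on frames 0 to 1 and the second on frames 2 to 4.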

Regarding Claim 8, Ward teaches all of the limitations of claim 1. Ward also teaches that the determining comprises converting the automatic phonetic segmentation to the expected phonological feature values based on a preconfigured model (Ward Column 12 Lines 50-59 - For example, if end-to-end speech recognition system 200 is employed by a particular company, documents from that company may be analyzed to determine relative frequency of words. The output of end-to-end speech recognition system 200 may then be modified by these custom priors to reflect the language usage of the company. In this way, end-to-end speech recognition system 200 may be trained once on a general training dataset and customized for a number of particular use cases while using the same trained model.).

Regarding Claim 9, Ward teaches all of the limitations of claim 1. Ward also teaches that the speech representation includes multiple frames and the method further comprises providing frame-level phonological feature feedback associated with the mispronunciation (Ward Column 22 Lines 33-48 - However, it may also be desirable to train a neural network, such as end-to-end speech recognition system 200, end-to-end speech classification system 700, and end-to-end phoneme recognition system 800, specifically for a custom domain 1220. A custom domain 1220 may differ from the general domain 1210 in numerous aspects, such as frequencies of words, classifications, and phonemes, audio features (such as background noise, accents, and so on), pronunciations, new words that are present in the custom domain 1220 but unseen in the general domain 1210, and other aspects. The statistical distribution of audio examples in general domain 1210 may differ from the distribution in custom domain 1220. It may be desirable to customize the neural network for the custom domain 1220, which can potentially improve performance significantly in the custom domain 1220.).

Regarding Claim 10, Ward teaches all the limitations of claim 1. Ward also teaches that the classifying comprises adjusting sensitivity of mispronunciation detection based on a threshold applied to an output of the second trained universal function approximator (Ward Column 21 Lines 57-67 - As more training is performed, more expert neural network layers are expected to be needed to address the pigeon hole principle. In an embodiment, a counter stores the number of training examples that have been run through the neural network. The counter is incremented with each new training example. A threshold, which may be a threshold value or threshold function, defines the points at which the size of the expert knowledge store increases in size. When the counter of training examples exceeds the threshold, one or more new rows are added to the expert knowledge store.).
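For illustration, a threshold applied to a classifier output, as recited in Claim 10, can be sketched as below. This is a minimal sketch assuming the classifier emits one mispronunciation score in [0, 1] per phoneme; the function name, the scores, and the threshold values are hypothetical and are not drawn from Ward or from the application.

```python
def flag_mispronunciations(scores, threshold=0.5):
    """Return indices of phonemes whose score meets or exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]

# Hypothetical per-phoneme classifier outputs.
scores = [0.10, 0.45, 0.80, 0.55]

# A higher threshold flags fewer phonemes (less sensitive detection);
# a lower threshold flags more (more sensitive detection).
strict = flag_mispronunciations(scores, threshold=0.7)
lenient = flag_mispronunciations(scores, threshold=0.4)
```

With these values, `strict` contains only index 2, while `lenient` contains indices 1, 2, and 3.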

Regarding Claim 11, Ward teaches all the limitations of claim 1. Ward also teaches calculating a confidence score for the mispronunciation (Ward Column 16 Lines 12-28 - In an embodiment, the scoring function for evaluating candidate alignments produces a score based on the number of matching phonemes, that is, the number of audio phonemes and text phonemes that are mapped to each other and are the same phoneme; the number of missed phonemes, meaning the number of audio phonemes or text phonemes that are not mapped to any phoneme in the other set; and the distance from the hint, where the hint is the alignment at the parent iteration of the beam search. In an embodiment, the distance from the hint is evaluated by iterating over the audio phonemes or text phonemes and producing a score for each of the phonemes. The score is higher when the audio phoneme or text phoneme has stayed in the same position or changed position only a little and lower when the audio phoneme or text phoneme has moved to a significantly farther position, where the distance may be measured by, for example, time or number of phoneme positions moved.).

Regarding Claim 12, Ward teaches all the limitations of claim 1. Ward also teaches that the phonological feature feedback comprises a confidence score for a phonological feature error (Ward Column 16 Lines 12-28 - In an embodiment, the scoring function for evaluating candidate alignments produces a score based on the number of matching phonemes, that is, the number of audio phonemes and text phonemes that are mapped to each other and are the same phoneme; the number of missed phonemes, meaning the number of audio phonemes or text phonemes that are not mapped to any phoneme in the other set; and the distance from the hint, where the hint is the alignment at the parent iteration of the beam search. In an embodiment, the distance from the hint is evaluated by iterating over the audio phonemes or text phonemes and producing a score for each of the phonemes. The score is higher when the audio phoneme or text phoneme has stayed in the same position or changed position only a little and lower when the audio phoneme or text phoneme has moved to a significantly farther position, where the distance may be measured by, for example, time or number of phoneme positions moved.).

Regarding Claim 13, Ward teaches all of the limitations of claim 1. Ward also teaches training one or both of the first and second trained universal function approximators (Ward Column 17 Lines 16-26 - Turning to the method of training the neural networks, in some embodiments, all layers and stacks of an end-to-end speech recognition system 200, end-to-end speech classification system 700, or end-to-end phoneme recognition system 800 are jointly trained as a single neural network. For example, end-to-end speech recognition system 200, end-to-end speech classification system 700, or end-to-end phoneme recognition system 800 may be trained as a whole, based on training data that contains audio and an associated ground-truth output, such as a transcription.).

Regarding Claim 14, Ward teaches all the limitations of claim 1. Ward also teaches that the first trained universal function approximator comprises a convolutional neural network (Ward Column 6 Lines 43-56 -  Returning to FIG. 2, CNN stack 202 receive the representation of the audio input from front-end module 201. CNN stack 202 processes the audio features to determine a first set of features. Specifically, CNN stack 202 generates a number of feature maps corresponding to a number of convolutional filters, where each convolutional filter represents some characteristic or feature of the audio input. This step may be regarded as roughly analogous to determining a phoneme representation of input audio, however CNN stack 202 does not discretize the output to a set number of acoustic representations. The features determined by CNN stack 202 are not limited to a predetermined set of phonemes. Because it is not so limited, CNN stack 202 can encode a wide range of information.).

Regarding Claim 16, Ward teaches all the limitations of claim 1. Ward also teaches one or more non-transitory computer-readable storage devices storing instructions thereon that, when executed by one or more processors implementing a computer-based pronunciation analysis system configured to detect phoneme mispronunciation and provide phonological feature feedback based on a speech representation of a sampled speech waveform and expected linguistic content that includes an expected phoneme sequence, configure the one or more processors to perform the method of claim 1 (Ward Column 13 Lines 49-65 -  Alternatively, a single output node may be used for the binary classification of male or female. The output of this example would be to classify spoken word as either male or female. Any number of classifications may be used to classify speech by output neural network stack 706. For multi-class classification, such as semantic topic, emotion or sentiment, speaker identification, speaker age, or speaker stress or strain, a single output node may be provided in output layer 706 for each potential classification, where the value of each output node is the probability that the spoken word or words corresponds to the associated classification. While not illustrated, there may be a customization layer that modifies the output of output neural network stack 706 similar to customization layer 207 discussed in connection with FIG. 2. A customization layer may alter predicted outputs based on some external guidance, similar to customization layer 207.). 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 7 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Ward in view of Thomson et al. (US 10388272 B1) (further referred to as “Thomson”).

Regarding Claim 7, Ward teaches all of the limitations of claim 6. Thomson further teaches that the temporally locating comprises processing the expected phoneme sequence and the predicted phoneme class probabilities with a finite state transducer (Thomson Column 181 Lines 18-19 - For example, the denormalizer 5830, in some embodiments, may include a finite state transducer.).
Ward and Thomson are both considered to be analogous to the claimed invention because both relate to speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Ward based on Thomson to better train the neural network system for increased accuracy in processing the user's statements (Thomson Column 181 Lines 23-28 - Alternatively or additionally, the environment 5800 may be used to train other models. For example, using features as a first input to the model trainer 5802 and target values as a second input, the environment 5800 may be used to train models for capitalization, punctuation, accuracy estimation, or transcription unit selection.).

Regarding Claim 15, Ward teaches all of the limitations of claim 1. Thomson further teaches that the second trained universal function approximator comprises a deep neural network (Thomson Column 119 Lines 9-14 - Additionally or alternatively, nodes in the neural network may be organized in layers. The neural network may have as few as one layer or it may have multiple layers as in deep neural networks (DNNs). The neural network may be feed-forward so that all connections send signals towards the output.).
Ward and Thomson are both considered to be analogous to the claimed invention because both relate to speech processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Ward based on Thomson to create a better and more accurate sensitive information model (Thomson Column 207 Lines 27-37 - A machine learning method, such as logistic regression or deep neural network training or another method from Table 9, may process the marked corpus to learn patterns associated with sensitive information and to create a sensitive information model. Once the model is created, a classifier may use the sensitive information model to identify n-grams likely to contain sensitive information. 25. The n-gram may contain at least one specified combination of sensitive information, where sensitive information may be one or more of the items listed above.).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Prabhavalkar et al. (US 20200043483 A1).
Prabhavalkar et al. (US 20200043483 A1) teaches “methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models” (Prabhavalkar – Abstract).
Please see the additional references in form PTO-892 for more details.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to UTHEJ KUNAMNENI whose telephone number is (571) 272-5428. The examiner can normally be reached M-F 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/UTHEJ KUNAMNENI/ Examiner, Art Unit 2656
/EDGAR X GUERRA-ERAZO/ Primary Examiner, Art Unit 2656