Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/23/2021 has been entered.
 
Compact Prosecution
A training system configured to generate a subsequent trained ML model from training data by an ML model trainer,
wherein the ML model trainer is configured to automatically generate multiple different ML models from the same or similar training data for comparison.
(Please see Section 0023 of applicant’s specification for details.)

Response to Arguments
Applicant’s arguments with respect to claim(s) 1, 9 and 17 have been considered but are moot because the new ground of rejection does not rely on any reference 
Regarding Claim 1 and other similar independent claims (9 and 17), applicant argues the combination of the cited references does not disclose the limitation “ …the sequence to sequence model is trained to receive feature vectors from the feature aggregation model, identify one or more sequences, the one or more sequences including a previous state and a current state and generate a prediction sequence score for at least one of the one or more identified sequences, the prediction sequence score representing a likelihood of the at least one of the more identified sequences being indicative of the at least one of the one or more speech features.” 
Examiner respectfully disagrees because the combination of Thomason (US 20200175961) in view of Kim (US 20200005794) discloses the limitation, as explained below.
Thomason discloses that the sequence to sequence model is trained to receive feature vectors from the feature aggregation model (Section 0255, lines 1-8: “the probability calculator receives a vector of feature”; Table 1, item 7 discusses multiple clusters (aggregate model) that can be used to aggregate features of vectors, see Section 1132), identify one or more sequences, the one or more sequences including a previous state and a current state (Section 0229, lines 1-3: “past accuracy estimates” reads on the previous state of the sequences)
(Regarding the previous state and the current state, the secondary reference Kim (US 20200005794), Section 0101, teaches back propagation, which feeds the total loss from the past (previous state) back into the current (current state) estimation)
and generate a prediction sequence for at least one of the one or more identified sequences (Section 0255, lines 1-8: “the probability calculator receives a vector of features and determine a set of probabilities such as phoneme probabilities” reads on the prediction sequences), the prediction sequence representing a likelihood of the at least one of the more identified sequences being indicative of the at least one of the one or more speech features. (Section 0255, lines 4-7: the phoneme probabilities (likelihood) indicate the probability that the audio sample described in the vector of features is a particular phoneme of speech, which means that at least one of the one or more speech features is indicated by the phoneme probability)

Figure 1: Probabilities (Likelihood) of Phoneme is computed.

Thomason does not express the phoneme probability as a score.
Kim teaches calculating a probability score by combining the correspondence probability and the placement probability of uttered speech using acoustic models (Section 0115).

Figure 2 shows how probability is calculated as a score
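
For illustration only, the kind of score combination attributed to Kim above can be sketched as follows. This is a minimal sketch and not Kim's method: the product rule (summed in the log domain), the function names combined_score and best_candidate, and the sample probabilities are all assumptions introduced here, since this Office action does not state how Kim combines the two probabilities.

```python
import math


def combined_score(correspondence_prob: float, placement_prob: float) -> float:
    """Combine two probabilities into one log-domain score.

    Assumption: the two probabilities are multiplied, which is a sum of logs.
    """
    for p in (correspondence_prob, placement_prob):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
    return math.log(correspondence_prob) + math.log(placement_prob)


def best_candidate(candidates):
    """Return the label of the (label, p_corr, p_place) tuple with the highest score."""
    return max(candidates, key=lambda c: combined_score(c[1], c[2]))[0]


# Hypothetical candidates: (label, correspondence probability, placement probability)
hypotheses = [("a", 0.9, 0.5), ("b", 0.6, 0.9)]
selected = best_candidate(hypotheses)
```

A single scalar score of this kind is what allows candidates to be ranked directly, which is the motivation the rejection relies on.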

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system taught by Thomason to include the teaching of encoding networks, as taught by Kim, in order to obtain a predicted score. The motivation is that the score makes it easier for the system to determine which transcription unit to select for processing.
Applicant argues the same limitation for claims 7, 9, 15, 17 and 18, and therefore the response above is used to explain why Thomason reads on the limitation.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
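A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.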

Claims 1-6, 8-14, 16-17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Thomason (US 20200175961) in view of Kim (US 20200005794).
Claim 1, Thomason discloses a data processing system (Section 1514 Environment or system 8400 shown in Fig. 78) comprising:
a processor; (Model processors 8402 in Fig. 78 in Section 1518)  
and a memory in communication with the processor, the memory comprising executable instructions that, when executed by the processor, (Memory 9112 in fig. 84 other embodiments works together with processor 8402 to execute the system) cause the data processing system to perform functions (Section 1530, lines 14-16 the method shown in El. 8500 are executed based on instructions stored on one or more non-transitory computer readable) of: 
receiving speech audio data; (Section 1600, lines 8-10- the provided audio to the external transcription reads on the received speech audio data). 
performing preprocessing on the speech audio data to prepare the speech audio data (Section 1520, lines 2-5- thus the environment 8400 prepares the obtained audio data  before the Automatic speech recognition (ASR) system uses it for training models) for use in one or more models that detect one or more speech features; (Section 1501- thus a switched model is selected based on one or more features as shown in Tables 2 and Table 5). 
 and providing the preprocessed speech audio data to a stacked machine learning (ML) model for the stacked ML model (Section 1603, lines 2-4- thus models 8802a, 8802b, 8803c and model processor 8402 reads on the stacked ML model) (Section 1602, feature extractor may be configured to determine a first set of features from audio) 
(In regard to the stacked machine learning (ML) model, Thomason teaches a combination of models that may include different sets of models; see Section 1512)
wherein the stacked ML model includes a feature aggregation model, (Section 1518, lines 5-8: model processor 8402 receives a set of features derived from the source audio; thus the model processor 8402 collects the features which have been extracted from the speech of the user)
 a sequence to sequence model, (Section 0251, lines 1-4: the feature transformer model) and a decision-making model. (Thomason: Section 1544 teaches that a deep neural network (DNN) can be a model such as feature transformation)
the sequence to sequence model is trained to receive feature vectors from the feature aggregation model (Section 0251, lines 1-4: “the feature transformer may be configured to convert the extracted features based on a transformed model into a transformed format”; Section 0255, lines 1-8: “the probability calculator receives a vector of feature”; Table 1, item 7 discusses multiple clusters (aggregate model) that can be used to aggregate features of vectors, see Section 1132), identify one or more sequences, the one or more sequences including a previous state and a current state (Section 0229, lines 1-3: “past accuracy estimates” reads on the previous state of the sequences)
(Regarding the previous state and the current state, the secondary reference Kim (US 20200005794), Section 0101, teaches back propagation, which feeds the total loss from the past (previous state) back into the current (current state) estimation)
and generate a prediction sequence for at least one of the one or more identified sequences (Section 0255, lines 1-8: “the probability calculator receives a vector of features and determine a set of probabilities such as phoneme probabilities” reads on the prediction sequences), the prediction sequence representing a likelihood of the at least one of the more identified sequences being indicative of the at least one of the one or more speech features. (Section 0255, lines 4-7: the phoneme probabilities (likelihood) indicate the probability that the audio sample described in the vector of features is a particular phoneme of speech, which means that at least one of the one or more speech features is indicated by the phoneme probability)

Figure 3: Probabilities (Likelihood) of Phoneme is computed.

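Thomason does not express the phoneme probability as a score.
Kim teaches calculating a probability score by combining the correspondence probability and the placement probability of uttered speech using acoustic models (Section 0115).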


Figure 4 shows how probability is calculated as a score

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system taught by Thomason to include the teaching of encoding networks, as taught by Kim, in order to obtain a predicted score. The motivation is that the score makes it easier for the system to determine which transcription unit to select for processing.

Claim 2, Thomason in view of Kim discloses wherein the feature aggregation model is an attention-based feature aggregation model for aggregating features in the preprocessed speech data. (Thomason: Section 0198, lines 5-13: model processor 8402 receives a set of features which is derived from the source audio)
 
Claim 3, Thomason in view of Kim discloses wherein the sequence to sequence model is a long short-term memory model (Thomason: Section 0618 lines 4-8 “ the long-time and short time memory neural network”)  for generating the one or more prediction sequence scores. (Thomason: Section 0198, lines 5-14- thus the predicted ASR system accuracy is used to determine if the selected transcription unit needs to be changed)
Claim 4, Thomason in view of Kim discloses wherein the decision-making model is a deep neural network for making a classification based on the prediction sequence score. (Thomason: Section 1544 teaches Deep neural network (DNN) can be a model such as feature transformation) 
Claim 5, Thomason in view of Kim discloses wherein the preprocessing includes segmenting the speech audio data into a plurality of utterances and labeling at least one of the plurality of utterances. (Thomason: Section 1495, lines 3-7: “the data (speech) from the speaker before it is trained is divided into multiple training set based on speaker age, phone number, gender, accent, language, voice patterns… voice characteristics”; this means that the data (speech) from the user is segmented into pluralities of data (voice, speech or utterance) based on labels such as accent, voice patterns or voice characteristics in order to be trained by the appropriate model)
Claim 6, Thomason in view of Kim discloses wherein the preprocessing further includes extracting one or more audio features (Thomason: Feature Extractor 8430 in Fig. 78) from the speech audio data for each of the plurality of utterances. (Thomason: Section 0108, lines 7-11: the speech recognition features are extracted from the first audio (speech audio from users))
Claim 8, Thomason in view of Kim discloses wherein the one or more speech features include at least one of a filler pause, clarity, stress level, and disfluency. (Thomason: Section 0112, lines 18-20- thus filler words such as um are speech features that are extracted from the user audio). 
Claim 9, Thomason discloses a data processing system (Section 1514 Environment or system 8400 shown in Fig. 78) comprising 
a processor; (Model processors 8402 in Fig. 78 in Section 1518) and a memory in communication with the processor the memory comprising executable instructions that when executed by the processor, (Memory 9112 in fig. 84 other embodiments works together with processor 8402 to execute the system) cause the data processing system to perform functions (Section 1530, lines 14-16 the method shown in El. 8500 are executed based on instructions stored on one or more non-transitory computer readable) of:
 receiving speech audio data; (Section 1600, lines 8-10- the provided audio to the external transcription reads on the received speech audio data). 
 performing preprocessing on the speech audio data to prepare the speech audio data (Section 1520, lines 2-5- thus the environment 8400 prepares the obtained audio data  before the Automatic speech recognition (ASR) system uses it for training models) for use as an input into one or more models that detect one or more speech features; (Section 1501- thus a switched model is selected based on one or more features as shown in Tables 2 and Table 5). 
providing the preprocessed speech audio data to a stacked ML model; and analyzing the preprocessed speech audio data via the stacked ML model (Section 1603, lines 2-4- thus models 8802a, 8802b, 8803c and model processor 8402 reads on the stacked ML model) to detect the one or more speech features, (Section 1602, feature extractor may be configured to determine a first set of features from audio) 
(In regard to the stacked machine learning (ML) model, Thomason teaches a combination of models that may include different sets of models; see Section 1512)
wherein the stacked ML model includes a feature aggregation model, (Section 1518, lines 5-8: model processor 8402 receives a set of features derived from the source audio; thus the model processor 8402 collects the features which have been extracted from the speech of the user)
a sequence to sequence model, (Section 0251, lines 1-4: the feature transformer model) and a decision-making model. (Thomason: Section 1544 teaches that a deep neural network (DNN) can be a model such as feature transformation)
the sequence to sequence model is trained to receive feature vectors from the feature aggregation model (Section 0251, lines 1-4: “the feature transformer may be configured to convert the extracted features based on a transformed model into a transformed format”; Section 0255, lines 1-8: “the probability calculator receives a vector of feature”; Table 1, item 7 discusses multiple clusters (aggregate model) that can be used to aggregate features of vectors, see Section 1132), identify one or more sequences, the one or more sequences including a previous state and a current state (Section 0229, lines 1-3: “past accuracy estimates” reads on the previous state of the sequences)
(Regarding the previous state and the current state, the secondary reference Kim (US 20200005794), Section 0101, teaches back propagation, which feeds the total loss from the past (previous state) back into the current (current state) estimation)
and generate a prediction sequence for at least one of the one or more identified sequences (Section 0255, lines 1-8: “the probability calculator receives a vector of features and determine a set of probabilities such as phoneme probabilities” reads on the prediction sequences), the prediction sequence representing a likelihood of the at least one of the more identified sequences being indicative of the at least one of the one or more speech features. (Section 0255, lines 4-7: the phoneme probabilities (likelihood) indicate the probability that the audio sample described in the vector of features is a particular phoneme of speech, which means that at least one of the one or more speech features is indicated by the phoneme probability)

Figure 5: Probabilities (Likelihood) of Phoneme is computed.

Thomason does not express the phoneme probability as a score.
Kim teaches calculating a probability score by combining the correspondence probability and the placement probability of uttered speech using acoustic models (Section 0115).

Figure 6 shows how probability is calculated as a score

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system taught by Thomason to include the teaching of encoding networks, as taught by Kim, in order to obtain a predicted score. The motivation is that the score makes it easier for the system to determine which transcription unit to select for processing.

Claim 10, Thomason in view of Kim discloses wherein the feature aggregation model is a deep convolutional neural network for aggregating features in the preprocessed speech data. (Thomason: Section 1544 teaches Deep neural network (DNN) can be a model used for extracting audio features) 
Claim 11, Thomason in view of Kim discloses wherein the sequence to sequence model is a long short-term memory model (Thomason: Section 0618 lines 4-8 “ the long-time and short time memory neural network”)  for generating the one or more prediction sequence scores. (Thomason: Section 0198, lines 5-14- thus the predicted ASR system accuracy is used to determine if the selected transcription unit needs to be changed)
Claim 12, Thomason in view of Kim discloses wherein the decision-making model is a deep neural network for making a classification based on the prediction sequence score. (Thomason: Section 1506: switched model that delivers the highest likelihood score)
Claim 13, Thomason in view of Kim discloses wherein the preprocessing includes segmenting the speech audio data into a plurality of utterances. (Thomason: Section 0158, lines 9-14: “the first transcription, the second transcription and the third transcription based on the input from the CA before being provided to the fuser for further processing”; hence, from the speech audio input, a plurality of utterances, which are transcribed into the 1st, 2nd and 3rd transcriptions, are provided before (preprocessing) further processing by the fuser).
Claim 14, Thomason in view of Kim discloses wherein the preprocessing further includes extracting one or more audio features from the speech audio data for each of the plurality of utterances. (Thomason: Section 1495, lines 3-7: “the data (speech) from the speaker before it is trained is divided into multiple training set based on speaker age, phone number, gender, accent, language, voice patterns… voice characteristics”; this means that the data (speech) from the user is segmented into pluralities of data (voice, speech or utterance) based on labels such as accent, voice patterns or voice characteristics in order to be trained by the appropriate model)

Claim 16, Thomason in view of Kim discloses wherein the one or more speech features include at least one of a filler pause, clarity, stress level, and disfluency. (Thomason: Section 0112, lines 18-20- thus filler words such as um are speech features that are extracted from the user audio).
Claim 17, Thomason discloses a method for detecting one or more speech features in speech audio data (Section 1602: the feature extractor detects audio features from the set of audio) comprising:
 receiving the speech audio data; (Section 1600, lines 8-10- the provided audio to the external transcription reads on the received speech audio data)
 performing preprocessing on the speech audio data to prepare the speech audio data (Section 1520, lines 2-5- thus the environment 8400 prepares the obtained audio data  before the Automatic speech recognition (ASR) system uses it for training models) for use as an input into one or more models that detect the one or more speech features; (Section 1501- thus a switched model is selected based on one or more features as shown in Tables 2 and Table 5)
 providing the preprocessed speech audio data to a stacked ML model; (Section 1603, lines 2-4- thus models 8802a, 8802b, 8803c and model processor 8402 reads on the stacked ML model)
 and analyzing the preprocessed speech audio data via the stacked ML model to detect the one or more speech features, (Section 1602, feature extractor may be configured to determine a first set of features from audio) 
(In regard to the stacked machine learning (ML) model, Thomason teaches a combination of models that may include different sets of models; see Section 1512)
 wherein the stacked ML model includes a feature aggregation model, (Section 1518, lines 5-8: model processor 8402 receives a set of features derived from the source audio; thus the model processor 8402 collects the features which have been extracted from the speech of the user)
a sequence to sequence model, (Section 0251, lines 1-4: the feature transformer model) and a decision-making model. (Thomason: Section 1544 teaches that a deep neural network (DNN) can be a model such as feature transformation)
the sequence to sequence model is trained to receive feature vectors from the feature aggregation model (Section 0251, lines 1-4: “the feature transformer may be configured to convert the extracted features based on a transformed model into a transformed format”; Section 0255, lines 1-8: “the probability calculator receives a vector of feature”; Table 1, item 7 discusses multiple clusters (aggregate model) that can be used to aggregate features of vectors, see Section 1132), identify one or more sequences, the one or more sequences including a previous state and a current state (Section 0229, lines 1-3: “past accuracy estimates” reads on the previous state of the sequences)
(Regarding the previous state and the current state, the secondary reference Kim (US 20200005794), Section 0101, teaches back propagation, which feeds the total loss from the past (previous state) back into the current (current state) estimation)
and generate a prediction sequence for at least one of the one or more identified sequences (Section 0255, lines 1-8: “the probability calculator receives a vector of features and determine a set of probabilities such as phoneme probabilities” reads on the prediction sequences), the prediction sequence representing a likelihood of the at least one of the more identified sequences being indicative of the at least one of the one or more speech features. (Section 0255, lines 4-7: the phoneme probabilities (likelihood) indicate the probability that the audio sample described in the vector of features is a particular phoneme of speech, which means that at least one of the one or more speech features is indicated by the phoneme probability)

Figure 7: Probabilities (Likelihood) of Phoneme is computed.

Thomason does not express the phoneme probability as a score.
Kim teaches calculating a probability score by combining the correspondence probability and the placement probability of uttered speech using acoustic models (Section 0115).

Figure 8 shows how probability is calculated as a score

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the system taught by Thomason to include the teaching of encoding networks, as taught by Kim, in order to obtain a predicted score. The motivation is that the score makes it easier for the system to determine which transcription unit to select for processing.


Claim 19, Thomason in view of Kim discloses wherein the one or more extracted features include at least one of one or more Mel-frequency cepstral coefficients (MFCCs), normalized continuous pitch, probability of voicing, pitch delta, a number of formant frequencies and one or more bands for each formant frequency. (Thomason: Section 0249, lines 1-6: the features extracted are MFCCs)

Claim Rejections - 35 USC § 103
Claims 7, 15 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Thomason (US 20200175961) in view of Kim as applied to claims 1-6, 8-14 and 16-20 above, and further in view of Iso-Sipila (US 20030115054).
Claims 7 and 15, Thomason in view of Kim does not disclose wherein the preprocessing further includes down sampling the one or more audio features for each of the plurality of utterances.
Iso-Sipila discloses a front end system for speech recognition whose preprocessing further includes down sampling the one or more audio features for each of the plurality of utterances (Section 0029: a down-sampling device reduces the sampling rate of the speech features prior to conveying the third signal to the distributed speech recognition).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Thomason in view of Kim to include the teaching of down sampling the speech features before processing the audio. The motivation is to decrease the bit rate when processing the audio data (such as in automatic speech recognition).
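
For illustration only, down sampling a sequence of per-frame speech features can be sketched as below. This is a minimal sketch under stated assumptions and not Iso-Sipila's implementation: it keeps every factor-th frame without any low-pass filtering, and the function name downsample_features, the factor parameter and the sample frames are hypothetical.

```python
from typing import List, Sequence


def downsample_features(frames: Sequence[Sequence[float]], factor: int) -> List[List[float]]:
    """Keep every `factor`-th feature frame (plain decimation, no filtering)."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [list(frame) for frame in frames[::factor]]


# Hypothetical feature frames for one utterance (one short vector per frame).
utterance_frames = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
reduced = downsample_features(utterance_frames, 2)
```

Halving the frame rate in this way halves the number of feature vectors that must be conveyed, which corresponds to the bit-rate motivation stated above.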
Claim 18, Thomason in view of Kim discloses wherein preprocessing the speech audio data includes segmenting the speech audio data into one or more utterances, (Thomason: Section 0158, lines 9-14: “the first transcription, the second transcription and the third transcription based on the input from the CA before being provided to the fuser for further processing”; hence, from the speech audio input, a plurality of utterances, which are transcribed into the 1st, 2nd and 3rd transcriptions, are provided before (preprocessing) further processing by the fuser)
Thomason in view of Kim does not disclose down sampling the one or more extracted features to generate low-level feature vectors for providing to the stacked ML model.
Iso-Sipila discloses a front end system for speech recognition whose preprocessing further includes down sampling the one or more audio features for each of the plurality of utterances (Section 0029: a down-sampling device reduces the sampling rate of the speech features prior to conveying the third signal to the distributed speech recognition).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teaching of Thomason in view of Kim to include the teaching of down sampling the speech features before processing the audio. The motivation is to decrease the bit rate when processing the audio data (such as in automatic speech recognition).

Cited Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Mishra (US 20190172458) discloses a computer-implemented method for speech analysis comprising: collecting, on a computing device, a first group of utterances in a first language with an associated first set of mental states; storing, on an electronic storage device, the first group of utterances and the associated first set of mental states; training a machine learning system using the first group of utterances and the associated first set of mental states that were stored; processing, on the machine learning system that was trained, a second group of utterances from a second language, wherein the processing determines a second set of mental states corresponding to the second group of utterances; and outputting the second set of mental states.
Yadav (US 20190279618) discloses an electronic device that includes a processor. The processor is configured to identify a set of observable features associated with one or more users. The processor is also configured to generate a set of latent features from the set of observable features. The processor is additionally configured to sort the latent features into one or more clusters, each of the one or more clusters representing verbal utterances of a group of users that share a portion of the latent features.

Conclusion
 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Akwasi M Sarpong whose telephone number is (571)270-3438.  The examiner can normally be reached on Mon-Fri. 8:00am-4:00pm.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KING D POON can be reached on 571-272-7440.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/AKWASI M SARPONG/
Primary Examiner, Art Unit 2675
02/14/2022