Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1 and 2 are rejected under 35 U.S.C. 103 as being unpatentable over Cannon (US-20220164472-A1) in view of Sohail (US-11508392-B1), Choi (US-20170053652-A1), and Sivaram (US-20210241776-A1).

With respect to claim 1, Cannon teaches a system for machine learning assisted speech scoring, comprising:
a neural network having nodes (Cannon: ¶[0120] In an embodiment, a Deep Neural Network (DNN) 620 receives the transcription from the memory 618);
a memory storing executable software code (Cannon: ¶[0006] The computer system includes a processor, a computer-readable memory, and a computer-readable storage medium, and program instructions stored on the storage medium for execution by the processor via the memory.),
wherein the executable software code includes a software framework (Cannon: ¶[0034] explanation algorithms can be used to determine features that caused classifier to predict sensitive category/classification (e.g., using open source Python libraries [software framework])), a preprocessing submodule (Cannon: ¶[0120] In an embodiment, a Deep Neural Network (DNN) 620 receives the transcription from the memory 618 and applies machine learning algorithms to extract grammatical features of the dialogue (e.g., to identify tone, sentiment, etc.), eliminate noise [preprocessing submodule]), a transcriber class (Cannon: ¶[0120] In an embodiment, the NLP module [transcriber class] 616 applies speech-to-text NLP algorithms to generate a text transcription of the incoming audio.), a confidence submodule (Cannon: ¶[0120] and output a classification indicating whether the post include sensitive information and a confidence score indicative of a degree of certainty the DNN has in the classification output), [[and an application programming interface]];
Cannon does not explicitly disclose, but Sohail teaches, an application programming interface (Sohail: Col. 18, ll. 45-57: Specialized components 430 can include, for example, transcriber 434, document understanding model 436, parts-of-speech tagger 438, boost generator 440, intent generator 442, constraint applier 444, content item mapper 446, storyline builder 448, context identifier 450, output provider interface(s) 452, conversation integrator 454, and components and APIs that can be used for providing user interfaces).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Cannon in view of Sohail to have an application programming interface in order for items to be selected using external providers, connected through provider interfaces (Sohail: Col. 21, ll. 18-20).
Cannon and Sohail do not explicitly teach, but Choi teaches, a processor for implementing commands of the executable software code, wherein the commands include directing the processor to instantiate transcribers from the transcriber class (Choi: ¶[0055] The first recognizer 110 may output a first recognition result of an input audio [input audio instantiates the transcriber] signal in a linguistic recognition unit by using an acoustic model (AM).), to invoke the preprocessing submodule (Choi: ¶[0055] The first recognizer 110 may output a first recognition result of an input audio signal in a linguistic recognition unit by using an acoustic model (AM). In this case, as only an example and noting that alternatives are available in differing embodiments, the audio signal may be converted into audio frames (e.g., 100 frames per second) through one or more preprocessing processes), and to ensemble the transcribers (Choi: ¶[0054] Referring to FIG. 1, the speech recognition apparatus 100 includes a first recognizer 110, a second recognizer 120, and a combiner 130, for example.), [[wherein the preprocessing submodule is configured to downsample a raw audio file into an audio file]]; and
wherein each node of the neural network includes one or more of the transcribers (Choi: ¶[0021] The generating of the first recognition result [first transcriber which is the output node of neural network; see also Fig. 2] may include generating a recognition result of the audio signal in the first linguistic recognition unit; Choi: ¶[0061] The second recognizer 120 may output a second recognition result [second transcriber which is the output node of neural network; see also Fig. 2] in a linguistic recognition unit by using a language model (LM), in which the second recognition result may include a linguistic recognition unit, e.g., alphabetic or syllabic probability information or state information), wherein the transcribers are configured to create text (Choi: ¶[0056] In addition, herein, the linguistic recognition unit refers to a predetermined linguistic unit to be recognized among basic units in a language, such as phonemes, syllables, morphemes, words, phrases, sentences, paragraphs) from the audio file.
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Cannon and Sohail in view of Choi to instantiate transcribers from the transcriber class in order to provide the final recognition result to a node of the neural network that represents an input of the language model (Choi: ¶[0014]).
Cannon, Sohail, and Choi do not explicitly disclose, but Sivaram teaches, wherein the preprocessing submodule is configured to downsample a raw audio file into an audio file (Sivaram: ¶[0087] In some cases, where the raw audio file originates via a communications channel configured for (and that generates audio signals having) a relatively wide bandwidth (e.g., 16 kHz), the input layers 402 can down-sample or execute a codec on the raw audio file to produce a corresponding simulated narrowband audio signal.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Cannon, Sohail, and Choi in view of Sivaram such that the preprocessing submodule is configured to downsample a raw audio file into an audio file in order to reduce the dimensions of the feature vectors (Sivaram: ¶[0105]).

With respect to claim 2, Sohail further teaches wherein the transcriber class is encapsulated by the application programming interface (Sohail: Col. 18, ll. 45-57: Specialized components 430 can include, for example, transcriber 434, document understanding model 436, parts-of-speech tagger 438, boost generator 440, intent generator 442, constraint applier 444, content item mapper 446, storyline builder 448, context identifier 450, output provider interface(s) 452, conversation integrator 454, and components and APIs that can be used for providing user interfaces).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Cannon in view of Sohail to have an application programming interface in order for items to be selected using external providers, connected through provider interfaces (Sohail: Col. 21, ll. 18-20).

Claims 3 and 4 are rejected under 35 U.S.C. 103 as being unpatentable over Cannon, Sohail, Choi, and Sivaram as applied to claims 1 and 3, respectively, in further view of Relin (US-20210312901-A1).

With respect to claim 3, Cannon, Sohail, Choi, and Sivaram do not explicitly disclose, but Relin teaches, wherein the neural network is configured to score the text (Relin: ¶[0055] The virtual assistant of FIG. 1 includes ASR 11, which receives the speech audio and outputs a text transcription or multiple hypothesized text transcriptions with a score for each one representing the probability that it is correct; and Relin: ¶[0057] Speech recognition involves at least the steps of acoustic analysis and tokenization. FIG. 3 shows a diagram of an example of speech recognition. An acoustic analysis step 31 receives speech audio and performs acoustic analysis according to an acoustic model 32. Some examples of acoustic models are hidden Markov models (HMM) and neural network acoustic models).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Cannon, Sohail, Choi, and Sivaram in view of Relin to have the neural network configured to score the text in order to automatically enhance natural language (NLU) recognition (Relin: ¶[0012]).

With respect to claim 4, Cannon, Sohail, Choi, and Sivaram do not explicitly disclose, but Relin teaches, wherein the confidence submodule is configured to calculate probabilities that the text was transcribed accurately (Relin: ¶[0055] The virtual assistant of FIG. 1 includes ASR 11, which receives the speech audio and outputs a text transcription or multiple hypothesized text transcriptions with a score for each one representing the probability that it is correct; Relin: ¶[0112] Various examples described above may be implemented with computers by running software embodied as instructions on non-transitory computer readable media.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Cannon, Sohail, Choi, and Sivaram in view of Relin to have the confidence submodule configured to calculate probabilities that the text was transcribed accurately in order to automatically enhance natural language (NLU) recognition (Relin: ¶[0012]).


Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Cannon, Sohail, Choi, and Sivaram as applied to claims 1 and 3 respectively in further view of Biadsy (US-20210312901-A1).

With respect to claim 5, Choi further teaches wherein the system is further configured to transcribe speech and predict the score in parallel and [[to combine a plurality of scores to predict a final score]] (Choi: ¶[0067] In addition, in one or more embodiments, the acoustic model, the language model, and the unified model are trained in advance to output probabilities [score] or state information in a predetermined linguistic recognition unit, for example; and Choi: ¶[0087] In one or more embodiments, operation 420 may be initiated after the initiation of operation 410, operation 420 may begin before operation 410, or they may begin at the same time [parallel], depending on embodiment.).

Cannon, Sohail, Choi, and Sivaram do not explicitly disclose, but Biadsy teaches, combining a plurality of scores to predict a final score (Biadsy: ¶[0095] For example, for the first candidate transcription, the re-scoring module may combine scores 155 from the language model 150 for the individual words “hair,” “mousse,” and “beach” to determine an overall score for the phrase “hair mousse beach.”).

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Cannon, Sohail, Choi, and Sivaram in view of Biadsy to combine a plurality of scores to predict a final score in order to provide data indicating candidate transcriptions (Biadsy: ¶[0172]).

Claims 6 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Choi (US-20170053652-A1) in view of Jankowski (US-20200074997-A1).

With respect to claim 6, Choi teaches [[preprocessing an audio file to filter out unscorable audio and to downsample scorable audio]];
transcribing the audio file among a plurality of automated transcribers (Choi: ¶[0021] The generating of the first recognition result [first transcriber which is the output node of neural network; see also Fig. 2] may include generating a recognition result of the audio signal in the first linguistic recognition unit; Choi: ¶[0061] The second recognizer 120 may output a second recognition result [second transcriber which is the output node of neural network; see also Fig. 2] in a linguistic recognition unit by using a language model (LM), in which the second recognition result may include a linguistic recognition unit, e.g., alphabetic or syllabic probability information or state information) into a plurality of transcripts (Choi: ¶[0056] In addition, herein, the linguistic recognition unit refers to a predetermined linguistic unit to be recognized among basic units in a language, such as phonemes, syllables, morphemes, words, phrases, sentences, paragraphs); and
scoring the plurality of transcripts among nodes of a neural network to create a plurality of scores (Choi: ¶[0067] In addition, in one or more embodiments, the acoustic model, the language model, and the unified model are trained in advance to output probabilities [score] or state information in a predetermined linguistic recognition unit, for example; and Choi: ¶[0087] In one or more embodiments, operation 420 may be initiated after the initiation of operation 410, operation 420 may begin before operation 410, or they may begin at the same time [parallel], depending on embodiment.), wherein the transcribing and the scoring is performed in parallel (Choi: ¶[0067] In addition, in one or more embodiments, the acoustic model, the language model, and the unified model are trained in advance to output probabilities or state information in a predetermined linguistic recognition unit, for example; and Choi: ¶[0087] In one or more embodiments, operation 420 may be initiated after the initiation of operation 410, operation 420 may begin before operation 410, or they may begin at the same time [parallel], depending on embodiment.).
Choi does not explicitly disclose, but Jankowski teaches, preprocessing an audio file to filter out unscorable audio (Jankowski: ¶[0067] Denoising refers to a process which removes noise from the audio representation thus allowing the classifier to better discriminate between speech and non-speech) and to downsample scorable audio (Jankowski: ¶[0137] In some embodiments the system can be configured to down-sample AISHELL to 8 kHz).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Choi in view of Jankowski to preprocess an audio file to filter out unscorable audio in order to provide the system with much greater robustness to noise (Jankowski: ¶[0068]).
With respect to claim 8, Jankowski further teaches wherein the unscorable audio is an audio file that contains no speech, that is longer than a predetermined time, that is corrupted, or that contains speech from multiple speakers (Jankowski: ¶[0067] Denoising refers to a process which removes noise from the audio representation thus allowing the classifier to better discriminate between speech and non-speech) and to downsample scorable audio (Jankowski: ¶[0137] In some embodiments the system can be configured to down-sample AISHELL to 8 kHz).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Choi in view of Jankowski to preprocess an audio file to filter out unscorable audio in order to provide the system with much greater robustness to noise (Jankowski: ¶[0068]).


Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Choi and Jankowski as applied to claim 6, in further view of Biadsy.

With respect to claim 7, Choi teaches further comprising ensembling the plurality of transcripts and the plurality of scores [[to predict a final score]] (Choi: ¶[0054] Referring to FIG. 1, the speech recognition apparatus 100 includes a first recognizer 110, a second recognizer 120, and a combiner 130, for example; Choi: ¶[0067] In addition, in one or more embodiments, the acoustic model, the language model, and the unified model are trained in advance to output probabilities [score] or state information in a predetermined linguistic recognition unit, for example; and Choi: ¶[0087] In one or more embodiments, operation 420 may be initiated after the initiation of operation 410, operation 420 may begin before operation 410, or they may begin at the same time [parallel], depending on embodiment.).
Choi and Jankowski do not explicitly teach, but Biadsy teaches, combining a plurality of scores to predict a final score (Biadsy: ¶[0095] For example, for the first candidate transcription, the re-scoring module may combine scores 155 from the language model 150 for the individual words “hair,” “mousse,” and “beach” to determine an overall score for the phrase “hair mousse beach.”).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Choi and Jankowski in view of Biadsy to combine a plurality of scores to predict a final score in order to provide data indicating candidate transcriptions (Biadsy: ¶[0172]).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Choi and Jankowski as applied to claim 6, in further view of Liu (US-20170125031-A1).

With respect to claim 9, Choi and Jankowski do not explicitly disclose, but Liu teaches, wherein preprocessing further comprises creating a condition code model (Liu: ¶[0525] In this embodiment of the present disclosure, the processor 1001 executes the code or instruction in the memory 1005, to: perform time-frequency transformation processing on a time-domain signal of a current audio frame, to obtain spectral coefficients of the current audio frame; acquire a reference coding parameter of the current audio frame; and if the acquired reference coding parameter of the current audio frame satisfies a first parameter condition, code the spectral coefficients of the current audio frame based on a transform coded excitation algorithm, or if the acquired reference coding parameter of the current audio frame satisfies a second parameter condition, code the spectral coefficients of the current audio frame based on a high quality transform coding algorithm.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Choi and Jankowski in view of Liu to have preprocessing further comprise creating a condition code model in order to improve coding quality or coding efficiency of the current audio frame (Liu: ¶[0190]).


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ATHAR N PASHA whose telephone number is (408)918-7675. The examiner can normally be reached Monday-Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ATHAR N PASHA/Examiner, Art Unit 2657  

/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657