DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
In response to the office action of 3/31/2021, the applicant has submitted an amendment, filed 6/30/2021, amending claims 1, 4, 13, 18, 20, while arguing to traverse the prior art rejections. Applicant’s arguments have been fully considered but are moot in view of the new grounds of rejection, which further rely on Chopra et al. (US Patent 10,529,324) as necessitated by the latest amendments.
Response to Arguments
In what follows applicant’s arguments and comments will be addressed in the order presented with each argument or comment presented in a given ¶, to be followed by one or more ¶’s of respective examiner’s responses.
Following a broad overview of the last office action on page 10 the 1st ¶, in section “I”, the previous claim objections are discussed.
Due to the latest amendments, all but one of the previous objections are overcome.
From the last ¶ on page 10 to the end of the 1st ¶ on page 14, after providing a detailed overview of the primary reference Zhou et al. (US 2018/0137857) as well as 
Since a new reference, Chopra et al., is applied to all of the recent amendments, the applicant is respectfully directed to the new office action for further details.
Page 14 the 2nd ¶ argues that the “Dependent Claims” “are also allowable at least for the reason described above”.
Since applicants have not argued the merits of these dependent claims, but assert patentability solely through their dependence on the allegedly patentable parent claims, they stand or fall with said parent claims and hence no further response to applicant’s arguments is necessary.
The remainder of page 14 and page 15 discuss the double patenting rejections of the last office action and in particular it is recited: “Applicant respectfully request that the Examiner hold this rejection in abeyance until all allowable subject matter has been indicated”.
The rejection with respect to 16/056,298 is thus maintained. The rejection with respect to 16/297,603 due to the latest amendments in the instant application is withdrawn.

Claim Objections
Claim 18 is objected to because of the following informalities: “an difference in length” appears to be a misspelling of “a difference in length”. Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claims 1, 13, 20, the limitation before last recites: “determining a predetermined number of the multiple ASR engines having” “transcripts that is better than the historical performance of remaining number of the multiple ASR engines”. This implies either that the “predetermined number of the multiple ASR engines” is one “ASR” “engine” associated with a single “transcript”, or that the verb “is” (presented in bold) was a typo intended for the verb “are”. The examiner adopted the former interpretation since in 
Claims 2-12 (dependent on claim 1) and 14-19 (dependent on claim 13) do not obviate the problem noted for their respective parent claims and are thus rejected under a similar rationale.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 5, 12-13, 16, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. (US 2018/0137857), and further in view of Chopra et al. (US Patent 10,529,324).
Regarding claim 1, Zhou et al. do teach a method for analyzing a transcription of a recording (Title: “METHOD FOR” “HYBRID SPEECH RECOGNITION” which according to ¶ 0015 lines 3-4 “The result can be” “text” (a transcription generated) “in a machine readable format” (and analyzed from), which according to step “204” (Fig. 2) operates on “RECORDED SPEECH INPUT” (a recording)),
comprising:

generating features representing transcriptions produced by multiple automatic
speech recognition (ASR) engines from voice activity in the recording (according to Fig. 2 step “204”: “GENERATE MULTIPLE CANDIDATE SPEECH RECOGNITION RESULTS” (i.e., “text” (transcriptions produced)) “USING MULTIPLE SPEECH RECOGNITION ENGINES” (by multiple automatic speech recognition engines) “BASED ON RECORDED SPEECH INPUT” (from voice activity in recording); step “208”: “EXTRACT ONE OR MORE OF TRIGGER PART FEATURES, CONFIDENCE SCORE FEATURES” (generating features) “FROM EACH CANDIDATE SPEECH RECOGNITION RESULT” (representing the transcriptions); also according to step “164” there are “WORD-LEVEL FEATURES” (another feature), and according to ¶ 0053 lines 9+: “Levenshtein distance metric that quantifies the differences between the speech recognition result and the predetermined ground-truth speech input training data” (another feature generated); ¶ 0033 lines 1-2: “feature extractor” “generates” “a” “bag-of-words with decay feature” (another feature generated))
and a best transcription of the recording produced by an ensemble model from the transcriptions (¶ 0046 last 5 lines: “The controller 148 identifies the candidate speech recognition result with the highest ranking score” (a best transcription of the recording) “based on the index of the output neuron” “that produces the highest ranking score within the neural network” (based on an ensemble model));
applying a machine learning model to the features to produce a score representing an accuracy of the best transcription (step “212”: “PROVIDE FEATURE VECTORS FOR MULTIPLE SPEECH RECOGNITION RESULTS AS INPUTS TO NEURAL NETWORK” (apply machine learning to the “FEATURE VECTORS” (the features)) “TO GENERATE RANKING SCORES” (to produce scores representing accuracy of the transcriptions and in particular the “highest ranking” (best) “result” (transcription))); and

storing the score in association with the best transcription (step “412”: “STORE TRAINED NEURAL NETWORK STRUCTURE” (storing transcriptions including the best transcription) “AND FEATURE VECTOR STRUCTURE” (e.g. “CONFIDENCE SCORE FEATURES” (the scores)); e.g., ¶ 0052 lines 6-7: “memory” “stores” (storing) “data corresponding to training input data”, where according to ¶ 0053 lines 5+: “The training speech recognition result data also include confidence scores” (“training” “data” comprises “speech recognition result” (e.g. best transcription) as well as their associated “confidence scores” (and its score) and they are both “stor[ed]”)).
Zhou et al. do not specifically disclose:
Wherein applying the machine learning model to the features to produce the score representing the accuracy of the best transcription comprises:
Determining a predetermined number of the multiple ASR engines having historical performance in generating transcripts that is better than the historical performance of a remaining number of the multiple ASR engines under predetermined conditions as contributor ASR engines, the pre-determined conditions comprising a selected metadata associated with the recording,
Determining a remaining number of the multiple ASR engines as selector ASR engines, and
Applying the machine learning model to produce the score representing error rates between the best transcription and the transcriptions from the selector ASR engines.
Chopra et al. do teach:
Determining a predetermined number of the multiple ASR engines having historical performance in generating transcripts that is better than the historical performance of a remaining number of the multiple ASR engines under predetermined conditions as contributor ASR engines, the pre-determined conditions comprising a selected metadata associated with the recording (Col. 9 lines 16+: “selecting” “using a balancer, an automatic speech recognition tool from a plurality of automatic speech recognition tools” (determining one ASR engine among a multiple of ASR engines) “for the geographic location” (under a predetermined condition, also considering “dialect” (Col. 9 line 14; another predetermined condition)) “based at least” “on a historical accuracy” (based on historical performance) “of the plurality of automatic speech recognition tools” (of a remaining number of the multiple ASR engines) “compris[ing] a neural network” (by applying machine learning) “that utilizes classification training” (comprising a selected metadata) “to improve the accuracy of the geographically dependent automatic speech recognition tool”),
Determining a remaining number of the multiple ASR engines as selector ASR engines, and Applying the machine learning model to produce the score representing error rates between the best transcription and the transcriptions from the selector ASR engines (Col. 7 lines 31+: “various ASR(s) may be utilized to generate the textual representation at 406, and each is assigned a confidence score” (producing a score) “to narrow or select the proper ASR based on the determined score” (which reflects error rate of the “proper” (best) “ASR” “textual representation” (transcription) with respect to all the other remaining (selector) ASR engines; Col. 7 lines 18+: “confidence score may be based off of historical statistical information” and according to Col. 9 lines 19-22: “historical accuracy” is determined based on “speech recognition tool compri[sing] a neural network” (“score” is determined based on “neural network” (machine learning))).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the methods used in determining “confidence score” in Chopra et al. into the methods of obtaining “confidence score” in Zhou et al., which would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable the Zhou et al. technique in “increasing the ability of the device to operate using voice or speech input” based on “obtaining geographical information relating to” “the user or the user’s device” as discussed in Chopra et al. Col. 7 lines 45-50.
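For illustration only, the contributor/selector partition recited in the limitation discussed above can be sketched in Python. The function name `split_engines`, the engine names, and the accuracy values are hypothetical and appear in neither the claims nor the cited references; the sketch further assumes that each engine's historical performance under the relevant conditions has already been reduced to a single accuracy value.

```python
from typing import Dict, List, Tuple


def split_engines(
    historical_accuracy: Dict[str, float], num_contributors: int
) -> Tuple[List[str], List[str]]:
    """Partition engines into contributors and selectors.

    The `num_contributors` engines with the best historical accuracy are
    treated as contributor engines; all remaining engines are treated as
    selector engines. Both names are illustrative assumptions.
    """
    if not 0 <= num_contributors <= len(historical_accuracy):
        raise ValueError("num_contributors is out of range")
    # Rank engine names from best to worst historical accuracy.
    ranked = sorted(
        historical_accuracy, key=historical_accuracy.get, reverse=True
    )
    return ranked[:num_contributors], ranked[num_contributors:]


# Hypothetical accuracy values for three hypothetical engines.
contributors, selectors = split_engines(
    {"engine_a": 0.90, "engine_b": 0.80, "engine_c": 0.95}, 2
)
```

With a predetermined number of one, the partition reduces to selecting the single best-performing engine, which corresponds to the interpretation adopted in the 112(b) discussion.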


Regarding claim 5, Zhou et al. do teach the method of claim 1, wherein generating the features representing the transcriptions produced by the multiple ASR engines from voice activity in the recording and the best transcription of the recording produced by the ensemble model from the transcriptions comprises:
generating a first set of features from the transcriptions (the “Levenshtein distance metric” according to ¶ 0054 line 4 item “2” “is at most the length of the longer string” (a first set of features); step “208” “CONFIDENCE SCORE FEATURES” (another first set of features)); 
generating a second set of features from pairwise comparisons of the transcriptions (the “Levenshtein distance metric” according to ¶ 0054 lines 3-4 item “1” does also determine a “difference of the sizes of the two strings” (a second set of features); furthermore according to ¶ 0053 lines 13-14: “Levenshtein distance metric” “is” an “edit distance” (also second set of features) which according to ¶ 0056 last 7 lines “accurately reflect” “level of correctness” “which” “include” “range of errors that affect ranking score”; according to ¶ 0054 last sentence: “edit distance” “describe the differences between the training speech recognition results and the corresponding ground-truth training inputs”; ¶ 0054 lines 11-12: “Hamming distance, in turn, refers to a metric of the minimum number of substitutions required to change one string into the other” (another second set of features reflecting pairwise comparisons of the transcriptions));
generating a third set of features from the best transcription (step “208” “CONFIDENCE SCORE FEATURE” helps “identif[y]” “candidate speech recognition result with the highest ranking score” (¶ 0047 last 5 lines) (best transcription attribute (third set of features)));
and generating a fourth set of features from the recording (¶ 0033 lines 1-2 and 8-9: “feature extractor” “generates” “a” “bag-of-words with decay feature” (fourth set of features generated) which is “based on the occurrence times and positions” (based on a position in the recording) “of the word within the result”; step “404” “TRIGGER PAIRS” and/or “WORD-LEVEL FEATURES” (another fourth set of features obtained from the audio)).
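For illustration only, the two edit-distance properties relied upon above (the Levenshtein distance is at least the difference of the sizes of the two strings and at most the length of the longer string) and the Hamming distance can be checked with a minimal Python sketch. The function names and sample strings are assumptions made for this example and are not taken from Zhou et al.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


# Sample strings chosen for this example only.
hypothesis, reference = "kitten", "sitting"
distance = levenshtein(hypothesis, reference)
lower_bound = abs(len(hypothesis) - len(reference))
upper_bound = max(len(hypothesis), len(reference))
```

Here `distance` lies between `lower_bound` and `upper_bound`, matching items “1” and “2” as quoted from ¶ 0054.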

Regarding claim 12, Zhou et al. do teach the method of claim 1, wherein the machine learning model comprises an artificial neural network (¶ 0006 lines 17+: “The method includes performing with the processor, a training process for a neural network” (using neural network) “ranker using the plurality of feature vectors”(to the feature vectors)).

Regarding claim 13, Zhou et al. do teach a non-transitory computer readable medium storing instructions that, when executed by a processor (¶ 0023: “The controller 148 includes one or more integrated circuits configured as one or a combination of a central processing unit (CPU), graphical processing unit (GPU), microcontroller, field programmable gate array (FPGA), application specific integrated circuit (ASIC), digital signal processor (DSP), or any other suitable digital logic device. The controller 148 also includes a memory, such as a solid state or magnetic data storage device, that stores programmed instructions for operation of the in-vehicle information system 100”),
cause the processor to perform the steps of:
generating features representing transcriptions produced by multiple automatic
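speech recognition (ASR) engines from voice activity in the recording (according to Fig. 2 step “204”: “GENERATE MULTIPLE CANDIDATE SPEECH RECOGNITION RESULTS” (i.e., “text” (transcriptions produced)) “USING MULTIPLE SPEECH RECOGNITION ENGINES” (by multiple automatic speech recognition engines) “BASED ON RECORDED SPEECH INPUT” (from voice activity in recording); step “208”: “EXTRACT ONE OR MORE OF TRIGGER PART FEATURES, CONFIDENCE SCORE FEATURES” (generating features) “FROM EACH CANDIDATE SPEECH RECOGNITION RESULT” (representing the transcriptions); also according to step “164” there are “WORD-LEVEL FEATURES” (another feature), and according to ¶ 0053 lines 9+: “Levenshtein distance metric that quantifies the differences between the speech recognition result and the predetermined ground-truth speech input training data” (another feature generated); ¶ 0033 lines 1-2: “feature extractor” “generates” “a” “bag-of-words with decay feature” (another feature generated))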
and a best transcription of the recording produced by an ensemble model from the transcriptions (¶ 0046 last 5 lines: “The controller 148 identifies the candidate speech recognition result with the highest ranking score” (a best transcription of the recording) “based on the index of the output neuron” “that produces the highest ranking score within the neural network” (based on an ensemble model));
applying a machine learning model to the features to produce a score representing an accuracy of the best transcription (step “212”: “PROVIDE FEATURE VECTORS FOR MULTIPLE SPEECH RECOGNITION RESULTS AS INPUTS TO NEURAL NETWORK” (apply machine learning to the “FEATURE VECTORS” (the features)) “TO GENERATE RANKING SCORES” (to produce scores representing accuracy of the transcriptions and in particular the “highest ranking” (best) “result” (transcription))); and
storing the score in association with the best transcription (step “412”: “STORE TRAINED NEURAL NETWORK STRUCTURE” (storing transcriptions including the best transcription) “AND FEATURE VECTOR STRUCTURE” (e.g. “CONFIDENCE SCORE FEATURES” (the scores)); e.g., ¶ 0052 lines 6-7: “memory” “stores” (storing) “data corresponding to training input data”, where according to ¶ 0053 lines 5+: “The training speech recognition result data also include confidence scores” (“training” “data” comprises “speech recognition result” (e.g. best transcription) as well as their associated “confidence scores” (and its score) and they are both “stor[ed]”)).
Zhou et al. do not specifically disclose:
Wherein applying the machine learning model to the features to produce the score representing the accuracy of the best transcription comprises:
Determining a predetermined number of the multiple ASR engines having historical performance in generating transcripts that is better than the historical performance of a remaining number of the multiple ASR engines under predetermined conditions as contributor ASR engines, the pre-determined conditions comprising a selected metadata associated with the recording,
Determining a remaining number of the multiple ASR engines as selector ASR engines, and
Applying the machine learning model to produce the score representing error rates between the best transcription and the transcriptions from the selector ASR engines.
Chopra et al. do teach:
Determining a predetermined number of the multiple ASR engines having historical performance in generating transcripts that is better than the historical performance of a remaining number of the multiple ASR engines under predetermined conditions as contributor ASR engines, the pre-determined conditions comprising a selected metadata associated with the recording (Col. 9 lines 16+: “selecting” “using a balancer, an automatic speech recognition tool from a plurality of automatic speech recognition tools” (determining one ASR engine among a multiple of ASR engines) “for the geographic location” (under a predetermined condition also by considering “dialect” (Col. 9 line 14 (another predetermined condition)) “based at least” “on a historical accuracy” (based on historical performance) “of the plurality of automatic speech recognition tools” (of a remaining number of the multiple ASR engines) “compris[ing] a neural network” (by applying machine learning) “that utilizes classification training” (comprising a selected metadata) “to improve the accuracy of the geographically dependent automatic speech recognition tool”),
Determining a remaining number of the multiple ASR engines as selector ASR engines, and Applying the machine learning model to produce the score representing error rates between the best transcription and the transcriptions from the selector ASR engines (Col. 7 lines 31+: “various ASR(s) may be utilized to generate the textual representation at 406, and each is assigned a confidence score” (producing a score) “to narrow or select the proper ASR based on the determined score” (which reflects error rate of the “proper” (best) “ASR” “textual representation” (transcription) with respect to all the other remaining (selector) ASR engines; Col. 7 lines 18+: “confidence score may be based off of historical statistical information” and according to Col. 9 lines 19-22: “historical accuracy” is determined based on “speech recognition tool compri[sing] a neural network” (“score” is determined based on “neural network” (machine learning))).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the methods used in determining “confidence score” in Chopra et al. into the methods of obtaining “confidence score” in Zhou et al., which would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable the Zhou et al. technique in “increasing the ability of the device to operate using voice or speech input” based on “obtaining geographical information relating to” “the user or the user’s device” as discussed in Chopra et al. Col. 7 lines 45-50.

Regarding claim 16, Zhou et al. do teach the non-transitory computer readable medium of claim 13, wherein generating the features representing the transcriptions produced by the multiple ASR engines from voice activity in the recording and the best transcription of the recording produced by the ensemble model from the transcriptions comprises:
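generating a first set of features from the transcriptions (the “Levenshtein distance metric” according to ¶ 0054 line 4 item “2” “is at most the length of the longer string” (a first set of features); step “208” “CONFIDENCE SCORE FEATURES” (another first set of features));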
generating a second set of features from pairwise comparisons of the transcriptions (the “Levenshtein distance metric” according to ¶ 0054 lines 3-4 item “1” does also determine a “difference of the sizes of the two strings” (a second set of features); furthermore according to ¶ 0053 lines 13-14: “Levenshtein distance metric” “is” an “edit distance” (also second set of features) which according to ¶ 0056 last 7 lines “accurately reflect” “level of correctness” “which” “include” “range of errors that affect ranking score”; according to ¶ 0054 last sentence: “edit distance” “describe the differences between the training speech recognition results and the corresponding ground-truth training inputs”; ¶ 0054 lines 11-12: “Hamming distance, in turn, refers to a metric of the minimum number of substitutions required to change one string into the other” (another second set of features reflecting pairwise comparisons of the transcriptions));
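generating a third set of features from the best transcription (step “208” “CONFIDENCE SCORE FEATURE” helps “identif[y]” “candidate speech recognition result with the highest ranking score” (¶ 0047 last 5 lines) (best transcription attribute (third set of features)));
and generating a fourth set of features from the recording (¶ 0033 lines 1-2 and 8-9: “feature extractor” “generates” “a” “bag-of-words with decay feature” (fourth set of features generated) which is “based on the occurrence times and positions” (based on a position in the recording) “of the word within the result”; step “404” “TRIGGER PAIRS” and/or “WORD-LEVEL FEATURES” (another fourth set of features obtained from the audio)).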

Regarding claim 20, Zhou et al. do teach a system, comprising: a memory that stores instructions; and a processor that is coupled to the memory and, when executing the instructions (¶ 0023: “The controller 148 includes one or more integrated circuits configured as one or a combination of a central processing unit (CPU), graphical processing unit (GPU), microcontroller, field programmable gate array (FPGA), application specific integrated circuit (ASIC), digital signal processor (DSP), or any other suitable digital logic device. The controller 148 also includes a memory, such as a solid state or magnetic data storage device, that stores programmed instructions for operation of the in-vehicle information system 100”),
is configured to:
generate features representing transcriptions produced by multiple automatic
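speech recognition (ASR) engines from voice activity in the recording (according to Fig. 2 step “204”: “GENERATE MULTIPLE CANDIDATE SPEECH RECOGNITION RESULTS” (i.e., “text” (transcriptions produced)) “USING MULTIPLE SPEECH RECOGNITION ENGINES” (by multiple automatic speech recognition engines) “BASED ON RECORDED SPEECH INPUT” (from voice activity in recording); step “208”: “EXTRACT ONE OR MORE OF TRIGGER PART FEATURES, CONFIDENCE SCORE FEATURES” (generating features) “FROM EACH CANDIDATE SPEECH RECOGNITION RESULT” (representing the transcriptions); also according to step “164” there are “WORD-LEVEL FEATURES” (another feature), and according to ¶ 0053 lines 9+: “Levenshtein distance metric that quantifies the differences between the speech recognition result and the predetermined ground-truth speech input training data” (another feature generated); ¶ 0033 lines 1-2: “feature extractor” “generates” “a” “bag-of-words with decay feature” (another feature generated))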
and a best transcription of the recording produced by an ensemble model from the transcriptions (¶ 0046 last 5 lines: “The controller 148 identifies the candidate speech recognition result with the highest ranking score” (a best transcription of the recording) “based on the index of the output neuron” “that produces the highest ranking score within the neural network” (based on an ensemble model));
apply a machine learning model to the features to produce a score representing an accuracy of the best transcription (step “212”: “PROVIDE FEATURE VECTORS FOR MULTIPLE SPEECH RECOGNITION RESULTS AS INPUTS TO NEURAL NETWORK” (apply machine learning to the “FEATURE VECTORS” (the features)) “TO GENERATE RANKING SCORES” (to produce scores representing accuracy of the transcriptions and in particular the “highest ranking” (best) “result” (transcription))); and
store the score in association with the best transcription (step “412”: “STORE TRAINED NEURAL NETWORK STRUCTURE” (storing transcriptions including the best transcription) “AND FEATURE VECTOR STRUCTURE” (e.g. “CONFIDENCE SCORE FEATURES” (the scores)); e.g., ¶ 0052 lines 6-7: “memory” “stores” (storing) “data corresponding to training input data”, where according to ¶ 0053 lines 5+: “The training speech recognition result data also include confidence scores” (“training” “data” comprises “speech recognition result” (e.g. best transcription) as well as their associated “confidence scores” (and its score) and they are both “stor[ed]”)).

Zhou et al. do not specifically disclose:
Wherein applying the machine learning model to the features to produce the score representing the accuracy of the best transcription comprises:
Determining a predetermined number of the multiple ASR engines having historical performance in generating transcripts that is better than the historical performance of a remaining number of the multiple ASR engines under predetermined conditions as contributor ASR engines, the pre-determined conditions comprising a selected metadata associated with the recording,
Determining a remaining number of the multiple ASR engines as selector ASR engines, and
Applying the machine learning model to produce the score representing error rates between the best transcription and the transcriptions from the selector ASR engines.
Chopra et al. do teach:
Determining a predetermined number of the multiple ASR engines having historical performance in generating transcripts that is better than the historical performance of a remaining number of the multiple ASR engines under predetermined conditions as contributor ASR engines, the pre-determined conditions comprising a selected metadata associated with the recording (Col. 9 lines 16+: “selecting” “using a balancer, an automatic speech recognition tool from a plurality of automatic speech recognition tools” (determining one ASR engine among a multiple of ASR engines) “for the geographic location” (under a predetermined condition also by considering “dialect” (Col. 9 line 14 (another predetermined condition)) “based at least” “on a historical accuracy” (based on historical performance) “of the plurality of automatic speech recognition tools” (of a remaining number of the multiple ASR engines) “compris[ing] a neural network” (by applying machine learning) “that utilizes classification training” (comprising a selected metadata) “to improve the accuracy of the geographically dependent automatic speech recognition tool”),
Determining a remaining number of the multiple ASR engines as selector ASR engines, and Applying the machine learning model to produce the score representing error rates between the best transcription and the transcriptions from the selector ASR engines (Col. 7 lines 31+: “various ASR(s) may be utilized to generate the textual representation at 406, and each is assigned a confidence score” (producing a score) “to narrow or select the proper ASR based on the determined score” (which reflects error rate of the “proper” (best) “ASR” “textual representation” (transcription) with respect to all the other remaining (selector) ASR engines; Col. 7 lines 18+: “confidence score may be based off of historical statistical information” and according to Col. 9 lines 19-22: “historical accuracy” is determined based on “speech recognition tool compri[sing] a neural network” (“score” is determined based on “neural network” (machine learning))).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the methods used in determining “confidence score” in Chopra et al. into the methods of obtaining “confidence score” in Zhou et al., which would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable the Zhou et al. technique in “increasing the ability of the device to operate using voice or speech input” based on “obtaining geographical information relating to” “the user or the user’s device” as discussed in Chopra et al. Col. 7 lines 45-50.

Claims 2-4, 8, 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. in view of Chopra et al., and further in view of Kahn et al. (US 2006/0190249).
Regarding claim 2, Zhou et al. in view of Chopra et al. do not specifically disclose the method of claim 1, further comprising:
applying one or more thresholds to the score to characterize the accuracy of the best transcription;
and
determining a candidacy of the recording for human transcription based on the characterized accuracy of the best transcription.
Kahn et al. do teach:
applying one or more thresholds to the score to characterize the accuracy of the best transcription (¶ 0205 last sentence: “An example of a predetermined target accuracy is 95%” (one threshold used in assessing accuracy of transcriptions by either one of the “First SPEECH ENGINE 211” or “SECOND SPEECH ENGINE 213” (Fig. 1)); ¶ 0124 last sentence: “The verbatim text window is, by definition” “as being 100.00% accurate” (another threshold to assess transcription accuracy)); 
and
determining a candidacy of the recording for human transcription based on the characterized accuracy of the best transcription (¶ 0136 last 2 lines: “a correctionist” (a human transcriber) “can” “produce” “verbatim text” (to generate the “verbatim” which is “100.00% accurate” (therefore something which abides by this accuracy is candidate of a human transcription and supersedes the “95%” “accuracy” (best machine transcription))).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “verbatim” text usage in training the multiple speech recognizers of Kahn et al. into the multiple speech recognizers of Zhou et al. in view of Chopra et al., which would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable Zhou et al. in view of Chopra et al. to use “verbatim text” “to provide an avenue by which the correctionist may correct text for the purposes of training a speech engine” as disclosed in Kahn et al. ¶ 0123 sentence 2 in order to further aid it in its own multiple speech recognizer training.

Regarding claim 3, Zhou et al. do teach the method of claim 2, further comprising:
updating parameters of the ensemble model based on the training data (¶ 0056 sentence 2: “As is known to the art, stochastic gradient descent trainers include a class of related training processes that train a neural network” (using training data for) “in an iterative process by adjusting” (updating) “the parameters” (parameters) “within the neural network” (of the ensemble model)).
Zhou et al. in view of Chopra et al. do not specifically disclose:
generating training data for the ensemble model from the best transcription and the human transcription.
Kahn et al. do teach:
generating training data for the ensemble model from the best transcription and the human transcription (¶ 0067 sentence 2: “ The process 200 includes simultaneous use of graphical user interface (GUI) windows to create both a verbatim text” (using human transcription) “for speech engine training” (for generating training data along with) “and a final text” (a best transcription) “to be distributed as a document or report”).
For obviousness to combine Zhou et al. in view of Chopra et al. and Kahn et al. see claim 2.

Regarding claim 4, Zhou et al. in view of Chopra et al. do not specifically disclose the method of claim 2, wherein determining the candidacy of the recording for the human transcription based on the characterized accuracy of the best transcription comprises:
identifying the recording as a candidate for the human transcription when the score falls between a first threshold for a high error rate and a second threshold for a low error rate.
Kahn et al. do teach:
identifying the recording as a candidate for the human transcription when the score falls between a first threshold for a high error rate and a second threshold for a low error rate (¶ 0124 last sentence: “The verbatim text” (a candidate for human transcription) “window is, by definition” “as being 100.00% accurate” (abides by the threshold “100.00%” (a second threshold for low error rate) and is above the “95%” (first threshold for high error rate) as defined in ¶ 0205 last sentence: “An example of a predetermined target accuracy is 95%”)).
For obviousness to combine Zhou et al. in view of Chopra et al. and Kahn et al. see claim 2.
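For illustration only, the two-threshold test recited in claim 4 can be sketched in Python. The function name, the parameter names, and the numeric thresholds are hypothetical and are not taken from the claims or from Kahn et al.; the sketch assumes the score is an estimated error rate in which lower values indicate a more accurate best transcription.

```python
def is_candidate_for_human_transcription(
    score: float,
    low_error_threshold: float,
    high_error_threshold: float,
) -> bool:
    """Return True when the estimated error rate falls strictly between
    the threshold for a low error rate and the threshold for a high
    error rate. All names and values here are illustrative assumptions."""
    if low_error_threshold >= high_error_threshold:
        raise ValueError("low_error_threshold must be below high_error_threshold")
    return low_error_threshold < score < high_error_threshold


# Hypothetical thresholds: below 0.10 is treated as a low error rate,
# above 0.40 as a high error rate.
in_band = is_candidate_for_human_transcription(0.25, 0.10, 0.40)
too_accurate = is_candidate_for_human_transcription(0.05, 0.10, 0.40)
too_noisy = is_candidate_for_human_transcription(0.60, 0.10, 0.40)
```

Under these assumed values only the recording with a score of 0.25 falls inside the band and would be identified as a candidate.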

Regarding claim 8, Zhou et al. do teach the method of claim 5, wherein the third set of features comprises a second feature representing an attribute of the best transcription (step “208” “CONFIDENCE SCORE FEATURE” helps identify “candidate speech recognition result with the highest ranking score” (an attribute associated with the “highest ranking” (best) transcription)).

Kahn et al. do teach the third set of features comprises a first feature representing a pairwise comparison of the best transcription and each of the transcriptions (¶ 0158 last sentence: “the speech editor may instead find the "matches" and " differences"” (a pairwise comparison) “between a text generated by a single speech engine” (between each of the transcription engines “211” and/or “213” (Fig. 1)) “and the verbatim text” (and the best transcription) “produced by a human transcriptionist”).
For obviousness to combine Zhou et al. in view of Chopra et al. and Kahn et al. see claim 2.

Regarding claim 14, Zhou et al. in view of Chopra et al. do not specifically disclose the non-transitory computer readable medium of claim 13, wherein the steps further comprise:
applying one or more thresholds to the score to characterize the accuracy of the best transcription;
and
determining a candidacy of the recording for human transcription based on the characterized accuracy of the best transcription.

Kahn et al. do teach:
applying one or more thresholds to the score to characterize the accuracy of the best transcription (¶ 0205 last sentence: “An example of a predetermined target accuracy is 95%” (one threshold used in assessing accuracy of transcriptions by either one of the “First SPEECH ENGINE 211” or “SECOND SPEECH ENGINE 213” (Fig. 1)); ¶ 0124 last sentence: “The verbatim text window is, by definition” “as being 100.00% accurate” (another threshold to assess transcription accuracy)); 
and
determining a candidacy of the recording for human transcription based on the characterized accuracy of the best transcription (¶ 0136 last 2 lines: “a correctionist” (a human transcriber) “can” “produce” “verbatim text” (to generate the “verbatim” which is “100.00% accurate” (therefore something which abides by this accuracy is candidate of a human transcription and supersedes the “95%” “accuracy” (best machine transcription)))).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “verbatim” text usage in training the multiple speech recognizers of Kahn et al. into the multiple speech recognizers of Zhou et al. in the combination of Zhou et al. in view of Chopra et al., since doing so would enable the

Regarding claim 15, Zhou et al. do teach the non-transitory computer readable medium of claim 14, wherein the steps further comprise:
updating parameters of the ensemble model based on the training data (¶ 0056 sentence 2: “As is known to the art, stochastic gradient descent trainers include a class of related training processes that train a neural network” (using training data for) “in an iterative process by adjusting” (updating) “the parameters” (parameters) “within the neural network” (of the ensemble model)).
Zhou et al. in view of Chopra et al. do not specifically disclose:
generating training data for the ensemble model from the best transcription and the human transcription.
Kahn et al. do teach:
generating training data for the ensemble model from the best transcription and the human transcription (¶ 0067 sentence 2: “ The process 200 includes simultaneous use of graphical user interface (GUI) windows to create both a verbatim text” (using 
For obviousness to combine Zhou et al. in view of Chopra et al. and Kahn et al. see claim 14.

Claims 6, 9, 17, 19 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. in view of Chopra et al., and further in view of Suendermann-Oeft et al. (US Patent 10,937,444).
Regarding claim 6, Zhou et al. do teach the method of claim 5, wherein the first set of features comprises a length of a transcription, a confidence in the transcription (the “Levenshtein distance metric” (first set of features comprises) according to ¶ 0054 line 4 item “2” “is at most the length of the longer string” (a length of a transcription); step “208” “CONFIDENCE SCORE FEATURES” (and a confidence in transcription)).
Zhou et al. in view of Chopra et al. do not specifically disclose a feature associated with letters per second associated with the transcription.
Suendermann-Oeft et al. do teach a feature associated with letters per second associated with the transcription (in a system using “a plurality of ASR’s” using “neural network” (abstract), according to Col. 6 lines 60+: “Table 1” “examples include features
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “Example Features” of the “Plurality of ASR” system and method of Suendermann-Oeft et al. into the “MULTIPLE SPEECH RECOGNITION ENGINES” of Zhou et al. in the combination of Zhou et al. in view of Chopra et al., since doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable the speech recognition of Zhou et al. in view of Chopra et al. to benefit from these added features, including “number of words per second”, to assess “fluency” of speech that had been transcribed, as disclosed in Col. 6 line 60.
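For illustration only, the rate-type features discussed for claim 6 (a length of a transcription, letters per second, and the “number of words per second” cited from Suendermann-Oeft et al.) might be computed as in the sketch below. The function name, the dictionary keys, and the choice to count only alphabetic characters as letters are hypothetical assumptions and are not drawn from any cited reference.

```python
def rate_features(transcription, duration_seconds):
    """Compute simple length and rate features for one transcription.

    Assumptions (hypothetical): letters are alphabetic characters only, words
    are whitespace-separated tokens, and the duration of the voice activity
    is supplied by the caller in seconds.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    letters = sum(ch.isalpha() for ch in transcription)
    words = len(transcription.split())
    return {
        "length": len(transcription),
        "letters_per_second": letters / duration_seconds,
        "words_per_second": words / duration_seconds,
    }


if __name__ == "__main__":
    print(rate_features("hello world", 2.0))
```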

Regarding claim 9, Zhou et al. do teach the method of claim 5, wherein the fourth set of features comprises a position of the voice activity in the recording, and an audio feature (¶ 0033 lines 1-2 and 8-9: “feature extractor” “generates” “a” “bag-of-words with decay feature” (fourth set of features generated) which is “based on the occurrence times and positions” (based on a position in the recording) “of the word within the result”; step “404” “TRIGGER PAIR” and/or “WORD-LEVEL FEATURES” (and an audio feature)).

Suendermann-Oeft et al. do teach a feature associated with a duration of the voice activity (Col. 9 lines 51-54: “deploying a neural network acoustic model encoder implementing” “word duration” (a feature comprising a duration of the voice activity)).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “Example Features” of the “Plurality of ASR” system and method of Suendermann-Oeft et al. into the “MULTIPLE SPEECH RECOGNITION ENGINES” of Zhou et al. in the combination of Zhou et al. in view of Chopra et al., since doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable the speech recognition of Zhou et al. to benefit from these added features, including “average duration”, to assess “fluency” of speech that had been transcribed, as disclosed in Col. 6 lines 60-62.

Regarding claim 17, Zhou et al. do teach the non-transitory computer readable medium of claim 16, wherein the first set of features and third set of features comprise a length of a transcription, a confidence in the transcription (the “Levenshtein distance metric” (first set of features comprises) according to ¶ 0054 line 4 item “2” “is at most 
Zhou et al. in view of Chopra et al. do not specifically disclose a feature associated with letters per second associated with the transcription.
Suendermann-Oeft et al. do teach a feature associated with letters per second associated with the transcription (in a system using “a plurality of ASR’s” using “neural network” (abstract), according to Col. 6 lines 60+: “Table 1” “examples include features based on the number of words per second” (a feature based on number of letters per second)).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “Example Features” of the “Plurality of ASR” system and method of Suendermann-Oeft et al. into the “MULTIPLE SPEECH RECOGNITION ENGINES” of Zhou et al. in the combination of Zhou et al. in view of Chopra et al., since doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable the speech recognition of Zhou et al. in view of Chopra et al. to benefit from these added features, including “number of words per second”, to assess “fluency” of speech that had been transcribed, as disclosed in Col. 6 line 60.


Zhou et al. in view of Chopra et al. do not specifically disclose a feature associated with a duration of the voice activity.
Suendermann-Oeft et al. do teach a feature associated with a duration of the voice activity (Col. 9 lines 51-54: “deploying a neural network acoustic model encoder implementing” “word duration” (a feature comprising a duration of the voice activity)).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “Example Features” of the “Plurality of ASR” system and method of Suendermann-Oeft et al. into the “MULTIPLE SPEECH RECOGNITION ENGINES” of Zhou et al. in the combination of Zhou et al. in view of Chopra et al., since doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable the speech recognition of Zhou et al. in view of Chopra et al. to benefit from these added features.


Claims 7, 18 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. in view of Chopra et al., and further in view of Suzuki (US 2006/0247912).
Regarding claim 7, Zhou et al. do teach the method of claim 5, wherein the second set of features comprises a word error rate between two transcriptions, a difference in length between the two transcriptions, and an average difference in length across all pairs of transcriptions (the “Levenshtein distance metric” according to ¶ 0054 lines 3-4 item “1” does also determine a “difference in sizes of the two strings” (a difference in length between the two transcriptions); according to ¶ 0054 last sentence: “edit distance” “describe the differences between the training speech recognition results and the corresponding ground-truth training inputs” (an average difference in length across all pairs of transcriptions); ¶ 0054 lines 11-12: “Hamming distance, in turn, refers to a metric of the minimum number of substitutions required to change one string into the other” (a word error rate between two transcriptions)).
Zhou et al. in view of Chopra et al. do not specifically disclose a feature corresponding to an average word error rate across all pairs of transcriptions.

It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the techniques pertaining to determining “errors” by comparison of “errors” in one transcription with respect to those in a second transcription in a system with plurality of recognizers of Suzuki into the “MULTIPLE SPEECH RECOGNITION ENGINE” platform of Zhou et al. in the combination of Zhou et al. in view of Chopra et al., since doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable Zhou et al. in view of Chopra et al. to determine the “performance” of each recognizer with respect to another one, as disclosed in Suzuki claim 9 last limitation.
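For illustration only, the pairwise features discussed for claim 7 (a word error rate between two transcriptions and averages across all pairs of transcriptions) might be computed as in the sketch below. The sketch uses a word-level Levenshtein (edit) distance; it is not the method of Zhou et al. or Suzuki. All function names are hypothetical, and normalizing by the word count of the first transcription in each pair is an assumption.

```python
from itertools import combinations


def word_edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance computed one row at a time."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        cur = [i]
        for j, h in enumerate(hyp_words, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance divided by the reference word count (assumed convention)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return word_edit_distance(ref, hyp) / len(ref)


def pairwise_features(transcriptions):
    """Average WER and average character-length difference over all pairs."""
    pairs = list(combinations(transcriptions, 2))
    if not pairs:
        raise ValueError("need at least two transcriptions")
    wers = [word_error_rate(a, b) for a, b in pairs]
    diffs = [abs(len(a) - len(b)) for a, b in pairs]
    return {
        "average_wer": sum(wers) / len(wers),
        "average_length_difference": sum(diffs) / len(diffs),
    }


if __name__ == "__main__":
    print(pairwise_features(["the cat sat", "the cat sat", "the dog sat"]))
```

Because the rate is normalized by the first transcription of each pair, it is not symmetric when the two transcriptions differ in word count; a symmetric variant would be a different design choice.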


Zhou et al. in view of Chopra et al. do not specifically disclose a feature corresponding to an average word error rate across all pairs of transcriptions.
Suzuki does teach a feature corresponding to an average word error rate across all pairs of transcriptions (claim 9: “determining” “number of new errors being the number of errors in a first text” (errors in a first transcription) “that are not present in a second text” (not found in a second transcription); “determining” “number of corrected errors” “being the number of errors in the second text” (errors in the second transcription) “that are not present in the first text” (not found in the first transcription); 
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the techniques pertaining to determining “errors” by comparison of “errors” in one transcription with respect to those in a second transcription in a system with plurality of recognizers of Suzuki into the “MULTIPLE SPEECH RECOGNITION ENGINE” platform of Zhou et al. in the combination of Zhou et al. in view of Chopra et al., since doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable Zhou et al. in view of Chopra et al. to determine the “performance” of each

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. in view of Chopra et al. and Suendermann-Oeft et al., and further in view of Wang (US Patent 5,509,103).
Regarding claim 10, Zhou et al. in view of Chopra et al. and Suendermann-Oeft et al. do not specifically disclose the method of claim 9, wherein the audio feature comprises at least one of a mel-frequency cepstral coefficient (MFCC), a perceptual linear prediction (PLP), a root mean square (RMS), a zero crossing rate, a spectral flux, a spectral energy, a chroma vector, and a chroma deviation.

Wang does teach the method of claim 9, wherein the audio feature comprises at least one of a mel-frequency cepstral coefficient (MFCC), a perceptual linear prediction (PLP), a root mean square (RMS), a zero crossing rate, a spectral flux, a spectral energy, a chroma vector, and a chroma deviation (Col. 2 lines 57 and 65 respectively: in “a method of training a plurality of neural networks used in a speech-recognition system” it does “perform[] cepstral analysis of the digitized word” (generates cepstral coefficients or features, because according to Col. 4 line 18 “cepstral analysis” is defined as “feature extraction”)).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “cepstral analysis” feature of the “speech-recognition system” with “neural networks” of Wang into the “MULTIPLE SPEECH RECOGNITION ENGINES” of Zhou et al. in the combination of Zhou et al. in view of Chopra et al. and Suendermann-Oeft et al., since doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable Zhou et al. in view of Chopra et al. and Suendermann-Oeft et al. to obtain “results in a representation of the signal which characterizes the relevant features of the spoken speech” as disclosed in Wang Col. 4 lines 18-21.
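For illustration only, two of the simpler audio features enumerated in claim 10 (a root mean square and a zero crossing rate) might be computed over a frame of samples as in the sketch below. This is not the cepstral analysis of Wang; the function names and the convention of treating a zero-valued sample as non-negative are hypothetical assumptions.

```python
import math


def root_mean_square(samples):
    """Root mean square amplitude of one frame of audio samples."""
    if not samples:
        raise ValueError("frame must contain at least one sample")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ.

    Assumption: a sample equal to zero is treated as non-negative.
    """
    if len(samples) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)


if __name__ == "__main__":
    frame = [1.0, -1.0, 1.0, -1.0]
    print(root_mean_square(frame), zero_crossing_rate(frame))
```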

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Zhou et al. in view of Chopra et al., and further in view of SUGITANI et al. (US 2017/0140752).
Regarding claim 11, Zhou et al. in view of Chopra et al. do not specifically disclose the method of claim 5, wherein the second set of features comprises a fixed-size encoding of per-character differences between two transcriptions.
SUGITANI et al. do teach the method of claim 5, wherein the second set of features comprises a fixed-size encoding of per-character differences between two transcriptions (in a system comprising a plurality of “VOICE RECOGNITION UNIT[S]” (Fig. 1), according to ¶ 0048 lines 4+: “calculates as the above index an order distance indicating a degree of difference in an order of candidate character strings” (per-character differences between transcriptions associated with units “11” and “12” are determined) “aligned in order of the score values obtained by the first and second voice recognition units 11 and 12”).
It would have therefore been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the “distance” calculations involving “difference in” “character strings” of the plural “VOICE RECOGNITION UNIT[s]” of SUGITANI et al. into the “MULTIPLE SPEECH RECOGNITION ENGINES” of Zhou et al. in the combination of Zhou et al. in view of Chopra et al., since doing so would enable the combined systems and their associated methods to perform in combination as they do separately and would further enable determining a “score value indicating accuracy of said candidate character strings, .
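For illustration only, one way to picture a “fixed-size encoding of per-character differences between two transcriptions” is a vector of character counts per edit operation, which has the same length regardless of the transcription lengths. The sketch below uses Python's standard `difflib.SequenceMatcher`; it is not the “order distance” of SUGITANI et al., and the function name and the four-element layout are hypothetical assumptions.

```python
from difflib import SequenceMatcher

# Fixed order of operations; the encoding always has exactly four entries.
OPS = ("equal", "replace", "delete", "insert")


def char_diff_encoding(a, b):
    """Count characters covered by each edit operation between a and b.

    Returns [equal, replace, delete, insert]. For a replace spanning
    different lengths, the longer side is counted (assumed convention).
    """
    counts = dict.fromkeys(OPS, 0)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        counts[tag] += max(i2 - i1, j2 - j1)
    return [counts[op] for op in OPS]


if __name__ == "__main__":
    print(char_diff_encoding("cat", "cut"))
```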

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a et seq. for applications not subject to examination under the first inventor to file provisions of the AIA . A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b). 
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1, 12, 13, 20 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 12, 13, 20 of copending Application No. 16/056,298 (reference application). Although the claims at issue are not identical, they are not patentably distinct from each other because:
16,297,602
1. A method for analyzing a transcription of a recording, comprising:
generating features representing transcriptions produced by multiple automatic speech recognition (ASR) engines from voice activity in a recording;
generating a best transcription of the recording produced by an ensemble model from the transcriptions;
applying a machine learning model to the features to produce a score representing an accuracy of the best transcription;
wherein applying the machine learning model to the features to produce the score representing the accuracy of the best transcription comprises:
determining a predetermined number of the multiple ASR engines having historical performance in generating transcripts that is better than the historical performance of a remaining number of the multiple ASR engines under predetermined conditions as contributor ASR engines, the pre-determined conditions comprising a selected metadata associated with the recording,
determining a remaining number of the multiple ASR engines as selector ASR engines, and
applying the machine learning model to produce the score representing error rates between the best transcription and the transcriptions from the selector ASR engines.
and storing the score in association with the best transcription.

Copending Application No. 16/056,298
generating input to a machine learning model from snippets of voice activity in the recording and transcriptions produced by multiple automatic speech recognition (ASR) engines from the recording; and for each snippet in the snippets:
determining contributor ASR engines and selector ASR engines from the multiple ASR engines, wherein determining the contributor ASR engines and the selector ASR engines from the multiple ASR engines comprises:
determining a predetermined number of the multiple ASR engines having historical performance in generating transcriptions that is better than the historical performance of a remaining number of the multiple ASR engines under predetermined conditions as the contributor ASR engines, the pre-determined conditions comprising a selected metadata associated with the recording, and
determining the remaining number of the multiple ASR engines as the contributor ASR engines;
applying the machine learning model to the input to select, based on transcriptions of the snippet produced by at least one contributor ASR engine of the contributor ASR engines, one or more transcriptions generated by the at least one contributor ASR engine of the snippet from possible transcriptions of
measuring, using one or more transcriptions generated by the at least one selector ASR engine, an accuracy of the one or more transcriptions generated by the at least one contributor ASR engine, and wherein the machine learning model is trained based on a first historical performance of the at least one selector ASR engine; and
storing, based on the measured accuracy, one of the one or more transcriptions generated by the at least one contributor ASR engine as a best transcription in association with the snippet.

In re Karlson, 136 USPQ 184: .

This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 





Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARZAD KAZEMINEZHAD whose telephone number is (571)270-5860.  The examiner can normally be reached on 10:30 am to 11:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, DANIEL C WASHBURN can be reached on (571)272-5551.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you 






/Farzad Kazeminezhad/
Art Unit 2657
September 15th 2021.