DETAILED ACTION
This communication is in response to the Application filed on 01/20/2021. Claims 1-9 are pending and have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 05/27/2021 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP §§ 706.02(l)(1) - 706.02(l)(3) for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1-9 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-18 of U.S. Patent No. 10,147,428. Although the claims at issue are not identical, they are not patentably distinct from each other because the claims of the issued patent are directed towards a system and method and the instant application is directed towards a computer readable medium (CRM). Hence, it would have been obvious to one skilled in the art to have used a CRM in the issued patent in order to store and execute a program. Please see the claim mapping below, where the bolded limitations indicate the identical limitations in the issued patent.
With respect to the other claims, each claim maps to the issued patent claim in the following manner, where (I) represents the instant application and (P) represents the issued patent:
Claim 1 (I): Claim 1 (P); Claim 2 (I): Claim 2 (P); Claim 3 (I): Claim 3 (P); Claim 4 (I): Claim 4 (P); Claim 5 (I): Claim 5 (P); Claim 6 (I): Claim 6 (P); Claim 7 (I): Claim 7 (P); Claim 8 (I): Claim 8 (P); Claim 9 (I): Claim 9 (P).

Instant Application: 17/153575
Issued Patent: 10,147,428
Claim 1 (I): A non-transitory computer readable medium comprising instructions that, when executed by a processor in a first computing device of a plurality of computing devices, direct the processor to:

generate at least one speech recognition model specification for a plurality of distinct speech-to-text transcription engines; 



wherein each distinct speech-to-text transcription engine corresponds to a respective distinct speech recognition model; 

wherein, for each distinct speech-to-text transcription engine, the at least one speech recognition model specification at least identifies: 

i) a respective value for at least one pre-transcription evaluation parameter, and 

ii) a respective value for at least one post-transcription evaluation parameter; 

wherein the generating of the at least one speech recognition model specification for the plurality of distinct speech-to-text transcription engines comprises: 

receive at least one training audio recording and at least one truth transcript of the at least one training audio recording;


 segment the at least one training audio recording into a plurality of training audio segments and the at least one truth transcript into a plurality of corresponding truth training segment transcripts; 



apply at least one pre-transcription audio classifier to each training audio segment of the plurality of training audio segments to generate first metadata classifying each training audio segment based at least on: 



i) language, 
ii) audio quality, and 
iii) accent; 

apply at least one text classifier to each corresponding truth training segment transcript of the plurality of corresponding truth training segment transcripts to generate second metadata classifying each corresponding truth training segment transcript based at least on at least one content category; 


combine the plurality of training audio segments, the plurality of corresponding truth training segment transcripts, the first metadata, and the second metadata to form at least one benchmark set; 

test each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines based on the at least one benchmark set to form a plurality of model result sets;


wherein each model result set of the plurality of model result sets corresponds to the respective distinct speech-to-text transcription engine; 

wherein each model result set of the plurality of model result sets comprises:
i) the at least one benchmark set,
ii) at least one model-specific training hypothesis for each training audio segment,
iii) at least one confidence value associated with the at least one model-specific training hypothesis, and
iv) at least one word error rate (WER) associated with the at least one model-specific training hypothesis;

determine for each model result set of the plurality of model result sets, a respective set of transcription decisions for each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines, wherein the respective set of transcription decisions defines, for each distinct speech-to-text transcription engine, the value of the at least one pre-transcription evaluation parameter and the value of the at least one post-transcription evaluation parameter; 

and combine, for each model result set of the plurality of model result sets, each respective set of transcription decisions for each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines into the at least one speech recognition model specification for the plurality of distinct speech-to-text transcription engines; 

receive at least one audio recording representing at least one speech of at least one person; 


segment the at least one audio recording into a plurality of audio segments; 



wherein each audio segment corresponds to a respective single phrase of a respective single person that has been bounded by points of silence in the at least one audio recording; 

determine, based on the respective value of the at least one pre-transcription evaluation parameter of the respective distinct speech recognition model in the at least one speech recognition model specification, a respective distinct speech-to-text transcription engine from the plurality of distinct speech-to-text transcription engines to be utilized to transcribe a respective audio segment of the plurality of audio segments; 


submit the respective audio segment to the respective distinct speech-to-text transcription engine; 


receive from the respective distinct speech-to-text transcription engine, at least one hypothesis for the respective audio segment; 


accept the at least one hypothesis for the respective audio segment based on the respective value of the at least one post-transcription evaluation parameter of the respective distinct speech recognition model in the at least one speech recognition model specification to obtain a respective accepted hypothesis for the respective audio segment of the plurality of audio segments of the at least one audio recording; 



wherein the accepting of the at least one hypothesis for each respective audio segment as the respective accepted hypothesis for the respective audio segment removes a need to submit the respective audio segment to another distinct speech-to-text transcription engine from the plurality of distinct speech-to-text transcription engines resulting in the improved computer speed and the accuracy of automatic speech transcription; 

generate at least one transcript of the at least one audio recording from respective accepted hypotheses for the plurality of audio segments; and 


output the at least one transcript of the at least one audio recording.
Claim 1 (P): A computer-implemented method for improving computer speed and accuracy of automatic speech transcription, comprising:



generating, by at least one processor, at least one speech recognition model specification for a plurality of distinct speech-to-text transcription engines;



wherein each distinct speech-to-text transcription engine corresponds to a respective distinct speech recognition model;

wherein, for each distinct speech-to-text transcription engine, the at least one speech recognition model specification at least identifies:

i) a respective value for at least one pre-transcription evaluation parameter, and

ii) a respective value for at least one post-transcription evaluation parameter;
  
wherein the generating the at least one speech recognition model specification comprises:



receiving, by the at least one processor, at least one training audio recording and at least one truth transcript of the at least one training audio recording;

segmenting, by the at least one processor, the at least one training audio recording into a plurality of training audio segments and the at least one truth transcript into a plurality of corresponding truth training segment transcripts;


applying, by the at least one processor, at least one pre-transcription audio classifier to each training audio segment of the plurality of training audio segments to generate first metadata classifying each training audio segment based at least on:


i) language,
ii) audio quality, and
iii) accent;

applying, by the at least one processor, at least one text classifier to each corresponding truth training segment transcript of the plurality of corresponding truth training segment transcripts to generate second metadata classifying each corresponding truth training segment transcript based at least on at least one content category; 

combining, by the at least one processor, the plurality of training audio segments, the plurality of corresponding truth training segment transcripts, the first metadata, and the second metadata to form at least one benchmark set;
testing, by the at least one processor, each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines based on the at least one benchmark set to form a plurality of model result sets; 

wherein each model result set corresponds to a respective distinct speech-to-text transcription engine;


wherein each model result set comprises:
i) the at least one benchmark set,
ii) at least one model-specific training hypothesis for each training audio segment, 
iii) at least one confidence value associated with the at least one model-specific training hypothesis, and
iv) at least one word error rate (WER) associated with the at least one model-specific training hypothesis;


determining, by the at least one processor, a respective set of transcription decisions for each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines, wherein the respective set of transcription decisions defines, for each distinct speech-to-text transcription engine, the value of the at least one pre-transcription evaluation parameter and the value of the at least one post-transcription evaluation parameter; and

combining, by the at least one processor, each respective set of transcription decisions for each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines into the at least one speech recognition model specification for the plurality of distinct speech-to-text transcription engines;


receiving, by the at least one processor, at least one audio recording representing at least one speech of at least one person;

segmenting, by the at least one processor, the at least one audio recording into a plurality of audio segments;


wherein in each audio segment corresponds to a respective single phrase of a respective single person that has been bounded by points of silence in the at least one audio recording; 

determining, by the at least one processor, based on the respective value of the at least one pre-transcription evaluation parameter of the respective distinct speech recognition model in the at least one speech recognition model specification, a respective distinct speech-to-text transcription engine from the plurality of distinct speech-to-text transcription engines to be utilized to transcribe a respective audio segment of the plurality of audio segments;
 
submitting, by the at least one processor, the respective audio segment to the respective distinct speech-to-text transcription engine;

receiving,  by the at least one processor, from the respective distinct speech-to-text transcription engine, at least one hypothesis for the respective audio segment;

accepting, by the at least one processor, the at least one hypothesis for the respective audio segment based on the respective value of the at least one post-transcription evaluation parameter of the respective distinct speech recognition model in the at least one speech recognition model specification to obtain a respective accepted hypothesis for the respective audio segment of the plurality of audio segments of the at least one audio recording;


wherein the accepting of the at least one hypothesis for each respective audio segment as the respective accepted hypothesis for the respective audio segment removes a need to submit the respective audio segment to another distinct speech-to-text transcription engine from the plurality of distinct speech-to-text transcription engines resulting in the improved computer speed and the accuracy of automatic speech transcription; 

generating, by the at least one processor, at least one transcript of the at least one audio recording from respective accepted hypotheses for the plurality of audio segments; and

outputting, by the at least one processor, the at least one transcript of the at least one audio recording.
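
For illustration only, and not as part of the claim mapping, the run-time limitations shared by both claims (determining an engine from the pre-transcription evaluation parameter, then accepting a hypothesis from the post-transcription evaluation parameter) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the names `EngineSpec`, `select_engine`, and `transcribe` are hypothetical, a supported-language set stands in for the pre-transcription parameter, and a confidence threshold stands in for the post-transcription parameter. Neither the claims nor the cited patent prescribes this data layout.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class EngineSpec:
    """Hypothetical per-engine entry of a speech recognition model specification."""
    name: str
    supported_languages: Set[str]  # assumed stand-in for the pre-transcription parameter
    min_confidence: float          # assumed stand-in for the post-transcription parameter


# An engine is modeled as a callable returning (hypothesis text, confidence).
Engine = Callable[[dict], Tuple[str, float]]


def select_engine(segment: dict, specs: List[EngineSpec]) -> Optional[EngineSpec]:
    """Pick the first engine whose pre-transcription parameter matches the segment."""
    for spec in specs:
        if segment["language"] in spec.supported_languages:
            return spec
    return None


def transcribe(segments: List[dict], specs: List[EngineSpec],
               engines: Dict[str, Engine]) -> str:
    """Route each segment to one engine and keep only accepted hypotheses."""
    accepted = []
    for segment in segments:
        spec = select_engine(segment, specs)
        if spec is None:
            continue  # no engine qualifies for this segment
        text, confidence = engines[spec.name](segment)
        # Accepting here means the segment is not sent to any other engine.
        if confidence >= spec.min_confidence:
            accepted.append(text)
    return " ".join(accepted)


specs = [EngineSpec("engine_a", {"en"}, 0.8), EngineSpec("engine_b", {"es"}, 0.6)]
engines = {
    "engine_a": lambda seg: ("hello world", 0.9),
    "engine_b": lambda seg: ("hola", 0.5),
}
segments = [{"language": "en"}, {"language": "es"}]
transcript = transcribe(segments, specs, engines)
```

In this toy run the English segment is accepted and the Spanish segment falls below its engine's threshold, so only the first hypothesis reaches the transcript. What happens to a rejected segment (for example, fallback to another engine) is deliberately left out of the sketch.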



Claims 1-9 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-9 of U.S. Patent No. 10,930,287. Although the claims at issue are not identical, they are not patentably distinct from each other because the claims of the issued patent are directed towards a system and method and the instant application is directed towards a computer readable medium (CRM). Hence, it would have been obvious to one skilled in the art to have used a CRM in the issued patent in order to store and execute a program. Please see the claim mapping below, where the bolded limitations indicate the identical limitations in the issued patent.
With respect to the other claims, each claim maps to the issued patent claim in the following manner, where (I) represents the instant application and (P) represents the issued patent:
Claim 1 (I): Claim 1 (P); Claim 2 (I): Claim 2 (P); Claim 3 (I): Claim 3 (P); Claim 4 (I): Claim 4 (P); Claim 5 (I): Claim 5 (P); Claim 6 (I): Claim 6 (P); Claim 7 (I): Claim 7 (P); Claim 8 (I): Claim 8 (P); Claim 9 (I): Claim 9 (P).

Instant Application: 17/153575
Issued Patent: 10,930,287
Claim 1 (I): A non-transitory computer readable medium comprising instructions that, when executed by a processor in a first computing device of a plurality of computing devices, direct the processor to:

generate at least one speech recognition model specification for a plurality of distinct speech-to-text transcription engines; 



wherein each distinct speech-to-text transcription engine corresponds to a respective distinct speech recognition model; 

wherein, for each distinct speech-to-text transcription engine, the at least one speech recognition model specification at least identifies: 

i) a respective value for at least one pre-transcription evaluation parameter, and 

ii) a respective value for at least one post-transcription evaluation parameter; 

wherein the generating of the at least one speech recognition model specification for the plurality of distinct speech-to-text transcription engines comprises: 

receive at least one training audio recording and at least one truth transcript of the at least one training audio recording;


 segment the at least one training audio recording into a plurality of training audio segments and the at least one truth transcript into a plurality of corresponding truth training segment transcripts; 



apply at least one pre-transcription audio classifier to each training audio segment of the plurality of training audio segments to generate first metadata classifying each training audio segment based at least on: 



i) language, 
ii) audio quality, and 
iii) accent; 

apply at least one text classifier to each corresponding truth training segment transcript of the plurality of corresponding truth training segment transcripts to generate second metadata classifying each corresponding truth training segment transcript based at least on at least one content category; 


combine the plurality of training audio segments, the plurality of corresponding truth training segment transcripts, the first metadata, and the second metadata to form at least one benchmark set; 



test each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines based on the at least one benchmark set to form a plurality of model result sets;




wherein each model result set of the plurality of model result sets corresponds to the respective distinct speech-to-text transcription engine; 

wherein each model result set of the plurality of model result sets comprises:
i) the at least one benchmark set,
ii) at least one model-specific training hypothesis for each training audio segment,
iii) at least one confidence value associated with the at least one model-specific training hypothesis, and
iv) at least one word error rate (WER) associated with the at least one model-specific training hypothesis;

determine for each model result set of the plurality of model result sets, a respective set of transcription decisions for each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines, wherein the respective set of transcription decisions defines, for each distinct speech-to-text transcription engine, the value of the at least one pre-transcription evaluation parameter and the value of the at least one post-transcription evaluation parameter; 



and combine, for each model result set of the plurality of model result sets, each respective set of transcription decisions for each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines into the at least one speech recognition model specification for the plurality of distinct speech-to-text transcription engines; 

receive at least one audio recording representing at least one speech of at least one person; 


segment the at least one audio recording into a plurality of audio segments; 



wherein each audio segment corresponds to a respective single phrase of a respective single person that has been bounded by points of silence in the at least one audio recording; 

determine, based on the respective value of the at least one pre-transcription evaluation parameter of the respective distinct speech recognition model in the at least one speech recognition model specification, a respective distinct speech-to-text transcription engine from the plurality of distinct speech-to-text transcription engines to be utilized to transcribe a respective audio segment of the plurality of audio segments; 


submit the respective audio segment to the respective distinct speech-to-text transcription engine; 


receive from the respective distinct speech-to-text transcription engine, at least one hypothesis for the respective audio segment; 


accept the at least one hypothesis for the respective audio segment based on the respective value of the at least one post-transcription evaluation parameter of the respective distinct speech recognition model in the at least one speech recognition model specification to obtain a respective accepted hypothesis for the respective audio segment of the plurality of audio segments of the at least one audio recording; 



wherein the accepting of the at least one hypothesis for each respective audio segment as the respective accepted hypothesis for the respective audio segment removes a need to submit the respective audio segment to another distinct speech-to-text transcription engine from the plurality of distinct speech-to-text transcription engines resulting in the improved computer speed and the accuracy of automatic speech transcription; 

generate at least one transcript of the at least one audio recording from respective accepted hypotheses for the plurality of audio segments; and 


output the at least one transcript of the at least one audio recording.
Claim 1 (P): A computer-implemented method for improving computer speed and accuracy of automatic speech transcription, comprising:



generating, by at least one processor, at least one speech recognition model specification for a plurality of distinct speech-to-text transcription engines; 



wherein each distinct speech-to-text transcription engine corresponds to a respective distinct speech recognition model; 

wherein, for each distinct speech-to-text transcription engine, the at least one speech recognition model specification at least identifies: 

i) a respective value for at least one pre-transcription evaluation parameter, and 

ii) a respective value for at least one post-transcription evaluation parameter; 

wherein the generating of the at least one speech recognition model specification for the plurality of distinct speech-to-text transcription engines comprises: 

receiving, by the at least one processor, at least one training audio recording and at least one truth transcript of the at least one training audio recording; 

segmenting, by the at least one processor, the at least one training audio recording into a plurality of training audio segments and the at least one truth transcript into a plurality of corresponding truth training segment transcripts; 


applying, by the at least one processor, at least one pre-transcription audio classifier to each training audio segment of the plurality of training audio segments to generate first metadata classifying each training audio segment based at least on: 


i) language, 
ii) audio quality, and 
iii) accent; 

applying, by the at least one processor, at least one text classifier to each corresponding truth training segment transcript of the plurality of corresponding truth training segment transcripts to generate second metadata classifying each corresponding truth training segment transcript based at least on at least one content category; 

combining, by the at least one processor, the plurality of training audio segments, the plurality of corresponding truth training segment transcripts, the first metadata, and the second metadata to form at least one benchmark set; 


testing, by the at least one processor, each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines based on the at least one benchmark set to form a plurality of model result sets; 



wherein each model result set of the plurality of model result sets corresponds to the respective distinct speech-to-text transcription engine; 

wherein each model result set of the plurality of model result sets comprises: 
i) the at least one benchmark set,
ii) at least one model-specific training hypothesis for each training audio segment,
iii) at least one confidence value associated with the at least one model-specific training hypothesis, and 
iv) at least one word error rate (WER) associated with the at least one model-specific training hypothesis; 

determining, by the at least one processor, for each model result set of the plurality of model result sets, a respective set of transcription decisions for each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines, wherein the respective set of transcription decisions defines, for each distinct speech-to-text transcription engine, the value of the at least one pre-transcription evaluation parameter and the value of the at least one post-transcription evaluation parameter; and 

combining, by the at least one processor, for each model result set of the plurality of model result sets, each respective set of transcription decisions for each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines into the at least one speech recognition model specification for the plurality of distinct speech-to-text transcription engines; 
receiving, by the at least one processor, at least one audio recording representing at least one speech of at least one person; 

segmenting, by the at least one processor, the at least one audio recording into a plurality of audio segments; 


wherein in each audio segment corresponds to a respective single phrase of a respective single person that has been bounded by points of silence in the at least one audio recording; 

determining, by the at least one processor, based on the respective value of the at least one pre-transcription evaluation parameter of the respective distinct speech recognition model in the at least one speech recognition model specification, a respective distinct speech-to-text transcription engine from the plurality of distinct speech-to-text transcription engines to be utilized to transcribe a respective audio segment of the plurality of audio segments; 

submitting, by the at least one processor, the respective audio segment to the respective distinct speech-to-text transcription engine; 

receiving, by the at least one processor, from the respective distinct speech-to-text transcription engine, at least one hypothesis for the respective audio segment; 

accepting, by the at least one processor, the at least one hypothesis for the respective audio segment based on the respective value of the at least one post-transcription evaluation parameter of the respective distinct speech recognition model in the at least one speech recognition model specification to obtain a respective accepted hypothesis for the respective audio segment of the plurality of audio segments of the at least one audio recording; 


wherein the accepting of the at least one hypothesis for each respective audio segment as the respective accepted hypothesis for the respective audio segment removes a need to submit the respective audio segment to another distinct speech-to-text transcription engine from the plurality of distinct speech-to-text transcription engines resulting in the improved computer speed and the accuracy of automatic speech transcription; 

generating, by the at least one processor, at least one transcript of the at least one audio recording from respective accepted hypotheses for the plurality of audio segments; and 

outputting, by the at least one processor, the at least one transcript of the at least one audio recording.
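
For illustration only, the model result set recited in both claims (a model-specific training hypothesis, a confidence value, and a word error rate per training audio segment) can be sketched as follows. The sketch assumes the conventional definition of WER as the word-level edit distance between the truth transcript and the hypothesis, divided by the number of words in the truth transcript. The function names and the dictionary layout are hypothetical and are not taken from the application or the issued patent.

```python
from typing import Callable, Dict, List, Tuple


def word_error_rate(truth: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of truth words."""
    ref = truth.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("truth transcript must contain at least one word")
    # prev[j] holds the distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def build_model_result_set(
    benchmark: List[Dict],
    engine: Callable[[bytes], Tuple[str, float]],
) -> List[Dict]:
    """Run one engine over a benchmark set and record hypothesis, confidence, and WER."""
    results = []
    for item in benchmark:
        text, confidence = engine(item["audio"])
        results.append({
            "truth": item["truth"],
            "hypothesis": text,
            "confidence": confidence,
            "wer": word_error_rate(item["truth"], text),
        })
    return results


benchmark = [{"audio": b"", "truth": "hello world"}]
result_set = build_model_result_set(benchmark, lambda audio: ("hello word", 0.7))
```

Here the single benchmark item has one substituted word out of two, so its recorded WER is 0.5. The first and second metadata recited in the claims are omitted from this sketch for brevity.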



Allowable Subject Matter
Claims 1-9 would be allowable if rewritten or amended to overcome the rejection(s) under Double Patenting set forth in this Office action.
The following is a statement of reasons for the indication of allowable subject matter:
The closest prior art of record, Cook, teaches a computer-implemented method for improving computer speed and accuracy of automatic speech transcription, comprising: generating, by at least one processor (see Figure 4, processors 402-1-402-N), at least one speech recognition model specification for a plurality of distinct speech-to-text transcription engines (see [0015], where the speech to text engines 120 each have acoustic and language models, [0019]-[0020], where acoustic models are trained and [0023], where language models are trained in order to generate the speech to text engine); wherein each distinct speech-to-text transcription engine corresponds to a respective distinct speech recognition model (see [0015], where each of the speech to text engines can operate at varying speeds and varying levels of accuracy and employ assorted models); wherein, for each distinct speech-to-text transcription engine, the at least one speech recognition model specification at least identifies: i) a respective value for at least one pre-transcription evaluation parameter (see [0039], where only a portion of the speech to text engines are selected based on being at or below max processing power requirement, speed or threshold), and ii) a respective value for at least one post-transcription evaluation parameter (see [0026], where speech to text engines 120 are selected based on accuracy and see [0040], where the most accurate engine is selected); receiving, by the at least one processor, at least one audio recording representing at least one speech of at least one person (see [0016], where speech data 105 is received and see [0012], audible input 101 which may be from conversations, dictated speech, or a television show); segmenting, by the at least one processor, the at least one audio recording into a plurality of audio segments (see [0016], where speech data 105 is segmented into smaller portions such as words and phrases); determining, by the at least one processor, based on the respective value of the at least one pre-transcription evaluation parameter of the respective distinct speech recognition model in the at least one speech recognition model specification, a respective distinct speech-to-text transcription engine from the plurality of distinct speech-to-text transcription engines to be utilized to transcribe a respective audio segment of the plurality of audio segments (see [0039], where a portion of the speech to text engines are selected based on the criteria of max processing power, speed and threshold);
submitting, by the at least one processor, the respective audio segment to the respective distinct speech-to-text transcription engine (see [0039], where the speech data 105 are provided to the selected speech to text engines); receiving, by the at least one processor, from the respective distinct speech-to-text transcription engine, at least one hypothesis for the respective audio segment (see [0040], where the samples of decoded speech are received and compared to each other); accepting, by the at least one processor, the at least one hypothesis for the respective audio segment based on the respective value of the at least one post-transcription evaluation parameter of the respective distinct speech recognition model in the at least one speech recognition model specification to obtain a respective accepted hypothesis for the respective audio segment of the plurality of audio segments of the at least one audio recording (see [0040], where the accuracy of the samples of decoded speech is compared and the speech to text engine which is the most accurate is selected along with the hypothesis);
wherein the accepting of the at least one hypothesis for each respective audio segment as the respective accepted hypothesis for the respective audio segment removes a need to submit the respective audio segment to another distinct speech-to-text transcription engine from the plurality of distinct speech-to-text transcription engines resulting in the improved computer speed and the accuracy of automatic speech transcription (see [0041], where the speech recognition engine determined to be most accurate is selected and decodes the next portion of speech data 105, which continues until the system instructs it to stop); generating, by the at least one processor, at least one transcript of the at least one audio recording from respective accepted hypotheses for the plurality of audio segments (see [0036], where text data 115 is output by the speech to text engines after decoding the speech data); and outputting, by the at least one processor, the at least one transcript of the at least one audio recording (see [0036], where the text data 115 is output via a display).
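For illustration only, the select, submit, and accept flow recited above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Cook or of the applicant: the names `EngineSpec`, `max_noise`, `min_confidence`, and `transcribe_segment` are hypothetical, a single noise score stands in for the pre-transcription evaluation parameter, and a single confidence threshold stands in for the post-transcription evaluation parameter.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class EngineSpec:
    """Hypothetical per-engine entry of a model specification."""
    name: str
    max_noise: float        # assumed pre-transcription parameter
    min_confidence: float   # assumed post-transcription parameter
    # Engine call: takes a segment id, returns (hypothesis, confidence).
    transcribe: Callable[[str], Tuple[str, float]]


def transcribe_segment(segment_id: str, noise: float,
                       specs: List[EngineSpec]) -> Optional[str]:
    """Try engines in order; stop at the first accepted hypothesis."""
    for spec in specs:
        # Pre-transcription check: skip engines not suited to this segment.
        if noise > spec.max_noise:
            continue
        hypothesis, confidence = spec.transcribe(segment_id)
        # Post-transcription check: accept, so no further engine is called.
        if confidence >= spec.min_confidence:
            return hypothesis
    return None  # no engine produced an acceptable hypothesis
```

In this sketch, accepting a hypothesis returns immediately, which is what avoids submitting the same segment to another engine.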
However, Cook does not specifically teach wherein each audio segment corresponds to a respective single phrase of a respective single person that has been bounded by points of silence in the at least one audio recording.
Bhardwaj does teach wherein each audio segment corresponds to a respective single phrase of a respective single person that has been bounded by points of silence in the at least one audio recording (see col. 5, lines 55-67, where the audio is segmented based on relative silence between audio segments with respect to the speaker speaking the speech (see col. 5, lines 35-36)).
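For illustration only, segmentation bounded by points of silence can be sketched as below. This is a minimal sketch under stated assumptions, not Bhardwaj's method: the function name `split_on_silence`, the per-frame energy input, and the `threshold` and `min_silence` parameters are all hypothetical, and speaker attribution is not modeled.

```python
from typing import List, Tuple


def split_on_silence(energies: List[float], threshold: float,
                     min_silence: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges of speech bounded by silence.

    A frame is treated as silent when its energy is below `threshold`;
    a run of at least `min_silence` silent frames closes a segment.
    """
    segments = []
    start = None  # first frame of the current speech run, if any
    quiet = 0     # consecutive silent frames since the last loud frame
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence:
                # End the segment just after the last loud frame.
                segments.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        # Close a segment still open at the end of the recording.
        segments.append((start, len(energies) - quiet))
    return segments
```

Short pauses below `min_silence` frames stay inside a segment, so a brief hesitation within a phrase does not split it.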
However, none of the cited references, either alone or in combination, teaches or makes obvious the limitations of “wherein the generating of the at least one speech recognition model specification for the plurality of distinct speech-to-text transcription engines comprises: receiving, by the at least one processor, at least one training audio recording and at least one truth transcript of the at least one training audio recording; segmenting, by the at least one processor, the at least one training audio recording into a plurality of training audio segments and the at least one truth transcript into a plurality of corresponding truth training segment transcripts; applying, by the at least one processor, at least one pre-transcription audio classifier to each training audio segment of the plurality of training audio segments to generate first metadata classifying each training audio segment based at least on: i) language, ii) audio quality, and iii) accent; applying, by the at least one processor, at least one text classifier to each corresponding truth training segment transcript of the plurality of corresponding truth training segment transcripts to generate second metadata classifying each corresponding truth training segment transcript based at least on at least one content category; combining, by the at least one processor, the plurality of training audio segments, the plurality of corresponding truth training segment transcripts, the first metadata, and the second metadata to form at least one benchmark set; testing, by the at least one processor, each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines based on the at least one benchmark set to form a plurality of model result sets; wherein each model result set corresponds to the respective distinct speech-to-text transcription engine; wherein each model result set comprises: i) the at least one benchmark set, ii) at least one model-specific training hypothesis for each training audio segment, iii) at least one confidence value associated with the at least one model-specific training hypothesis, and iv) at least one word error rate (WER) associated with the at least one model-specific training hypothesis; determining, by the at least one processor, a respective set of transcription decisions for each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines, wherein the respective set of transcription decisions defines, for each distinct speech-to-text transcription engine, the value of the at least one pre-transcription evaluation parameter and the value of the at least one post-transcription evaluation parameter; and combining, by the at least one processor, each respective set of transcription decisions for each distinct speech-to-text transcription engine of the plurality of distinct speech-to-text transcription engines into the at least one speech recognition model specification for the plurality of distinct speech-to-text transcription engines.”
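For illustration only, the word error rate (WER) named in the quoted limitation is conventionally computed as the word-level edit distance between a truth transcript and a hypothesis, divided by the number of words in the truth transcript. The sketch below shows that standard definition; the function name `word_error_rate` is hypothetical, and nothing here is taken from the application's disclosure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (one-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

A per-engine model result set, as recited, could then pair each training segment's hypothesis and confidence value with the WER computed this way against the corresponding truth segment transcript.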
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PARAS D SHAH whose telephone number is (571)270-1650.  The examiner can normally be reached on Monday-Thursday 7:30AM-2:30PM, 5PM-7PM (EST), Friday 8AM-noon (EST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/Paras D Shah/ Primary Examiner, Art Unit 2659
08/25/2022