DETAILED ACTION
This communication is in response to the Amendments and Arguments filed on 07/14/2022. Claims 1-6, 8-13, and 15-20 are pending and have been examined. Hence, this action has been made FINAL.
All previous objections/rejections not mentioned in this Office Action have been withdrawn by the Examiner.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments and Amendments
Amendments to the claims by the Applicant have been considered and addressed below. 
With respect to the 35 U.S.C. § 112(b), 101, and 103 rejections, the Applicant presents several arguments, to which the Examiner responds below.

Claim Rejections - 35 U.S.C. § 112(b):
Arguments: 
The Current Office Action, at page 2, rejects claims 7 and 14, under 35 U.S.C. § 112(b) due to the limitation of "the high frequency terms". The Current Office Action states, "[t]here is [allegedly] insufficient antecedent basis for this limitation in the claim." With the goal of compact prosecution in mind, Applicant has elected to cancel claims 7 and 14. That being said, Applicant respectfully request the claim rejections under 35 U.S.C. § 112(b) be withdrawn.

Examiner response to Arguments:
Applicant’s arguments with respect to the 35 U.S.C. 112(b) rejection of claims 7 and 14 have been fully considered and are persuasive. The 35 U.S.C. 112(b) rejection of claims 7 and 14 has been withdrawn.

Claim Rejections - 35 U.S.C. § 101:
Arguments: 
The Current Office Action, at page 3, rejects claims 1 - 20, under 35 U.S.C. 101 "because the claimed invention is [allegedly] directed to an abstract idea without significantly more." The Current Office Action further states: 
"[t]he limitations... as drafted cover a human organizing of activities. More specifically, a human based on spoken/voice/utterance data and to write it as text (i.e., transcript); calculating a confidence score for the transcript considering the written transcript and other audio metrics such as (background noise, speech ratio, etc); comparing phonemes of the transcription and phonemes in a list of frequently used terms; calculating a similarity score for phonemes in the aforementioned list; and editing the transcription by replacing a term from the frequently used terms list if the similarity score surpasses a value." 
Applicant respectfully disagrees and the rejection is traversed. 
Legal Standard 
Step 1 of the Alice/Mayo analysis is to determine whether the claims are directed to a statutory category of invention. See MPEP 2106.03(II). The next question is whether the analysis can be streamlined or "when viewed as a whole, the eligibility of the claim is self-evident." See MPEP 2106.06(a). If it is self-evident that the claim when viewed as a whole is eligible subject matter, the analysis ends and the claim is eligible subject matter, under 35 U.S.C. 101. 
Step 2 of the Alice/Mayo analysis analyzes whether the claim recites subject matter that is within a judicial exception and is broken into two parts, Step 2A and Step 2B. Step 2A, is further broken down into Prong 1 and Prong 2. 
In Step 2A-Prong 1, the analysis asks does the claim recite one or more of the enumerated judicial exceptions (an abstract idea, law of nature, or natural phenomenon). See Fig 2. of p.11 of 2019 Revised Patent Subject Matter Eligibility Guidance ("2019 RPEG"). If it is determined the claim does recite one or more of the enumerated judicial exceptions, the analysis proceeds to Step 2A-prong 2. 
Page 9 of 13, Docket No. P202001085AUS01, Application No. 17/034,114
In Step 2A-prong 2 the analysis postulates, whether the claim recites any additional elements that integrate the judicial exception into a practical application, that imposes a meaningful limit on the judicial exception. See 2019 Patent Eligibility Guidance. If it is determined the claim does not integrate the judicial exception into a practical application, the analysis proceeds to Step 2B. 
Step 2B of the Alice/Mayo analysis asks, "Does the claim recite additional elements that amount to significantly more than the judicial exception?" See p.10 of 2019 RPEG. 
Argument 
In an effort to ensure compact prosecution, Applicant has elected to make clarifying amendments to independent claims 1, 8, and 15. That being said, claim 1 now reads (in pertinent part): "...transform an utterance into an audio spectrogram; transcribe the audio spectrogram of the utterance into text;" 
[1] As amended, claim 1 is no longer directed to an alleged judicial exception of an abstract idea of a "organizing human activity". This is because it is impossible for a person to transform an utterance into an audio spectrogram. Further, transforming an utterance into an audio spectrogram cannot in anyway be conceived as organizing human activity. Claims 8 and 15 have been amended to contain similar limitations. As such, Applicant requests that the 35 USC § 101 rejections of the claims be withdrawn because independent claims 1, 8, and 15 do not recite a judicial exception. Further, Applicant request the withdraw of the 35 USC § 101 rejections of claims 2 - 6, 9 - 13 and 16 - 20 due to their respective dependencies. 

Examiner response to Arguments:
[1]: Applicant notes that “claim 1 is no longer directed to an alleged judicial exception of an abstract idea of a "organizing human activity". This is because it is impossible for a person to transform an utterance into an audio spectrogram. Further, transforming an utterance into an audio spectrogram cannot in anyway be conceived as organizing human activity. Claims 8 and 15 have been amended to contain similar limitations.” 
The Examiner respectfully disagrees with the Applicant’s assertion that “it is impossible for a person to transform an utterance into an audio spectrogram.” In fact, the limitations “transforming, by one or more processors, an utterance into an audio spectrogram; transcribing, by the one or more processors, the audio spectrogram of the utterance into text;” as drafted in amended independent claim 1 are interpreted as a human (by pen and paper) transforming an utterance signal (i.e., speech received from another human) into a time-frequency signal (i.e., by applying a predefined, known time-frequency analysis relationship, such as a Short-Time Fourier Transform (STFT)) and then relating said transformation to specific words (i.e., text). Also, the Examiner notes that the recitation of “by one or more processors” as drafted is interpreted as a general-purpose computer or computing device that is mainly used to apply the abstract idea. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. For more details, please refer to the analysis presented in the updated 35 U.S.C. 101 rejection, below.
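For reference, the time-frequency relationship named above can be illustrated with a minimal sketch of a Short-Time Fourier Transform magnitude spectrogram in plain Python. The function name `stft_magnitude`, the frame length, the hop size, and the Hann window are illustrative assumptions; none of them is taken from the claims, the specification, or the cited references.

```python
import cmath
import math


def stft_magnitude(samples, frame_len=8, hop=4):
    """Return one list of frequency-bin magnitudes per analysis frame.

    Hypothetical helper for illustration; frame_len and hop are assumed values.
    """
    # Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    spectrogram = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT over the non-negative frequency bins 0..frame_len/2.
        bins = []
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        spectrogram.append(bins)
    return spectrogram


# A pure tone with exactly 2 cycles per 8-sample frame concentrates in bin 2.
tone = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = stft_magnitude(tone)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

Each row of `spec` is one time slice and each column one frequency bin, which is the arithmetic the pen-and-paper interpretation refers to; a practical system would typically use a fast Fourier transform rather than the naive sum shown here.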

Claim Rejections - 35 U.S.C. § 103:
Arguments: 
On page 5, the Current Office Action rejects claims 1, 4, 6, 8, 11, 13, 15 and 18 under 35 U.S.C. 103 as being allegedly unpatentable over Strohman et al. (US20170229124A1) (hereinafter Strohman) in view of Ingmarsson (US9959864B1) (hereinafter Ingmarsson) and Finlay et al. (US20210065679A1) (hereinafter Finlay). 
Legal standards 
The Examiner bears the burden of establishing a prima facie case of obviousness based on prior art when rejecting claims under 35 U.S.C. § 103. In re Fritch, 972 F.2d 1260, 23 U.S.P.Q.2d 1780 (Fed. Cir. 1992). The prior art reference (or references when combined) must teach or suggest all the claim limitations. In re Royka, 490 F.2d 981, 180 USPQ 580 (CCPA 1974). "In determining obviousness, the scope and content of the prior art are ... determined; differences between the prior art and the claims at issue are ... ascertained; and the level of ordinary skill in the pertinent art resolved. Against this background the obviousness or non-obviousness of the subject matter is determined." Graham v. John Deere Co., 383 U.S. 1 (1966). "Often, it will be necessary for a court to look to interrelated teachings of multiple patents; the effects of demands known to the design community or present in the marketplace; and the background knowledge possessed by a person having ordinary skill in the art, all in order to determine whether there was an apparent reason to combine the known elements in the fashion claimed by the patent at issue." KSR Int'l. Co. v. Teleflex, Inc., 550 U.S. 398 (2007). 
"Rejections on obviousness grounds cannot be sustained by mere conclusory statements; instead, there must be some articulated reasoning with some rational underpinning to support the legal conclusion of obviousness." Id. (citing In re Kahn, 441 F.3d 977, 988 (Fed. Cir. 2006)). 
Argument 
Page 11 of the Current Office Action states "here the sound value/score is interpreted as analogous to the sounds similar score and the comparing is interpreted as associated with the comparison of the waveform from the sound data (input utterance) and the waveform of the generated sound value (by the ASR)." Applicant respectfully disagrees. 
[1] Applicant contends Strohman does not teach the limitation of "generating, by the one or more processors, a sounds similar score for phonemes in the at least one terms from a high frequency term list based on the comparing" (emphasis added) Specifically, Strohman does not teach the high frequency terms list. As stated on page 11 of the Current Office Action, Strohman "can receive sound data and corresponding to the work or subwords of the voice data, for example, phoneme generating sound score, the sound value can reflect out the pronunciation similarity between words or subwords and sound data". A key missing limitation is the high frequency term list, which is not contemplated by Strohman. That being said, neither Ingmarsson nor Finlay make mention of a high frequency term list. 
[2] Furthermore, the Current Office Action fails to mention how any of the cited references could be interpreted to teach, alone or in combination, the limitation of "...responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list;...", as required. Therefore, for at least the aforementioned reasons Applicant respectfully requests the rejections under 35 USC § 103 be withdrawn from independent claims 1, 8, and 15. Further, Applicant request the withdraw of the 35 USC § 103 rejections of claims 2 - 6, 9 - 13 and 16 - 20 due to their respective dependencies. 

Claims as amended:
1. (Currently Amended) A computer-implemented method for training a model for improving speech recognition, the computer-implemented method comprising: transforming, by one or more processors, an utterance into an audio spectrogram; transcribing, by the one or more processors, the audio spectrogram of the utterance into text; generating, by the one or more processors, a transcription confidence score based on the transcription and audio metrics; responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list; generating, by the one or more processors, a sounds similar score for phonemes in the at least one terms from the high frequency term list, based on the comparing; and replacing, by the one or more processors, the transcription with the at least one term from the high frequency term list, if the sounds similar score is above a threshold.
8. (Currently Amended) A computer system for improving speech recognition transcriptions, the system comprising: one or more computer processors; one or more computer readable storage media; computer program instructions to; transform an utterance into an audio spectrogram; transcribe the audio spectrogram of the utterance into text; generate a transcription confidence score based on the transcription and audio metrics; responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list; generate a sounds similar score for phonemes in the at least one terms from the high frequency term list, based on the comparing; and replace the transcription with the at least one term from the high frequency term list, if the sounds similar score is above a threshold.
15. (Currently Amended) A computer program product for improving speech recognition transcriptions, the computer program product comprising a computer readable storage media and program instructions sorted on the computer readable storage media, the program instructions including instructions to: transform an utterance into an audio spectrogram; transcribe the audio spectrogram of the utterance into text; generate a transcription confidence score based on the transcription and audio metrics; responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list; generate a sounds similar score for phonemes in the at least one terms from the high frequency term list, based on the comparing; and replace the transcription with the at least one term from the high frequency term list, if the sounds similar score is above a threshold.
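To make the sequence of the last four limitations easier to follow, the following is a minimal Python sketch of the thresholded comparison and replacement logic. It is not the Applicant's implementation: the function names, the position-wise scoring rule, both threshold values, and the sample term list are assumptions made purely for illustration, and the spectrogram and transcription steps are omitted because they would require a trained model.

```python
def sounds_similar_score(utterance_phonemes, term_phonemes):
    """Assumed scoring rule: fraction of aligned positions whose phonemes match.

    Returns 0.0 when the sequences differ in length or the term is empty.
    """
    if not term_phonemes or len(utterance_phonemes) != len(term_phonemes):
        return 0.0
    matches = sum(u == t for u, t in zip(utterance_phonemes, term_phonemes))
    return matches / len(term_phonemes)


def correct_transcription(transcription, confidence, utterance_phonemes,
                          high_frequency_terms,
                          confidence_threshold=0.6, similarity_threshold=0.7):
    """Replace a low-confidence transcription with the best-scoring listed term.

    Both thresholds are hypothetical values, not taken from the claims.
    """
    # The comparison is only triggered when confidence is below the threshold.
    if confidence >= confidence_threshold:
        return transcription
    best_term, best_score = None, 0.0
    for term, phonemes in high_frequency_terms.items():
        score = sounds_similar_score(utterance_phonemes, phonemes)
        if score > best_score:
            best_term, best_score = term, score
    # Replace only if the sounds similar score is above its own threshold.
    if best_term is not None and best_score > similarity_threshold:
        return best_term
    return transcription


# Hypothetical high frequency term list keyed by term, with phoneme sequences.
terms = {"claim": ["K", "L", "EY", "M"], "plan": ["P", "L", "AE", "N"]}
heard = ["K", "L", "EY", "N"]
low_conf_result = correct_transcription("clane", 0.4, heard, terms)
high_conf_result = correct_transcription("clane", 0.9, heard, terms)
```

In this sketch "claim" scores 3/4 and "plan" scores 2/4 against the heard phonemes, so only the low-confidence call is replaced; the high-confidence call returns the original transcription unchanged.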

Examiner response to Arguments:
[1]: Applicant notes “Strohman does not teach the limitation of "generating, by the one or more processors, a sounds similar score for phonemes in the at least one terms from a high frequency term list based on the comparing" (emphasis added) Specifically, Strohman does not teach the high frequency terms list. As stated on page 11 of the Current Office Action, Strohman "can receive sound data and corresponding to the work or subwords of the voice data, for example, phoneme generating sound score, the sound value can reflect out the pronunciation similarity between words or subwords and sound data". A key missing limitation is the high frequency term list, which is not contemplated by Strohman. That being said, neither Ingmarsson nor Finlay make mention of a high frequency term list.”
The Examiner respectfully disagrees and notes that page 22 of the Office Action mailed on 04/14/2022 discloses: “the high frequency terms phonemes are interpreted as associated with the terms present in the language model, while the sound similar list is interpreted as associated with the terms associated with the sound matching scores (i.e. SE-ET-EE-ZE, language model 114 can [consider] sound matching score of 0.9, 0.9, 0.9. 0.7 and "SE-ET-I-ZE", language model 114 can [consider] a sound matching score 0.9, 0.9, 0.7, 0.9).).” (emphasis added). Hence, for clarification purposes, the Examiner interpreted the “high frequency terms list” as the terms present in the language model, such as “Citizen” (e.g., the “CityZen” example discussed on page 11 of the Office Action, citing ¶ 3 of Preferred embodiments (page 4) of the Strohman reference). Further, the Applicant is referred to the following citation: 
“Language model 114 may be based on likelihood word sequence is present and sound value to generate the initial candidate transcription. For example, language model 114 may be based on word "CityZen Reservation" likelihood of occurrence is 0%, for example, because the word "CityZen" is not included in language model 114, word-Citizen Reservation "likelihood of occurrence is 70%,” (emphasis added) (¶ 6 of Preferred embodiments (page 4)).
Applicant’s arguments with respect to the rejection(s) of claims 1, 4, 6, 8, 11, 13, 15 and 18 under 35 U.S.C. 103 over Strohman et al. (US20170229124A1) in view of Ingmarsson (US9959864B1) and Finlay et al. (US20210065679A1) have been fully considered and are not persuasive. However, upon further consideration of the amended independent claims, a new ground of rejection is made over Strohman et al. (US20170229124A1) in view of Ingmarsson (US9959864B1) and Finlay et al. (US20210065679A1), and further in view of Hannun et al. (US 20160171974 A1). Please see details below.
1. (Currently Amended) A computer-implemented method for training a model for improving speech recognition (see ¶ 1 of Invention contents and ¶ 5 of Preferred embodiments section (page 4) of Strohman et al.: “In general, one innovative aspect of the subject matter described in this specification can is embodied in a method of improving speech recognition using the external data source. For example, an automatic speech recognizer may receive intercom audio data for encoding and using a first language model provides initial candidate transcription of the speech. [0020] ASR 110 language model may receive the sound value and based on the sound score generating initial candidate transcription. For example, ASR 110 of the language model 114 receives the voice score less than SE-0.9/0/0/, ... EE-0/0/0.9/ ... I-0/0.7/0/ ... ", and in response, generating initial candidate transcription" Citizen Reservation ".” Here, the “training a model” is interpreted as determining good values (i.e., likelihood) for all the weights (i.e., candidate transcription).), the computer-implemented method comprising: 
transforming, by one or more processors, an utterance into an audio spectrogram (see ¶ [0052] of Hannun et al.: “…The jitter set of audio files, including the corresponding original audio file, are converted (310) into a set of spectrograms”); 
transcribing, by the one or more processors, the audio spectrogram of the utterance into text (see ¶ [0035] of Hannun et al.: “In embodiments, a recurrent neural network (RNN) is trained to ingest speech spectrograms and generate English text transcriptions.”); 
generating, by the one or more processors, a transcription confidence score based on the transcription and audio metrics (see Col. 7, lines 33-47 of Ingmarsson: “In some implementations, the ASR 110 may additionally include the re-scorer 114, which rescores the confidence scores calculated by ASR 110 for each particular candidate transcription. For instance, the re-scorer 114 may additionally compare the phonetic similarity between each of candidate transcription and the audio data 104b to determine which individual candidate transcription represents the transcription that is most likely to be the correct transcription. For example, if the audio data 104b includes significant amounts of noise, the re-scorer 114 may adjust the confidence scores assigned to each of the initial candidate transcription 104c and the additional candidate transcription 104d such that the ASR 110 appropriately select the candidate transcription that is most likely to be an accurate transcription.”); 
responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list (see ¶ [0044, 0047] of Finlay et al.: “[0044] A search over a phonetic index is performed over and using phoneme sequences, rather than text strings as used in an ASR search. Embodiments may convert portions or all of the ASR index or transcript to phonetic representation at indexing time, prior to searching, and prior to receiving a search. An embodiment may “pronounce” (e.g. convert from text to phoneme) each word in the ASR transcript to build a master lookup table providing a correspondence between pronounced words and their appearance in the ASR index. For example, for each word in the ASR transcript, the phoneme sequence corresponding to the word may be generated. […]. [0047] […] In some embodiments, words with ASR confidence scores below a threshold will not appear in the phoneme sequence lookup table, and thus the decision at search time as to whether a word is low confidence can be made by determining that the pronounced word does not appear in the phoneme sequence lookup table.” Here, the comparison of the phonemes in the utterance to the phonemes from a term list is interpreted as analogous to the terms not appearing in the sequence lookup table, which results in determining that the pronounced word (utterance) does not appear in the phoneme sequence (from the ASR).); 
generating, by the one or more processors, a sounds similar score for phonemes in the at least one terms from the high frequency term list, based on the comparing (see ¶ 3 of Preferred Embodiments (page 4) of Strohman et al.: “ASR 110 of the acoustic model 112 can receive sound data and corresponding to the word or sub words of the voice data, for example, phoneme-generating sound score. the sound value can reflect out the pronunciation similarity between words or sub words and sound data. For example, the acoustic model can receive "CityZen Reservation" sound data and generates sound value SE-0.9/0/0/ ..., ... EE-0/0/0.9/ ... I-0/0.7/0/ ... ". The exemplary sound score may indicate a phoneme "SE" in the speech sound matching, of the first sub word has 90% for speaking in the second sub acoustic matching words with 0%, and third sub word in the speech with sound of 0% matching, for the phoneme "EE", in the speaker of the first sub acoustic matching word has 0%, for speech in the second sub acoustic matching word has 0%, and third sub word in the speech with sound matching of 90%; and for the phoneme "I", in the speaker of the first sub acoustic matching word has 0%, for speech in the second sub acoustic matching word has 0%, and third sub word in the speech with sound matching of 70%. In more examples, a voice model 112 can each output sound value for combined phoneme and position of sub words in the speech. Acoustic model 112 may be based on the waveform indicated by the sound data and is indicated as corresponding to a specific sub word by comparing the waveform to generate sound value. 
For example, the acoustic model 112 can receive the "CityZen Reservation" talk and identifying out the beginning of sound data represents the phoneme "SE" 90% matching the stored waveform having a waveform, and in response, a phoneme "SE" generating sound score of 0.9 for the first phoneme in the speech.” Here, the sound value/score is interpreted as analogous to the sounds similar score and the comparing is interpreted as associated with the comparison of the waveform from the sound data (input utterance) and the waveform of the generated sound value (by the ASR).); and 
replacing, by the one or more processors, the transcription with the at least one term from the high frequency term list, if the sounds similar score is above a threshold (see ¶ 1 of Invention contents section (page 2) of Strohman et al.: “The system then can be different to the initial candidate transcription application of second language model to generate a replacement candidate transcription, (i) a sound similar to the initial candidate transcription, and (ii) possibly appearing in the given language.”).

[2]: Applicant notes that “Office Action fails to mention how any of the cited references could be interpreted to teach, alone or in combination, the limitation of "...responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list;...", as required.” 
The Examiner respectfully disagrees with the Applicant’s argument and refers to the Office Action mailed on 04/14/2022, pages 13-15, incorporated below for reference.
(From Claim 1 rejection. Rejected under 35 U.S.C. 103 as being unpatentable over Strohman; Trevor D. et al. (CN 107045871 A; hereinafter referred to as Strohman et al.) in view of Ingmarsson; Carl-Anton (US 9959864 B1; hereinafter referred to as Ingmarsson) and Finlay; William Mark et al. (US 20210065679 A1; hereinafter referred to as Finlay et al.).)
However, Strohman et al. in combination with Ingmarsson does not explicitly teach:
responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list
Finlay et al. does teach wherein:
responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list (see ¶ [0044, 0047]: “[0044] A search over a phonetic index is performed over and using phoneme sequences, rather than text strings as used in an ASR search. Embodiments may convert portions or all of the ASR index or transcript to phonetic representation at indexing time, prior to searching, and prior to receiving a search. An embodiment may “pronounce” (e.g. convert from text to phoneme) each word in the ASR transcript to build a master lookup table providing a correspondence between pronounced words and their appearance in the ASR index. For example, for each word in the ASR transcript, the phoneme sequence corresponding to the word may be generated. […]. [0047] […] In some embodiments, words with ASR confidence scores below a threshold will not appear in the phoneme sequence lookup table, and thus the decision at search time as to whether a word is low confidence can be made by determining that the pronounced word does not appear in the phoneme sequence lookup table.” Here, the comparison of the phonemes in the utterance to the phonemes from a term list is interpreted as analogous to the terms not appearing in the sequence lookup table, which results in determining that the pronounced word (utterance) does not appear in the phoneme sequence (from the ASR).);
Strohman et al., Ingmarsson, and Finlay et al. are all considered to be analogous to the claimed invention because they are in the same field of endeavor, namely speech recognition. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Strohman et al. in combination with Ingmarsson to incorporate the teaching of Finlay et al. of, responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list, which provides the benefit of leveraging the use of ASR for in-vocabulary words ([0005] of Finlay et al.).

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-6, 8-13, and 15-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. 
The independent claims 1, 8, and 15 recite:
transforming, by one or more processors, an utterance into an audio spectrogram;
transcribing, by the one or more processors, the audio spectrogram of the utterance into text;
generating, by the one or more processors, a transcription confidence score based on the transcription and audio metrics;
responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list;
generating, by the one or more processors, a sounds similar score for phonemes in the at least one terms from the high frequency term list, based on the comparing; and
replacing, by the one or more processors, the transcription with the at least one term from the high frequency term list, if the sounds similar score is above a threshold.

The limitations of “transforming…”; “transcribing…”; “generating…”; “responsive to … comparing…”; “generating…”; and “replacing…”, as drafted, cover a method of organizing human activity. More specifically, they cover a human (by pen and paper) transforming an utterance signal (i.e., speech received from another human) into a time-frequency signal (i.e., by applying a predefined, known time-frequency analysis relationship, such as a Short-Time Fourier Transform (STFT)) and then relating said transformation to specific words (i.e., text); calculating a confidence score for the transcript considering the written transcript and other audio metrics (such as background noise, speech ratio, etc.); comparing phonemes of the transcription and phonemes in a list of frequently used terms; calculating a similarity score for phonemes in the aforementioned list; and editing the transcription by replacing it with a term from the frequently used terms list if the similarity score surpasses a value.
This judicial exception is not integrated into a practical application because, for example, the claims recite “one or more processors.” Also, [0017] and [0061] of the as-filed specification state: “Server 102 and client computer 112 can be a standalone computing device, a management server, a web server, a mobile computing device, or any other electronic device or computing system capable of receiving, sending, and processing data. In other embodiments, server 102 and client computer 112 can represent a server computing system utilizing multiple computers as a server system. In another embodiment, server 102 and client computer 112 can be a laptop computer, a tablet computer, a netbook computer, a personal computer, a desktop computer, or any programmable electronic device capable of communicating with other computing devices (not shown) within speech recognition transcription correction environment 100 via network 110. […] These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.” Therefore, a general-purpose computer or computing device is described and is mainly used to apply the abstract idea. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is recited as a general-purpose computing device, as noted. The claims are not patent eligible. 

With respect to claims 2, 9, and 16, the claims recite:
determining, by the one or more processors, a number of phonemes in the utterance;
removing, by the one or more processors, high frequency terms from consideration that do not have the same number of phonemes as the utterance; and
matching, by the one or more processors, the phonemes of remaining high frequency terms to the phonemes in the utterance.
This relates to a human organizing of ideas. This reads on a human dividing an utterance into a number of phonemes and counting them; disregarding or deleting terms from a list that do not have the same number of phonemes; and relating or comparing the remaining terms of the list to the phonemes in the input utterance. 
This judicial exception is not integrated into a practical application because for example: in [0017] and [0061] of the as filed specification (as cited above), a general-purpose computer or computing device is described and mainly used as an application thereof. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is recited as a general computing device, as noted. The claim is not patent eligible. 
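For illustration only, the three steps recited in claims 2, 9, and 16 (counting phonemes, removing terms with a different phoneme count, and matching the remaining terms) may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the claims, the as filed specification, or any cited reference: the function names, the phoneme symbols, the sample term list, and the use of a position-by-position match fraction are all hypothetical.

```python
from typing import Dict, List


def filter_by_phoneme_count(
    utterance: List[str], terms: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Keep only the terms whose phoneme count equals the utterance's count."""
    n = len(utterance)
    return {term: ph for term, ph in terms.items() if len(ph) == n}


def match_fraction(utterance: List[str], term_phonemes: List[str]) -> float:
    """Fraction of positions at which the two phoneme sequences agree.

    Assumes both sequences have the same length (guaranteed by the filter).
    """
    if not utterance:
        return 0.0
    hits = sum(1 for u, t in zip(utterance, term_phonemes) if u == t)
    return hits / len(utterance)


# Hypothetical utterance and high frequency term list (illustrative symbols).
utterance = ["K", "AE", "T"]
terms = {
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "cattle": ["K", "AE", "T", "AH", "L"],  # 5 phonemes, removed by the filter
}

remaining = filter_by_phoneme_count(utterance, terms)
scores = {term: match_fraction(utterance, ph) for term, ph in remaining.items()}
```

In this sketch the count filter runs first so that the position-by-position comparison is only applied to sequences of equal length.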

With respect to claims 3, 10, and 17, the claims recite:
responsive to the phoneme not matching, determining, by the one or more processors, whether the utterance phoneme that does not match to the high frequency terms match phonemes from a sounds similar list for the corresponding high frequency term phoneme.
This relates to an action. This reads on a human comparing the phonemes of the input utterance with the phonemes of the frequently used terms, determining there is no match and comparing with terms/phonemes associated with the similarity score calculated above. 
This judicial exception is not integrated into a practical application because for example: in [0017] and [0061] of the as filed specification (as cited above), a general-purpose computer or computing device is described and mainly used as an application thereof. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is recited as a general computing device, as noted. The claim is not patent eligible. 

With respect to claims 4, 11, and 18, the claims recite:
wherein the audio metrics are comprised of at least one of the following: signal-to-noise ratio, background noise, speech ratio, high frequency loss, direct current offset, clipping rate, speech level, or non-speech level.
This relates to an operation. This reads on a human computing one of the examples determining “audio metrics” such as, signal-to-noise ratio, speech ratio, frequency loss, etc. No additional limitations are present.
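For illustration only, two of the listed audio metrics may be computed as sketched below. The signal-to-noise ratio uses the standard definition in decibels, 10·log10(P_signal / P_noise). The clipping rate is computed here as the fraction of samples at or beyond full scale, which is an assumed definition; the claims and the as filed specification are not quoted as defining it. All function names and sample values are hypothetical.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average of the squared sample values."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


def clipping_rate(samples: Sequence[float], full_scale: float = 1.0) -> float:
    """Assumed definition: fraction of samples at or beyond full scale."""
    clipped = sum(1 for s in samples if abs(s) >= full_scale)
    return clipped / len(samples)


# Hypothetical sample values normalized to the range [-1.0, 1.0].
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.05, -0.05, 0.05, -0.05]
snr = snr_db(speech, noise)  # power ratio of 100, i.e. about 20 dB
rate = clipping_rate([1.0, 0.2, -1.0, 0.1])  # two of four samples clipped
```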

With respect to claims 5, 12, and 19, the claims recite:
wherein the transcribing is performed by an automatic speech recognition module based on a deep neural network.
This relates to an action. This reads on a human receiving a spoken utterance and transcribing/writing it.
This judicial exception is not integrated into a practical application because for example: in [0017] and [0061] of the as filed specification (as cited above), a general-purpose computer or computing device is described and mainly used as an application thereof. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is recited as a general computing device, as noted. The claim is not patent eligible. 

With respect to claims 6, 13, and 20  the claims recite: 
receiving, by the one or more processors, the utterance.
This relates to an action. This reads on a human receiving an utterance from another human. 
This judicial exception is not integrated into a practical application because for example: in [0017] and [0061] of the as filed specification (as cited above), a general-purpose computer or computing device is described and mainly used as an application thereof. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is recited as a general computing device, as noted. The claim is not patent eligible. 

With respect to claim 20, the claim recites: 
wherein the receiving is performed by a virtual assistant, at a specific node of the virtual assistant, wherein the high frequency terms over a time period have been identified for the specific node 
This relates to an action. This reads on a human receiving an utterance (spoken input) from another human, where the human can identify frequently used terms and keep them on a list. 
This judicial exception is not integrated into a practical application because for example: in [0017] and [0061] of the as filed specification (as cited above), a general-purpose computer or computing device is described and mainly used as an application thereof. Also, in [0014] of the as filed specification, “In an embodiment of the invention, a log of historical recordings of user utterances and audio metrics at a specific node of a virtual assistant (VA) are received” shows that the virtual assistant is simply being used to receive data. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is recited as a general computing device, as noted. The claim is not patent eligible. 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 4, 6, 8, 11, 13, 15, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Strohman et al. (US 20170229124 A1) further in view of Hannun et al. (US 20160171974 A1), Ingmarsson (US 9959864 B1), and Finlay et al. (US 20210065679 A1). 
As to independent claim 1, Strohman et al. teaches 
1. (Currently Amended) A computer-implemented method for training a model for improving speech recognition (see ¶ 1 of Invention contents and ¶ 5 of Preferred embodiments section (page 4) of Strohman et al.: “In general, one innovative aspect of the subject matter described in this specification can is embodied in a method of improving speech recognition using the external data source. For example, an automatic speech recognizer may receive intercom audio data for encoding and using a first language model provides initial candidate transcription of the speech. [0020] ASR 110 language model may receive the sound value and based on the sound score generating initial candidate transcription. For example, ASR 110 of the language model 114 receives the voice score less than SE-0.9/0/0/, ... EE-0/0/0.9/ ... I-0/0.7/0/ ... ", and in response, generating initial candidate transcription" Citizen Reservation ".” Here, the “training a model” is interpreted as determining good values (i.e., likelihood) for all the weights (i.e., candidate transcription).), the computer-implemented method comprising: 
generating, by the one or more processors, a sounds similar score for phonemes in the at least one terms from the high frequency term list, based on the comparing (see ¶ 3 of Preferred Embodiments (page 4) ) of Strohman et al.: ASR 110 of the acoustic model 112 can receive sound data and corresponding to the word or sub words of the voice data, for example, phoneme-generating sound score. the sound value can reflect out the pronunciation similarity between words or sub words and sound data. For example, the acoustic model can receive "CityZen Reservation" sound data and generates sound value SE-0.9/0/0/ ..., ... EE-0/0/0.9/ ... I-0/0.7/0/ ... ". The exemplary sound score may indicate a phoneme "SE" in the speech sound matching, of the first sub word has 90% for speaking in the second sub acoustic matching words with 0%, and third sub word in the speech with sound of 0% matching, for the phoneme "EE", in the speaker of the first sub acoustic matching word has 0%, for speech in the second sub acoustic matching word has 0%, and third sub word in the speech with sound matching of 90%; and for the phoneme "I", in the speaker of the first sub acoustic matching word has 0%, for speech in the second sub acoustic matching word has 0%, and third sub word in the speech with sound matching of 70%. In more examples, a voice model 112 can each output sound value for combined phoneme and position of sub words in the speech. Acoustic model 112 may be based on the waveform indicated by the sound data and is indicated as corresponding to a specific sub word by comparing the waveform to generate sound value. 
For example, the acoustic model 112 can receive the "CityZen Reservation" talk and identifying out the beginning of sound data represents the phoneme "SE" 90% matching the stored waveform having a waveform, and in response, a phoneme "SE" generating sound score of 0.9 for the first phoneme in the speech.” Here, the sound value/score is interpreted as analogous to the sounds similar score and the comparing is interpreted as associated with the comparison of the waveform from the sound data (input utterance) and the waveform of the generated sound value (by the ASR).); and 
replacing, by the one or more processors, the transcription with the at least one term from the high frequency term list, if the sounds similar score is above a threshold (see ¶ 1 of Invention contents section (page 2) of Strohman et al.: “The system then can be different to the initial candidate transcription application of second language model to generate a replacement candidate transcription, (i) a sound similar to the initial candidate transcription, and (ii) possibly appearing in the given language.”).

However, Strohman et al does not explicitly teach:
transforming, by one or more processors, an utterance into an audio spectrogram recognition;
transcribing, by the one or more processors, the audio spectrogram of the utterance into text;
generating, by the one or more processors, a transcription confidence score based on the transcription and audio metrics;
responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list; 
Hannun et al. does teach:
transforming, by one or more processors, an utterance into an audio spectrogram recognition (see ¶ [0052] of Hannun et al.: “…The jitter set of audio files, including the corresponding original audio file, are converted (310) into a set of spectrograms”); 
transcribing, by the one or more processors, the audio spectrogram of the utterance into text (see ¶ [0035] of Hannun et al.: “In embodiments, a recurrent neural network (RNN) is trained to ingest speech spectrograms and generate English text transcriptions.”);  
Strohman et al. and Hannun et al. are both considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Strohman et al to incorporate the teachings of Hannun et al. of transforming, by one or more processors, an utterance into an audio spectrogram recognition and transcribing, by the one or more processors, the audio spectrogram of the utterance into text which provides the benefit of maintaining an improvement over audio-only models ([0040] of Hannun et al.).

However, Strohman et al. in combination with Hannun et al. do not explicitly teach:
generating, by the one or more processors, a transcription confidence score based on the transcription and audio metrics;
responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list; 

Ingmarsson does teach wherein:
generating, by the one or more processors, a transcription confidence score based on the transcription and audio metrics (see Col. 7, lines 33-47 of Ingmarsson: “In some implementations, the ASR 110 may additionally include the re-scorer 114, which rescores the confidence scores calculated by ASR 110 for each particular candidate transcription. For instance, the re-scorer 114 may additionally compare the phonetic similarity between each of candidate transcription and the audio data 104b to determine which individual candidate transcription represents the transcription that is most likely to be the correct transcription. For example, if the audio data 104b includes significant amounts of noise, the re-scorer 114 may adjust the confidence scores assigned to each of the initial candidate transcription 104c and the additional candidate transcription 104d such that the ASR 110 appropriately select the candidate transcription that is most likely to be an accurate transcription.”).
Strohman et al. in combination with Hannun et al. and Ingmarsson are both considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Strohman et al. in combination with Hannun et al. to incorporate the teachings of Ingmarsson of generating, by the one or more processors, a transcription confidence score based on the transcription and audio metrics which provides the benefit of improving voice recognition accuracy. (Col. 1, lines 20-29 of Ingmarsson).

However, Strohman et al. in combination with Hannun et al. and Ingmarsson do not explicitly teach:
responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list; 

Finlay et al. does teach wherein:
responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list (see ¶ [0044, 0047] of Finlay et al.: “[0044] A search over a phonetic index is performed over and using phoneme sequences, rather than text strings as used in an ASR search. Embodiments may convert portions or all of the ASR index or transcript to phonetic representation at indexing time, prior to searching, and prior to receiving a search. An embodiment may “pronounce” (e.g. convert from text to phoneme) each word in the ASR transcript to build a master lookup table providing a correspondence between pronounced words and their appearance in the ASR index. For example, for each word in the ASR transcript, the phoneme sequence corresponding to the word may be generated. […]. [0047] […] In some embodiments, words with ASR confidence scores below a threshold will not appear in the phoneme sequence lookup table, and thus the decision at search time as to whether a word is low confidence can be made by determining that the pronounced word does not appear in the phoneme sequence lookup table.” Here, the comparison of the phonemes in the utterance to the phonemes from a term list is interpreted as analogous to the terms not appearing in the sequence lookup table, which results in determining that the pronounced word (utterance) does not appear in the phoneme sequence (from the ASR).);
Strohman et al in combination with Hannun et al.,  and Ingmarsson, and Finlay et al. are all considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Strohman et al in combination with Hannun et al., and Ingmarsson to incorporate the teachings of Finlay et al. of responsive to the transcription confidence score being below a threshold, comparing, by the one or more processors, phonemes in the utterance to phonemes in at least one term from a high frequency term list which provides the benefit of leveraging the use of ASR for in-vocabulary words. ([0005] of Finlay et al.).
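For illustration only, the conditional flow recited in claim 1 (a confidence score below a threshold triggers a phoneme comparison, and the transcription is replaced only if the sounds similar score is above a threshold) may be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce the method of the applicant or of any cited reference: the function names, both threshold values, the sample term list, and the position-by-position scoring rule are hypothetical.

```python
from typing import Dict, List


def sounds_similar_score(utt: List[str], term: List[str]) -> float:
    """Assumed scoring rule: fraction of positions with identical phonemes."""
    if not utt or len(utt) != len(term):
        return 0.0
    return sum(1 for u, t in zip(utt, term) if u == t) / len(utt)


def correct_transcription(
    transcription: str,
    confidence: float,
    utt_phonemes: List[str],
    term_list: Dict[str, List[str]],
    conf_threshold: float = 0.8,  # hypothetical value
    sim_threshold: float = 0.6,   # hypothetical value
) -> str:
    """Replace a low-confidence transcription with the best-scoring term."""
    if confidence >= conf_threshold:
        return transcription  # confident enough, no comparison is performed
    best_term, best_score = None, 0.0
    for term, phonemes in term_list.items():
        score = sounds_similar_score(utt_phonemes, phonemes)
        if score > best_score:
            best_term, best_score = term, score
    if best_term is not None and best_score > sim_threshold:
        return best_term
    return transcription


# Hypothetical high frequency term list and utterance phonemes.
term_list = {"billing": ["B", "IH", "L", "IH", "NG"]}
utt = ["P", "IH", "L", "IH", "NG"]  # 4 of 5 positions match "billing"
result = correct_transcription("pilling", 0.4, utt, term_list)
```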

Regarding claim 4, 11 and 18, Strohman et al. in combination with Hannun et al.,  and Ingmarsson, and Finlay et al. teach the limitations of claim 1, 8, and 15 as above, 
Ingmarsson further teaches wherein:
signal-to-noise ratio, background noise, speech ratio, high frequency loss, direct current offset, clipping rate, speech level, or non-speech level (see Col. 7, lines 33-47: In some implementations, the ASR 110 may additionally include the re-scorer 114, which rescores the confidence scores calculated by ASR 110 for each particular candidate transcription. For instance, the re-scorer 114 may additionally compare the phonetic similarity between each of candidate transcription and the audio data 104b to determine which individual candidate transcription represents the transcription that is most likely to be the correct transcription. For example, if the audio data 104b includes significant amounts of noise, the re-scorer 114 may adjust the confidence scores assigned to each of the initial candidate transcription 104c and the additional candidate transcription 104d such that the ASR 110 appropriately select the candidate transcription that is most likely to be an accurate transcription.”).
Strohman et al. in combination with Hannun et al.,  and Ingmarsson, and Finlay et al. and Ingmarsson are both considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Strohman et al. in combination with Hannun et al.,  and Ingmarsson, and Finlay et al. to incorporate the teachings of Ingmarsson of generating, by the one or more processors, a transcription confidence score based on the transcription and audio metrics which provides the benefit of improving voice recognition accuracy. (Col. 1, lines 20-29 of Ingmarsson).

Regarding claims 6 and 13, Strohman et al. in combination with Hannun et al., Ingmarsson, and Finlay et al. teaches the limitations of claims 1 and 8 as above, 
Strohman et al. further teaches the method further comprising:
receiving, by the one or more processors, the utterance (see ¶ 1 of Invention contents section (page 2): “For example, an automatic speech recognizer may receive intercom audio data for encoding and using a first language model provides initial candidate transcription of the speech.”).

As to independent claim 8, Strohman et al. in combination with Hannun et al.,  and Ingmarsson, and Finlay et al. teaches the limitations of claim 1.
Strohman et al. further teaches a computer system for improving speech recognition transcriptions, the system comprising: 
one or more computer processors (see ¶ 1 of page 8: “processor 302”); 
one or more computer readable storage media (see ¶ [0045-0047]: “[0047] The storage device 306 is capable of providing mass storage for the computing device 300. In one implementation, the storage device 306 may be or contain a computer-readable medium, …”); 
computer program instructions to (see ¶ 3 of page 8: “storage device 306 can be a computer-readable medium or a computer-readable medium,…”) [perform the instructions disclosed in claim 1].

As to independent claim 15, Strohman et al. in combination with Hannun et al., Ingmarsson, and Finlay et al. further teaches a computer program product for improving speech recognition transcriptions, the computer program product comprising a computer readable storage media and program instructions stored on the computer readable storage media (see ¶ 3 of page 8: “[…] the computer program product further comprising instructions, the instructions, when executed by one or more methods such as more described.”), the program instructions including instructions to [perform the instructions disclosed in claim 1].

Claims 2-3, 9-10, and 16-17 are rejected under 35 U.S.C. 103 as being unpatentable over Strohman et al. (US 20170229124 A1) further in view of Hannun et al. (US 20160171974 A1), Ingmarsson (US 9959864 B1), and Finlay et al. (US 20210065679 A1) as applied to claims 1, 8, and 15 above, and further in view of Williamson  (US 6785417 B1). 

Regarding claims 2, 9, and 16, Strohman et al. in combination with Hannun et al., Ingmarsson, and Finlay et al. teaches the limitations of claims 1, 8, and 15 as above.
However, Strohman et al. in combination with Hannun et al., Ingmarsson, and Finlay et al. does not explicitly teach wherein 
the comparing further comprises: 
determining, by the one or more processors, a number of phonemes in the utterance; 
removing, by the one or more processors, high frequency terms from consideration that do not have the same number of phonemes as the utterance; and 
matching, by the one or more processors, the phonemes of remaining high frequency terms to the phonemes in the utterance

Williamson et al. does teach wherein:
the comparing further comprises: 
determining, by the one or more processors, a number of phonemes in the utterance (see Col. 2, lines 11-30 and Col. 5, lines 28-30: “(9) Numerous other variations are possible because of the use of alternates, which also may be returned with a probability ranking. For example, rather than a strict exact match test on the alternates, a scheme that looks for a percentage of matching characters can be implemented, with the user optionally adjusting the percentage, e.g., from loose to exact. Other variations include the weighting of certain characters, (e.g., the first character has to exactly match, with only a percentage of others needed), and/or factoring in the number of syllables. Since alternates are returned with a probability, the probabilities of alternates may be used, e.g., a looser match is adequate on a highly probable word, while an exact match is required on a less probable word. Other variations include length of word weighting, Bayesian combination of probabilities to determine weighting, alternate to alternate exact match, percentage of alternate to alternate matches, the percentage of the percentages and so on, and the use of word/alternate matching in conjunction with ink/feature/bitmap/image matching. Various combinations of these variations are also feasible. […] However, as will be understood, the present invention will operate with any type of recognizer that returns alternates, including a speech recognizer.” Here, the number of phonemes is interpreted as analogous to the number of syllables.); 
removing, by the one or more processors, high frequency terms from consideration that do not have the same number of phonemes as the utterance(see Col. 13, lines 13-37: “(50) In addition to the above tests, the lengths (number of characters) of the words may factor into the formula or formulas used, e.g., a search term alternate needs to be less than the length of the target alternate plus three, else the word will not be considered a match. Other criteria can be used in the evaluation. For example, the number of syllables of the words (which a recognizer can return) can be used to determine a match, e.g., it can be a requirement that the search term alternate and target alternate have to be within one syllable of one another, such as before even attempting the percentage test, (or as a separate test). For example, with such a "within n-syllable" (or syllables) test, if n is set as less than or equal to one, a search term alternate such as "probable" (three syllables) would be further compared against "probably" (three syllables), but would not be tested against the alternate word "probability" (five syllables). Again, the search term alternates may have different syllable-based rules than the target alternates, e.g., "rob" as a search term alternate may be compared with "probable," "probably" and "probability," but if "rob" was the target, it would be skipped over. 
Note that as used herein, the search term alternate or alternate need not be an actual word, but can be a fragment of a word or even a single character (including numbers or other symbols), e.g., "prob" can be searched.” Here, the high frequency terms are interpreted as associated with the alternate word list while the removal based on the number of phonemes (or syllables) of the transcription and the utterance is interpreted as associated with the “skipped over” step when the number of syllables of alternate words such as the example of “such a "within n-syllable" (or syllables) test, if n is set as less than or equal to one, a search term alternate such as "probable" (three syllables) would be further compared against "probably" (three syllables), but would not be tested against the alternate word "probability" (five syllables).”); and 
matching, by the one or more processors, the phonemes of remaining high frequency terms to the phonemes in the utterance (see Col. 13, lines 13-37, cited for the limitation above. Here, the matching of phonemes (or syllables) with the remaining high frequency terms (i.e., the alternate words) is interpreted as associated with the comparison performed after the alternate words not meeting the syllable number requirement are skipped over, such as the example of “such a "within n-syllable" (or syllables) test, if n is set as less than or equal to one, a search term alternate such as "probable" (three syllables) would be further compared against "probably" (three syllables), but would not be tested against the alternate word "probability" (five syllables).”).
Strohman et al in combination with Hannun et al.,  and Ingmarsson, and Finlay et al. and Williamson et al.  are considered to be analogous to the claimed invention because they are in the same field of endeavor in digital data processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have the method for improving speech recognition as taught by Strohman et al in combination with Hannun et al.,  and Ingmarsson, and Finlay et al. include the determining a number of phonemes in the utterance; removing high frequency terms from consideration that do not have the same number of phonemes as the utterance; and matching the phonemes of remaining high frequency terms to the phonemes in the utterance as taught by Williamson et al. in order to yield predictable results of providing the user with not only exact matching but also (optionally) alternate matching (Col. 2, lines 11-30 of Williamson et al.). (See KSR v. Teleflex).
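For illustration only, the "within n-syllable" test quoted from Williamson above may be sketched as follows, using the three words and syllable counts given in the quoted passage. This is a minimal sketch and not Williamson's implementation: the function name and the hard-coded lookup table are hypothetical, and the passage states that a recognizer can return the syllable counts.

```python
# Syllable counts exactly as stated in the quoted passage.
SYLLABLES = {"probable": 3, "probably": 3, "probability": 5}


def within_n_syllables(search_alt: str, target_alt: str, n: int = 1) -> bool:
    """True if the two alternates differ by at most n syllables."""
    return abs(SYLLABLES[search_alt] - SYLLABLES[target_alt]) <= n


# With n = 1, "probable" is compared further against "probably"
# but is not tested against "probability".
candidates = ["probably", "probability"]
kept = [w for w in candidates if within_n_syllables("probable", w, n=1)]
```

Setting n to zero in this sketch would require an equal count, which corresponds to the reading applied above to the claimed "same number of phonemes".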

Regarding claim 3, 10 and 17, Strohman et al. in combination with Hannun et al.,  and Ingmarsson, and Finlay et al. teaches the limitations of claim 2, 9, and 16 as above, and 
Strohman et al. further teaches the method further comprising: 
responsive to the phoneme not matching, determining, by the one or more processors, whether the utterance phoneme that does not match to the high frequency terms match phonemes from a sounds similar list for the corresponding high frequency term phoneme (see ¶ 6 of Preferred Embodiment section (page 4): “Language model 114 may be based on likelihood word sequence is present and sound value to generate the initial candidate transcription. For example, language model 114 may be based on word "CityZen Reservation" likelihood of occurrence is 0%, for example, because the word "CityZen" is not included in language model 114, word-Citizen Reservation "likelihood of occurrence is 70%, the" CityZen Reservation "indicating the speaker sound appear more similar to the" City "heel" Zen "rather than" Citizen "sound value to generate the" Citizen Reservation " candidate transcription. In some embodiments, language model 114 may be the probability indication of the word sequence is a likelihood score, and when generating initial candidate transcription, language model 114 can be sound matching score and the likelihood score. e.g., for phoneme SE-ET-EE-ZE, language model 114 can be sound matching score of 0.9, 0.9, 0.9. 0.7 and "City" heel "Zen" of the probability score of 0.0 multiplied to produce a score of 0, and the phoneme "SE-ET-I-ZE", language model 114 can be a sound matching score 0.9, 0.9, 0.7, 0.9 multiplied by the likelihood score of 0.9 "Citizen" to produce a score of 0.45. and selecting the word "Citizen", the cause of it is 0.45 higher than the scores "City" heel "Zen" score of 0.” Here, the high frequency terms phonemes are interpreted as associated with the terms present in the language model, while the sound similar list is interpreted as associated with the terms associated with the sound matching scores (i.e. SE-ET-EE-ZE, language model 114 can [consider] sound matching score of 0.9, 0.9, 0.9, 0.7 and "SE-ET-I-ZE", language model 114 can [consider] a sound matching score 0.9, 0.9, 0.7, 0.9).).
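For illustration only, the arithmetic in the quoted Strohman passage may be reproduced as follows. The sketch assumes that the combined score is the plain product of the per-phoneme sound matching scores and the likelihood score, which is how the passage describes it ("multiplied"); the function name is hypothetical. Under that assumption the "Citizen" product is approximately 0.459, which is consistent with the 0.45 stated in the passage.

```python
from typing import Sequence


def combined_score(sound_scores: Sequence[float], likelihood: float) -> float:
    """Product of the per-phoneme sound matching scores and the likelihood."""
    score = likelihood
    for s in sound_scores:
        score *= s
    return score


# "City" followed by "Zen" (SE-ET-EE-ZE): likelihood 0.0 in the quoted passage.
city_zen = combined_score([0.9, 0.9, 0.9, 0.7], likelihood=0.0)

# "Citizen" (SE-ET-I-ZE): likelihood 0.9 in the quoted passage.
citizen = combined_score([0.9, 0.9, 0.7, 0.9], likelihood=0.9)

# The higher product selects "Citizen", as in the quoted passage.
selected = "Citizen" if citizen > city_zen else "City Zen"
```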

Claims 5, 12, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Strohman et al. (US 20170229124 A1) further in view of Hannun et al. (US 20160171974 A1), Ingmarsson (US 9959864 B1), and Finlay et al. (US 20210065679 A1) as applied to claims 1, 8, and 15 above, and further in view of Thomson (US 20220059077 A1)

Regarding claim 5, 12, and 19, Strohman et al. in combination with Hannun et al.,  and Ingmarsson, and Finlay et al. teach the limitations of claim 1, 8, and 15 as above.
However, Strohman et al. in combination with Hannun et al., Ingmarsson, and Finlay et al. does not explicitly teach:
wherein the transcribing is performed by an automatic speech recognition module based on a deep neural network.
Thomson does teach:
wherein the transcribing is performed by an automatic speech recognition module based on a deep neural network (see ¶ [0253]: “In some embodiments, the model 1214 may be a deep neural network model or other type of machine learning model that may be trained based on providing parameters and a result. In some embodiments, the model may be a language model or an acoustic model that may be used by an ASR system to transcribe audio. Alternately or additionally, the model may be another type of model used by an ASR system to transcribe audio.”).
Strohman et al. in combination with Hannun et al., Ingmarsson, and Finlay et al. and Thomson are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Strohman et al. in combination with Hannun et al., Ingmarsson, and Finlay et al. to incorporate the teachings of Thomson of the transcribing being performed by an automatic speech recognition module based on a deep neural network which provides the benefit of an improved technology with respect to audio transcriptions and real-time generation of audio transcriptions ([0039] of Thomson).

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Strohman et al. (US 20170229124 A1) further in view of Hannun et al. (US 20160171974 A1), Ingmarsson (US 9959864 B1), and Finlay et al. (US 20210065679 A1) as applied to claims 1, 8, and 15 above, and further in view of Tiruveedhula (US 20210042657 A1).

Regarding claims 7 and 14, Strohman et al. in combination with Hannun et al., Ingmarsson, and Finlay et al. teaches all of the limitations as in claims 1 and 8, above.
However, Strohman et al. in combination with Ingmarsson and Finlay et al. does not explicitly teach wherein the receiving is performed by a virtual assistant, at a specific node of the virtual assistant, wherein the high frequency terms over a time period have been identified for the specific node.
Tiruveedhula does teach:
wherein the receiving is performed by a virtual assistant, at a specific node of the virtual assistant, wherein the high frequency terms over a time period have been identified for the specific node (see ¶ [0080]: “Based on the clothes shopping application's initial response, the user speaks into the mobile device with a second voice query, “I'm going to a wedding.” The virtual assistant interface receives the voice query and performs speech-to-text to obtain a text version of the user query. The virtual assistant interface determines that the user query is directed to the clothes shopping application and transmits the text of the user query to the clothes shopping application. […].” Here, the specific node of the virtual assistant is interpreted as the clothes shopping application, and the high frequency terms are interpreted as analogous to the voice query at a given time (i.e., “I'm going to a wedding.”).).
Strohman et al. in combination with Hannun et al., Ingmarsson, and Finlay et al., and Tiruveedhula are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Strohman et al. in combination with Hannun et al., Ingmarsson, and Finlay et al. to incorporate the teachings of Tiruveedhula of wherein the receiving is performed by a virtual assistant, at a specific node of the virtual assistant, wherein the high frequency terms over a time period have been identified for the specific node, which provides the benefit of providing timely and relevant data based on subsequent user queries (¶ [0060] of Tiruveedhula et al.).

Regarding claim 20, Strohman et al. in combination with Hannun et al., Ingmarsson, and Finlay et al. teaches all of the limitations as in claim 15 above, and further teaches:
The computer program product of claim 15, further comprising instructions to: 
receive the utterance (see ¶ 1 of Invention contents section (page 2) citation as in claims 6 and 13.), 

However, Strohman et al. in combination with Ingmarsson and Finlay et al. does not explicitly teach wherein the receiving is performed by a virtual assistant, at a specific node of the virtual assistant, wherein the high frequency terms over a time period have been identified for the specific node.
Tiruveedhula does teach:
wherein the receiving is performed by a virtual assistant, at a specific node of the virtual assistant, wherein the high frequency terms over a time period have been identified for the specific node (see ¶ [0080] citation as in claims 7 and 14.). 
Strohman et al. in combination with Ingmarsson and Finlay et al., and Tiruveedhula are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech recognition. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Strohman et al. in combination with Hannun et al., Ingmarsson, and Finlay et al. to incorporate the teachings of Tiruveedhula of wherein the receiving is performed by a virtual assistant, at a specific node of the virtual assistant, wherein the high frequency terms over a time period have been identified for the specific node, which provides the benefit of providing timely and relevant data based on subsequent user queries (¶ [0060] of Tiruveedhula et al.).

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Keisha Y Castillo-Torres whose telephone number is (571)272-3975. The examiner can normally be reached Monday - Friday, 8:30 am - 4:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached on (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

Keisha Y. Castillo-Torres
Examiner
Art Unit 2659


/Keisha Y. Castillo-Torres/Examiner, Art Unit 2659
/Paras D Shah/Primary Examiner, Art Unit 2659

09/23/2022