DETAILED ACTION
This office action is in response to Applicant’s submission filed on 9/30/2022. Claims 1, 4-6, 10, 11, 16, and 17 were amended. As such, claims 1-20 have been examined.


Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Response to Arguments
Applicant’s arguments and amendments in the Amendment filed 9/30/2022 (herein “Amendment”) with respect to the claim objections have been fully considered and are persuasive; as such, the claim objections are withdrawn.

Applicant’s arguments filed in the Amendment with respect to the interpretation of various claims under 35 U.S.C. 112(f) have been fully considered, but they are not persuasive. Consequently, the 35 U.S.C. 112(f) claim interpretation is maintained.

Applicant’s arguments filed in the Amendment with respect to the 35 U.S.C. 103 rejection raised in the previous office action have been fully considered, but are moot in view of the new grounds of rejection, which were necessitated by Applicant’s amendment.
Therefore, although all of Applicant’s arguments and amendments filed in the Amendment have been fully considered, they are not persuasive except as noted above. Please see below for more detail, including updated citations and obviousness rationale.


Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are:
“a first filter that receives,” “a first machine learning system and a second machine learning system that analyzes,” and “a second filter that receives,” as claimed in claim 11.
“a phonetic encoding component that determines,” as claimed in claim 16.


Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 2, 4, 7, 8, 11, 12, 15, 17, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Mairesse (US9558740B1), Komissarchik et al. (US20170337923A1) (herein “Komissarchik”), and Steelberg (US20200286485A1).

Mairesse and Steelberg were applied in the previous Office Action.
Regarding claims 1 and 17, Mairesse teaches [a computer-implemented method for detecting and resolving mis-transcriptions in a transcript generated by an automatic speech recognition (ASR) system when transcribing spoken words, the method comprising - claim 1] and [a non-transitory computer readable storage medium containing computer program instructions for detecting and resolving mis-transcriptions in a transcript generated by an automatic speech recognition system when transcribing spoken words, the computer program instructions, when executed by a processor, causing the processor to perform an operation comprising: - claim 17] (Mairesse, Col. 22, lines 24-31: “Aspects of the disclosed system may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure.”, and Col. 20, lines 34-39: “A device's computer instructions may be stored in a non-transitory manner in non-volatile memory [706/806], storage [708/808], or an external device[s]. Alternatively, some or all of the executable instructions may be embedded in hardware or firmware on the respective device in addition to or instead of software.”).
receiving a machine language generated transcript of a speech signal by at least one of a first machine learning system and a second machine learning system; (Mairesse, Col. 3, lines 57-63: “A user 10 may speak an utterance including a command. The user's utterance is captured by a microphone of device 110. The system may then determine [152] audio data corresponding to the utterance, for example as a result of the microphone converting the sound to an audio data signal. The system may then perform [154] ASR processing [transcript generated] on the audio data, for example using techniques described below.”).
analyzing, by the at least one of the first machine learning system and the second machine learning system, the machine language generated transcript to find a region of low confidence indicative of a mis-transcription, (Mairesse, Col. 3, lines 36-38: “Each processing point may use a model configured using machine learning techniques.”, and Col. 3, line 64 - Col. 4, line 8: “The system may then process [156] the ASR results [transcription] with a first model [machine learning system] to determine if disambiguation [low confidence region] of ASR hypotheses is desired. The first model [machine learning system] may be trained to determine, using confidence scores corresponding to a plurality of ASR hypotheses, whether to select a single ASR hypothesis or whether to perform further selection from among the plurality of ASR hypotheses. If disambiguation [low confidence region] is desired, the system may process [158] ASR results [mis-transcription] with a second model [machine learning system] to determine what hypotheses should be selected for disambiguation. The second model [machine learning system] may be trained to determine, also using confidence scores, which of the plurality of ASR hypothesis to select for disambiguation.”).
analyzing, by the at least one of the first machine learning system and the second machine learning system, the region of low confidence and predicting an improvement to the region of low confidence indicative of the mis-transcription; (Mairesse, Col. 3, line 64 - Col. 4, line 8: “The system may then process [156] the ASR results [transcription] with a first model [machine learning system] to determine if disambiguation [low confidence region] of ASR hypotheses is desired. The first model [machine learning system] may be trained to determine, using confidence scores corresponding to a plurality of ASR hypotheses, whether to select a single ASR hypothesis or whether to perform further selection from among the plurality of ASR hypotheses. If disambiguation [low confidence region] is desired, the system may process [158] ASR results [mis-transcription] with a second model [machine learning system] to determine what hypotheses should be selected for disambiguation. The second model [machine learning system] may be trained to determine, also using confidence scores, which of the plurality of ASR hypothesis to select for disambiguation.”).
Mairesse fails to explicitly disclose, however, Komissarchik teaches the predicted improvement comprising a homophone of the found region having substantially a same pronunciation and different meaning therefrom, (Komissarchik, Par. 0050: “To make voice-based dialog more robust words/phrases used in it should be chosen to be less prone to user mispronunciation and ASR confusion. Major factor in such a confusion is phonetic proximity between different words/phrases. If two words have zero distance in their phonetic pronunciation, they are called homophones. To avoid confusion between homophones human languages are usually built in such a way that homophones have different grammar roles [e.g. “you” vs. “yew”, or “to” vs. “too”]. If they just differ in one phoneme, they are called a minimal pair. There are no similar grammar based provisions in a language for minimal pairs though. So, in reality, when user mispronounces a particular phoneme (or sequence of them), words that normally mean totally different things suddenly become de-facto homophones. Quite similar situation takes place for ASR. If two words are pronounced similarly ASR can recognize one word as another. However, if a word/phrase is quite distant from other words/phrases from phonetic standpoint then confusion due to mispronunciation or ASR errors is less likely. That is the premise of the method of building robust voice-based dialogs.”, and Par. 0066: “Pronunciation peculiarities/errors of a group [e.g. people that share common first language] or an individual introduce “disturbances” into the relationships between entries in Synonyms and Phrase Similarity Repositories. For example, two words/phrases from these repositories suddenly become undistinguishable [homophones] or can easily confuse ASR. This is as if repository “contracts” and words/phrases became “glued” together. So the phrases that were good alternatives become less desirable. Furthermore, certain words/phrases become simply unusable because user cannot reliably pronounce them and ASR provides no results at all.”).
While Komissarchik teaches pronunciations and phonetic similarity and different meaning, Komissarchik does not explicitly teach that the same pronunciation indicates a higher level of phonetic similarity than one homophone having only a similar pronunciation. However, it is well-known to one of ordinary skill in the art that same pronunciations will have a higher level of phonetic similarity than one homophone having only a similar (not same) pronunciation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komissarchik to include that same pronunciations indicate a higher level of phonetic similarity than one homophone having only a similar pronunciation at least because doing so would be combining prior art elements according to known methods to yield predictable results. See MPEP 2143(I)(A).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Mairesse in view of Komissarchik to predict an improvement comprising a homophone of the found region having substantially a same pronunciation and different meaning therefrom, wherein the same pronunciation indicates a higher level of phonetic similarity than one homophone having only a similar pronunciation, in order to improve quality of speech recognition and efficiency of speaker adaptation, as evidenced by Komissarchik (See Par. 0002).
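For illustration only, and not as a characterization of how Komissarchik or the claimed invention is actually implemented, the distinction drawn above between a “same” pronunciation (zero phonetic distance, i.e., homophones) and a merely “similar” pronunciation (e.g., a minimal pair) can be sketched as an edit distance over phoneme sequences. The phoneme transcriptions, the “ten”/“tin” pair, and the names `PRONUNCIATIONS`, `phoneme_distance`, and `relation` are hypothetical and do not appear in any cited reference.

```python
from typing import Sequence

# Hypothetical ARPAbet-like transcriptions, chosen for illustration only.
PRONUNCIATIONS = {
    "to": ("T", "UW"),
    "too": ("T", "UW"),
    "ten": ("T", "EH", "N"),
    "tin": ("T", "IH", "N"),
}


def phoneme_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein edit distance computed over phonemes rather than letters."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete a phoneme
                            curr[j - 1] + 1,      # insert a phoneme
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def relation(w1: str, w2: str) -> str:
    """Classify two words by the distance between their pronunciations."""
    d = phoneme_distance(PRONUNCIATIONS[w1], PRONUNCIATIONS[w2])
    if d == 0:
        return "homophone"     # same pronunciation
    if d == 1:
        return "minimal pair"  # similar pronunciation: one phoneme edit apart
    return "distinct"
```

Under this sketch a distance of zero is necessarily a closer phonetic match than any nonzero distance, which is the ordering relied upon in the rationale above.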
Mairesse and Komissarchik fail to explicitly disclose, however, Steelberg teaches selecting, by a word selector, a replacement word for the mis-transcription based on the predicted improvement to the region of low confidence; and replacing, by the word selector, the mis-transcription by the replacement word. (Steelberg, Par. 0127: “Truth engine 1140 includes algorithms and instructions that, when executed by a processor, cause the processor to identify transcription errors [mis-transcription] in one or more parts of a transcribed portion, for example, by identifying words with confidence score below a predetermined threshold. The truth engine 1140 may then correct [select] the identified errors, for example, by replacing the words with low confidence score with correct words. In some embodiments, the truth engine may utilize machine learning model to find the correct replacement words. The truth engine 1140 may also label [or tag] the corrected words.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Mairesse and Komissarchik in view of Steelberg to select, by a word selector, a replacement word for the mis-transcription based on the predicted improvement to the region of low confidence, and to replace, by the word selector, the mis-transcription by the replacement word, in order to generate a revised transcription based on the received reward function and reach a desired accuracy threshold for transcription, as evidenced by Steelberg (See Par. 0032).
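As a minimal sketch only, assuming a per-word confidence score and a simple lookup in place of a trained model, the threshold-based correction described in the quoted passage of Steelberg (identifying words with a confidence score below a predetermined threshold and replacing them) might look as follows. The function `correct_transcript`, the table `HOMOPHONE_FIXES`, the helper `lookup`, and the 0.5 threshold are hypothetical and are not taken from any cited reference.

```python
from typing import Callable, List, Tuple

# Hypothetical replacement table standing in for a learned suggestion model.
HOMOPHONE_FIXES = {"yew": "you"}


def lookup(word: str) -> str:
    """Return a suggested replacement, or the word itself if none is known."""
    return HOMOPHONE_FIXES.get(word, word)


def correct_transcript(
    words: List[Tuple[str, float]],
    suggest: Callable[[str], str],
    threshold: float = 0.5,  # assumed value; no reference specifies one
) -> List[str]:
    """Replace each word whose confidence falls below the threshold."""
    corrected = []
    for word, confidence in words:
        if confidence < threshold:
            corrected.append(suggest(word))  # low confidence: substitute
        else:
            corrected.append(word)           # high confidence: keep as-is
    return corrected
```

For example, `correct_transcript([("thank", 0.9), ("yew", 0.2)], lookup)` leaves the first word untouched and substitutes the second.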

Regarding claims 2 and 18, Mairesse teaches wherein the first machine learning system and the second machine learning system are connected in tandem. (Mairesse, Col. 3, lines 36-45: “Each processing point may use a model configured using machine learning techniques. A first model may be trained to determine whether speech processing results should be disambiguated before passing results to be executed. A second model may be trained to determine what potential speech processing results should be displayed for user selection [if any], following selection of disambiguation by the first model. A system for operating this improvement is illustrated in FIG. 1A.”). Note: the models are applied serially, which indicates a tandem connection.

Regarding claims 4 and 12, Mairesse and Komissarchik do not explicitly teach, but Steelberg further teaches wherein the first machine learning system comprises a research on artificial intelligence (AI) in systems and linguistics (RAILS) model architecture. (Given that RAILS is not a term of art known to one of ordinary skill in the art, the broadest reasonable interpretation is determined in view of the Specification, where Applicant has acted as their own lexicographer [MPEP 2111.01(IV)] in defining the term “RAILS” in Par. 99 of the originally filed specification, understood to be a real-time model receiving low word confidences and outputting higher probability replacements. In view of this definition, Steelberg teaches, at Par. 0127: “Truth engine 1140 includes algorithms and instructions that, when executed by a processor, cause the processor to identify transcription errors [low word confidence] in one or more parts of a transcribed portion, for example, by identifying words with confidence score below a predetermined threshold. The truth engine 1140 may then correct [higher probability replacements] the identified errors, for example, by replacing the words with low confidence score with correct words. In some embodiments, the truth engine may utilize machine learning model to find the correct replacement words.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Mairesse and Komissarchik in view of Steelberg such that the first machine learning system comprises a RAILS model architecture, in order to generate a revised transcription based on the received reward function, as evidenced by Steelberg (See Par. 0032).

Regarding claims 7 and 15, Mairesse and Komissarchik do not explicitly teach, but Steelberg further teaches wherein the word selector comprises a trained decision trees model. (Steelberg, Par. 0054: “In contrast, in some embodiments, modeling module 200-2 may train one or more transcription models using both existing media files and the most recent data [transcribed data] available for the input media file. In some embodiments, the training modules 200-1 and 200-2 may include machine learning algorithms such as, but not limited to, deep learning neural networks; gradient boosting, random forests, support vector machine learning, decision trees, variational auto-encoders [VAE], generative adversarial networks, recurrent neural networks, and convolutional neural networks [CNN], faster R-CNNs, mask R-CNNs, and SSD neural networks.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Mairesse and Komissarchik in view of Steelberg such that the word selector comprises a trained decision trees model, in order to generate a revised transcription based on the received reward function, as evidenced by Steelberg (See Par. 0032).

Regarding claim 8, Mairesse, and Komissarchik do not explicitly teach, but Steelberg further teaches wherein the word selector comprises a Random Forests model. (Steelberg, Par. 0038; “At 110, an initial transcription neural network model can be used to select an initial transcription engine for transcribing the input media file [or a portion of the input media file]. The initial a transcription neural network model [“transcription model”] that can be previously trained. Based the features profile of the input media file, the transcription model may then use one or more machine learning algorithms to generate a list of one or more transcription engines [candidate engines] with the highest predicted transcription accuracy. The one or more machine learning algorithms may include, but not limited to: a deep learning neural network; a gradient boosting algorithm (which may also be referred to as gradient boosted trees), and a random forest algorithm. In some embodiments, all three of the mentioned machine learning algorithms may be used—using model stacking—to create a multi-model.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Mairesse and Komissarchik in view of Steelberg such that the word selector comprises a Random Forests model, in order to generate a revised transcription based on the received reward function, as evidenced by Steelberg (See Par. 0032).

Regarding claim 11, Mairesse teaches a system for detecting and resolving mis-transcriptions in a transcript generated by an automatic speech recognition system when transcribing spoken words, the system comprising: a first filter that receives a machine language generated transcript of a speech signal, the first filter including a first machine learning system and a second machine learning system (Mairesse, Col. 3, lines 57-63: “A user 10 may speak an utterance including a command. The user's utterance is captured by a microphone of device 110. The system may then determine [152] audio data corresponding to the utterance, for example as a result of the microphone converting the sound to an audio data signal. The system may then perform [154] ASR processing [transcript generated] on the audio data, for example using techniques described below.”).
that analyzes the machine language generated transcript in tandem or in parallel and find a region of low confidence indicative of a mis-transcription, and; (Mairesse, Col. 3, lines 36-38: “Each processing point may use a model configured using machine learning techniques.”, and Col. 3, line 64 - Col. 4, line 8: “The system may then process [156] the ASR results [transcription] with a first model [machine learning system] to determine if disambiguation [low confidence region] of ASR hypotheses is desired. The first model [machine learning system] may be trained to determine, using confidence scores corresponding to a plurality of ASR hypotheses, whether to select a single ASR hypothesis or whether to perform further selection from among the plurality of ASR hypotheses. If disambiguation [low confidence region] is desired, the system may process [158] ASR results [mis-transcription] with a second model [machine learning system] to determine what hypotheses should be selected for disambiguation. The second model [machine learning system] may be trained to determine, also using confidence scores, which of the plurality of ASR hypothesis to select for disambiguation.”).
the first machine learning system and the second machine learning system that analyzes the region of low confidence and predict an improvement to the region of low confidence, [[the predicted improvement comprising a homophone of the found region having substantially a same pronunciation and different meaning therefrom, wherein the same pronunciation indicates a higher level of phonetic similarity than one homophone having only a similar pronunciation]]; and a second filter that receives the machine generated transcript and the predicted improvement to the region of low confidence from the first filter, (Mairesse, Col. 3, line 64 - Col. 4, line 8: “The system may then process [156] the ASR results [transcription] with a first model [machine learning system] to determine if disambiguation [low confidence region] of ASR hypotheses is desired. The first model [machine learning system] may be trained to determine, using confidence scores corresponding to a plurality of ASR hypotheses, whether to select a single ASR hypothesis or whether to perform further selection from among the plurality of ASR hypotheses. If disambiguation [low confidence region] is desired, the system may process [158] ASR results [mis-transcription] with a second model [machine learning system] to determine what hypotheses should be selected for disambiguation. The second model [machine learning system] may be trained to determine, also using confidence scores, which of the plurality of ASR hypothesis to select for disambiguation.”).
Mairesse fails to explicitly disclose, however, Komissarchik teaches the predicted improvement comprising a homophone of the found region having substantially a same pronunciation and different meaning therefrom, (Komissarchik, Par. 0050: “To make voice-based dialog more robust words/phrases used in it should be chosen to be less prone to user mispronunciation and ASR confusion. Major factor in such a confusion is phonetic proximity between different words/phrases. If two words have zero distance in their phonetic pronunciation, they are called homophones. To avoid confusion between homophones human languages are usually built in such a way that homophones have different grammar roles [e.g. “you” vs. “yew”, or “to” vs. “too”]. If they just differ in one phoneme, they are called a minimal pair. There are no similar grammar based provisions in a language for minimal pairs though. So, in reality, when user mispronounces a particular phoneme (or sequence of them), words that normally mean totally different things suddenly become de-facto homophones. Quite similar situation takes place for ASR. If two words are pronounced similarly ASR can recognize one word as another. However, if a word/phrase is quite distant from other words/phrases from phonetic standpoint then confusion due to mispronunciation or ASR errors is less likely. That is the premise of the method of building robust voice-based dialogs.”, and Par. 0066: “Pronunciation peculiarities/errors of a group [e.g. people that share common first language] or an individual introduce “disturbances” into the relationships between entries in Synonyms and Phrase Similarity Repositories. For example, two words/phrases from these repositories suddenly become undistinguishable [homophones] or can easily confuse ASR. This is as if repository “contracts” and words/phrases became “glued” together. So the phrases that were good alternatives become less desirable. Furthermore, certain words/phrases become simply unusable because user cannot reliably pronounce them and ASR provides no results at all.”).
While Komissarchik teaches pronunciations and phonetic similarity and different meaning, Komissarchik does not explicitly teach that the same pronunciation indicates a higher level of phonetic similarity than one homophone having only a similar pronunciation. However, it is well-known to one of ordinary skill in the art that same pronunciations will have a higher level of phonetic similarity than one homophone having only a similar (not same) pronunciation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Komissarchik to include that same pronunciations indicate a higher level of phonetic similarity than one homophone having only a similar pronunciation at least because doing so would be combining prior art elements according to known methods to yield predictable results. See MPEP 2143(I)(A).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Mairesse in view of Komissarchik to predict an improvement comprising a homophone of the found region having substantially a same pronunciation and different meaning therefrom, wherein the same pronunciation indicates a higher level of phonetic similarity than one homophone having only a similar pronunciation, in order to improve quality of speech recognition and efficiency of speaker adaptation, as evidenced by Komissarchik (See Par. 0002).
Mairesse and Komissarchik fail to explicitly disclose, however, Steelberg teaches based on the predicted improvement to the region of low confidence, selects a replacement word for the mis-transcription, and replaces the mis-transcription by the replacement word. (Steelberg, Par. 0127: “Truth engine 1140 includes algorithms and instructions that, when executed by a processor, cause the processor to identify transcription errors [mis-transcription] in one or more parts of a transcribed portion, for example, by identifying words with confidence score below a predetermined threshold. The truth engine 1140 may then correct [select] the identified errors, for example, by replacing the words with low confidence score with correct words. In some embodiments, the truth engine may utilize machine learning model to find the correct replacement words. The truth engine 1140 may also label [or tag] the corrected words.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Mairesse and Komissarchik in view of Steelberg to select, based on the predicted improvement to the region of low confidence, a replacement word for the mis-transcription, and to replace the mis-transcription by the replacement word, in order to generate a revised transcription based on the received reward function, as evidenced by Steelberg (See Par. 0032).


Claims 3 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Mairesse, Komissarchik, and Steelberg, and in further view of Scott Fischthal (US5822741).

Fischthal was applied in the previous Office Action.
Regarding claims 3 and 19, Mairesse, Komissarchik, and Steelberg fail to explicitly disclose; however, Fischthal teaches wherein the first machine learning system and the second machine learning system are connected in parallel (Fischthal, Col. 2, lines 6-8: “The neural network or artificial neural system is defined by a plurality of these simple, densely interconnected processing units which operate in parallel.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Mairesse, Komissarchik, and Steelberg in view of Fischthal to wherein the first machine learning system and the second machine learning system are connected in parallel, in order to employ genetic algorithms, as evidenced by Fischthal (See Col. 6, lines 35-36).

Claims 9 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Mairesse, Komissarchik, and Steelberg, and in further view of Dua (US20200019863A1).

Dua was applied in the previous Office Action.
Regarding claims 9 and 16, Mairesse, Komissarchik, and Steelberg fail to explicitly disclose; however, Dua teaches a dataset containing all unigrams, bigrams, trigrams and quadgrams present in a corpus of transcripts and their respective probabilities. (Dua, Par. 0124: “Each row of the concatenated matrix is processed by a neural network to generate a bag-of-ngrams [BoN] vector data structure representing the probability distribution over the ngrams of the vocabulary [step 618].”, and Par. 0022: “The bag-of-ngrams is encoded as a probability distribution over a full vocabulary V. Ngrams that do not belong to the bag have a probability of zero, while ngrams in the bag have probability larger than zero.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Mairesse, Komissarchik, and Steelberg in view of Dua to employ a dataset containing all unigrams, bigrams, trigrams and quadgrams present in a corpus of transcripts and their respective probabilities, in order to improve knowledge and learn with each iteration and interaction through machine learning processes, as evidenced by Dua (See Par. 0073).


Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Mairesse, Komissarchik, and Steelberg, and in further view of Gruber et al. (US20140365216A1) (herein “Gruber”).

Regarding claim 10, Mairesse, Komissarchik, and Steelberg fail to explicitly disclose; however, Gruber teaches a phonetic encoding component for determining the same pronunciation. (Gruber, Par. 0133: “Turning to FIG. 5B, by storing the second phonetic representation [for speech synthesis] in association with the text string, the digital assistant is able to use the user-specified pronunciation in speech outputs that include the word. For example, in some implementations, after storing the second phonetic representation in association with the text string, the digital assistant synthesizes a speech output corresponding to the text string using the second phonetic representation [518]. Accordingly, the synthesized speech output will sound substantially similar to the word in the speech input [e.g., the word as spoken by the user]. As a specific example, after storing a second phonetic representation ‘fill-eep-ay’ [corresponding to the user-specified pronunciation of the word in a speech synthesis phonetic alphabet], the digital assistant synthesizes a speech output using the user-specified pronunciation of the word ‘Philippe’ [e.g., ‘Okay, I'm placing a telephone call to fill-eep-ay.’]”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Mairesse, Komissarchik, and Steelberg in view of Gruber to employ a phonetic encoding component for determining the same pronunciation, in order to enhance the user experience and potentially increase the user's confidence in the capabilities of the digital assistant, as evidenced by Gruber (See Par. 0008).


Claims 5, 13, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Mairesse, Komissarchik, and Steelberg, and in further view of Hilleli (US20210099317A1).

Hilleli was applied in the previous Office Action.
Regarding claims 5 and 13, Mairesse, Komissarchik, and Steelberg fail to explicitly disclose; however, Hilleli teaches wherein the second machine learning system comprises a bidirectional encoder representation from transformers (BERT) model architecture. (Hilleli, Par. 0086: “In an example illustration of a model that may be used to define beginnings and/or ends of action items, BERT models or other similar models can be used. BERT generates a language model by using an encoder to read content all at once or in parallel [i.e., it is bidirectional], as opposed to reading text from left to right, for example.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Mairesse, Komissarchik, and Steelberg in view of Hilleli to wherein the second machine learning system comprises a BERT model architecture, in order to improve these virtual assistants because they can clarify action items using contextual data, as evidenced by Hilleli (See Par. 0033).

Regarding claim 20, Mairesse and Komissarchik do not explicitly teach, but Steelberg teaches wherein the first machine learning system comprises a RAILS model architecture, [[the second machine learning system comprises a BERT model architecture,]] and the word selector comprises a trained decision trees model. (Given that RAILS is not a term of art known to one of ordinary skill in the art, the broadest reasonable interpretation is determined in view of the Specification, where Applicant has acted as their own lexicographer [MPEP 2111.01[IV]] in defining the term “RAILS” in Par. 99 of the originally filed specification, understood to be a real-time model receiving low word confidences, and outputting higher probability replacements, and in view of this definition, Steelberg teaches at Par. 0127: “Truth engine 1140 includes algorithms and instructions that, when executed by a processor, cause the processor to identify transcription errors [low word confidence] in one or more parts of a transcribed portion, for example, by identifying words with confidence score below a predetermined threshold. The truth engine 1140 may then correct [higher probability replacements] the identified errors, for example, by replacing the words with low confidence score with correct words. In some embodiments, the truth engine may utilize machine learning model to find the correct replacement words.”)
and the word selector comprises a trained decision trees model. (Steelberg, Par. 0054: “In contrast, in some embodiments, modeling module 200-2 may train one or more transcription models using both existing media files and the most recent data [transcribed data] available for the input media file. In some embodiments, the training modules 200-1 and 200-2 may include machine learning algorithms such as, but not limited to, deep learning neural networks; gradient boosting, random forests, support vector machine learning, decision trees, variational auto-encoders [VAE], generative adversarial networks, recurrent neural networks, and convolutional neural networks [CNN], faster R-CNNs, mask R-CNNs, and SSD neural networks.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Mairesse and Komissarchik in view of Steelberg to wherein the first machine learning system comprises a RAILS model architecture, and the word selector comprises a trained decision trees model, in order to generate a revised transcription based on the received reward function, as evidenced by Steelberg (See Par. 0032).
Mairesse, Komissarchik, and Steelberg fail to explicitly disclose; however, Hilleli teaches [[wherein the first machine learning system comprises a RAILS model architecture,]] the second machine learning system comprises a BERT model architecture, [[and the word selector comprises a trained decision trees model.]] (Hilleli, Par. 0086: “In an example illustration of a model that may be used to define beginnings and/or ends of action items, BERT models or other similar models can be used. BERT generates a language model by using an encoder to read content all at once or in parallel [i.e., it is bidirectional], as opposed to reading text from left to right, for example.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Mairesse, Komissarchik, and Steelberg in view of Hilleli to wherein the second machine learning system comprises a BERT model architecture, in order to improve these virtual assistants because they can clarify action items using contextual data, as evidenced by Hilleli (See Par. 0033).

Claims 6 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Mairesse, Komissarchik, and Steelberg, and in further view of Mao (“GPT3 and SEO: why AI will revolutionize your content forever”, Feb 2019, Jenni.ai Blog, accessible at: https://jenni.ai/blog/gpt3-seo-content-marketing).

Mao was applied in the previous Office Action.
Regarding claims 6 and 14, Mairesse, Komissarchik, and Steelberg fail to explicitly disclose; however, Mao teaches wherein the second machine learning system comprises a generative pre-trained transformer 3 (GPT-3) model architecture. (Mao, Page 1: “OpenAI has released a new version of Generative Pre-trained Transformer version 3 [in short, GPT-3 or GPT 3] with beta API access GPT 3, much like its predecessor GPT 2, is a large deep neural network that can automatically generate text ... It is an advanced AI that learns how to imitate human writing from the web.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Mairesse, Komissarchik, and Steelberg in view of Mao to wherein the second machine learning system comprises a GPT-3 model architecture, in order to learn without human labeled data, as evidenced by Mao (See page 4).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. Abdulkader et al. (US-20200286487A1) teaches, at Par. 0042: “For each inputted set of transcriptions 226-228 and/or associated features, machine learning model 208 may generate a score (e.g., scores 230) reflecting the accuracy or correctness of the transcription from the contributor ASR, based on the corresponding transcriptions 228 and/or distribution of transcriptions 228 produced by selector ASRs 224. For example, machine learning model 208 may produce a score that represents an estimate of the overall or cumulative error rate between the transcription from the contributor ASR and the corresponding collection of transcriptions 228 produced by selector ASRs 224. During calculation of the score, machine learning model 208 may apply different weights to certain transcriptions 228 and/or portions of one or more transcriptions 226-228 (e.g., words of different lengths, words at the beginning or end of each transcription, etc.). As a result, machine learning model 208 may use transcriptions 228 from selector ASRs 224 as ‘votes’ regarding the correctness or accuracy of a transcription from a given contributor ASR.”
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DARIOUSH AGAHI whose telephone number is (408)918-7689. The examiner can normally be reached Monday - Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/DARIOUSH AGAHI/Examiner, Art Unit 2656           

/BHAVESH M MEHTA/Supervisory Patent Examiner, Art Unit 2656