DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims
Claims 2 and 5 are cancelled leaving claims 1, 3-4 and 6-20 pending in this application. 

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(d):
(d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

The following is a quotation of pre-AIA  35 U.S.C. 112, fourth paragraph:
Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA  35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

Claim 14 is rejected under 35 U.S.C. 112(d) or pre-AIA  35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which it depends, or for failing to include all the limitations of the claim upon which it depends.  Claim 14 is analogous to now canceled claim 5 whose limitations have now been incorporated into the independent claims. The claim therefore fails to further limit the scope of independent claim 10 on which it is dependent.  Applicant may cancel the claim, amend the claim to place the claim in proper dependent form, rewrite the claim in independent form, or present a sufficient showing that the dependent claim complies with the statutory requirements.

Claim Rejections - 35 USC § 102
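In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.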
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 3, 4, 6, 7, 10-16 and 19-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Printz (U.S. Patent Application Publication 2018/0068661).
As per claim 1, Printz discloses: 
a method for transcription, the method comprising: 
receiving, from a first transcription engine, one or more transcribed portions of a media file (Paragraphs [0116]-[0124] - "primary recognizer may include a conventional open dictation automatic speech recognition (ASR) system. Such a system accepts as input an audio signal comprising human speech. It may produce as output a textual transcription of this input"); 
determining a confidence of accuracy value for each of the one or more transcribed portions (Paragraphs [0116]-[0124], [0200], [0206] - "It may also attach an ASR confidence score to each transcribed word and optionally to the output transcription as a whole"); 
identifying, by a transcription analyzer, a first transcribed portion, from the one or more transcribed portions, with a first confidence value below a first predetermined threshold (Paragraphs [0116]-[0124], [0135]-[0139], [0200], [0206] - "the audio input is first passed to the primary recognizer, which generates an initial nominal transcription of its input"; "This initial transcription may well be incorrect"; "the system may determine if one or more word confidence values are deficient, e.g., have confidence levels falling below a threshold"); 
requesting analysis of the first transcribed portion (Paragraphs [0135]-[0139], [0222] - "this imperfect initial transcription may be presented to the natural language understanding module. This module processes the input word sequence, and determines by application of standard methods of computational linguistics"; "The system may then proceed to the secondary recognition"; "the system may proceed to block 1440, where the system may submit the queued hypotheses to the client system for analysis, or depending upon the topology, to the appropriate component for analyzing the hypotheses"); 
receiving, in response to requesting for analysis, an analysis result having a revised-transcription portion of the first transcribed portion, wherein the revised-transcription portion comprises one or more parts of the first transcribed portion that have been revised (Paragraphs [0135]-[0139] - "the audio that comprises the extraneous words "tell me how to get to" has been suppressed from the secondary recognizers input, and as the secondary recognizer is constrained by its grammar to recognize only the phrases "Masala Dosa," "Tikka Masala," "Guddu de Karahi," "Naan-N-Curry," and "Noori Pakistani & Indian Cuisine," the correct transcription "Guddu de Karahi" of the acoustic span is easily obtained"); 
analyzing the revised-transcription portion, using a textual analyzer, to determine a probability of the revised-transcription portion is correct (Paragraphs [0084], [0135]-[0139], [0153]-[0154], [0173], [0210]-[0215], [0315]-[0327] and Figure 39); 
in response to the probability being below a threshold, constructing a phoneme sequence of an audio segment corresponding to the first transcribed portion based on at least on a reward function (Paragraphs [0084], [0153]-[0154], [0173], [0210]-[0215], [0252]-[0257], [0315]-[0327] and Figure 39 - "secondary recognizer may generate an ASR confidence score for its output, and may be operated in "n-best mode" to generate up to a given number n of distinct outputs, each of which may bear an associated ASR confidence score"; "Proper selection of the span extent may improve secondary decoding"; "This may make the indicated secondary recognition less sensitive to the nominal start and end of the full span, and the secondary recognition can then choose freely just where the proper name entity itself begins and ends within the full span. Moreover, by properly structuring the grammar, it can be arranged that the audio corresponding to the prefix words immediately preceding the putative proper name entity words, and likewise the suffix words immediately following, can be absorbed into well-matching words within the active grammar"); 
creating a new audio waveform based at least on the constructed phoneme sequence (Paragraphs [0084], [0153]-[0154], [0173], [0210]-[0215], [0252]-[0257], [0315]-[0327] and Figure 39 - "secondary recognizer may generate an ASR confidence score for its output, and may be operated in "n-best mode" to generate up to a given number n of distinct outputs, each of which may bear an associated ASR confidence score"; "Proper selection of the span extent may improve secondary decoding"; "This may make the indicated secondary recognition less sensitive to the nominal start and end of the full span, and the secondary recognition can then choose freely just where the proper name entity itself begins and ends within the full span. Moreover, by properly structuring the grammar, it can be arranged that the audio corresponding to the prefix words immediately preceding the putative proper name entity words, and likewise the suffix words immediately following, can be absorbed into well-matching words within the active grammar"); and 
generating a new transcription using a transcription engine based on the new audio waveform (Paragraphs [0135]-[0139] - "With these secondary recognition results in hand, the acoustic span transcription "Guddu de Karahi" may be interpolated into the primary recognizer’s transcription, replacing the word sequence "go to do a call Rocky" that was initially guessed for this span, thereby yielding a final transcription").
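For illustration only, the identifying and replacing steps discussed above can be pictured with the following minimal Python sketch. It is not taken from Printz or from the application: the names Portion, low_confidence_indices and splice, the 0-to-1 confidence scale, the 0.5 threshold and the confidence values are all assumptions chosen for the example. Only the two phrases are borrowed from the passage of Printz quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Portion:
    """One transcribed portion with an engine-reported confidence (assumed 0..1)."""
    text: str
    confidence: float


def low_confidence_indices(portions: List[Portion], threshold: float) -> List[int]:
    """Return the indices of portions whose confidence is below the threshold."""
    return [i for i, p in enumerate(portions) if p.confidence < threshold]


def splice(portions: List[Portion], index: int, revised: Portion) -> List[Portion]:
    """Return a new list in which the portion at `index` is replaced by `revised`."""
    return portions[:index] + [revised] + portions[index + 1:]


# Hypothetical data: the confidence values are invented for the example.
portions = [
    Portion("tell me how to get to", 0.93),
    Portion("go to do a call Rocky", 0.21),
]
flagged = low_confidence_indices(portions, threshold=0.5)
final = splice(portions, flagged[0], Portion("Guddu de Karahi", 0.88))
```

The sketch keeps the original list unchanged and returns a new one, so the initial and final transcriptions can be compared side by side.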

As per claim 3, Printz discloses all of the limitations of claim 1 above. Printz further discloses:
training a machine learning model using a training data set that includes an audio segment corresponding to the first transcribed portion; 
identifying, by a transcription analyzer, a second transcribed portion having a third confidence value below a second predetermined threshold from the one or more transcribed portions; and 
using the trained machine learning model, re-transcribing a second audio segment of the media file that corresponds with the second transcribed portion (Paragraphs [0116]-[0124], [0135]-[0139], [0200], [0206], [0240]-[0242], [0315]-[0327]).

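As per claim 4, Printz discloses all of the limitations of claim 1 above. Printz further discloses:
identifying one or more transcription errors in one or more parts of the first transcribed portion; 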
correcting the identified one or more transcription errors in the one or more parts; and 
labelling the one or more corrected transcription errors in the one or more parts (Paragraphs [0125], [0271]-[0275], [0297], [0301]-[0310], [0315]-[0327]).

As per claim 6, Printz discloses all of the limitations of claim 1 above. Printz further discloses:
generating a string of cumulants comprising of one or more transcription portions preceding and following the low confidence of accuracy portion, wherein the constructed phoneme sequence is based at least on the string of cumulants; and 
generating a reward function based at least on one or more characteristics of the transcription engine (Paragraphs [0084], [0153]-[0154], [0173], [0200], [0206], [0210]-[0215], [0236]-[0239], [0252]-[0257], [0315]-[0327] and Figure 39).

As per claim 7, Printz discloses all of the limitations of claim 6 above. Printz further discloses:
wherein generating the reward function comprises learning characteristics of the transcription engine by computing a Shannon entropy (Paragraphs [0079], [0084], [0124]-[0126] - "Other than the general descriptions provided above, or in the teachings of the invention and its embodiments as found below, we make no further stipulation regarding the internal structure of the primary and secondary recognizers. They may utilize any of the internal structures, computational methods, designs, strategies or techniques as may be appropriate to performing automatic speech recognition, for instance as may be found in books like Automatic Speech Recognition A Deep Learning Approach, by Dong Yu and Li Deng, published by Springer-Verlag London, ISBN 1860-4862, ISBN 1860-4870, ISBN 978-1-4471-5778-6, ISBN 978-1-4471-5779-3, or in any reference found therein. Notably but way of example only and not by way of restriction this may include mel-frequency cepstral coefficients, linear predictive coding (LPC) coefficients, maximum likelihood linear regression, acoustic models, Gaussian mixture models and observation likelihoods computed therefrom, neural networks including deep neural networks, recurrent neural networks, convolutional neural networks, LSTM networks, and excitation, activation or output values associated thereto, language models, Hidden Markov models, n-gram models, maximum entropy models, hybrid architecture, tandem architecture, and any other appropriate value, method or architecture").
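For reference, the quantity named in this limitation has the textbook definition H = -sum(p_i * log2(p_i)) over a discrete probability distribution. The Python sketch below computes it in bits. It is offered only as an illustration of the general formula: the function name shannon_entropy is hypothetical, and neither the claim language nor the quoted passage of Printz specifies which distribution would be supplied to it.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy, in bits, of a discrete probability distribution.

    Terms with zero probability contribute nothing, following the usual
    convention that 0 * log(0) is taken to be 0.
    """
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probs if p > 0)


# A uniform distribution over four outcomes carries two bits of uncertainty.
uniform_four = shannon_entropy([0.25, 0.25, 0.25, 0.25])
```

A distribution concentrated on a single outcome yields an entropy of zero, and the value grows as the probability mass is spread more evenly.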

As per claim 10, Printz discloses: 
a system for transcription, the system comprising: 
a memory; 
one or more processors coupled to the memory (Paragraph [0376] and Figure 42 - "computing system 1800 may include one or more central processing units ("processors") 1805, memory"), the one or more processors configured to: 
receive, from a first transcription engine, one or more transcribed portions of a media file (Paragraphs [0116]-[0124]); 
identify, by a transcription analyzer of the conductor, a first transcribed portion, from the one or more transcribed portions, with a confidence value below a predetermined threshold (Paragraphs [0116]-[0124], [0135]-[0139], [0200], [0206]); 
request analysis of a first audio segment corresponding to the first transcribed portion (Paragraphs [0135]-[0139], [0222]); 
receive, in response to request for analysis, an analysis result having a revised-transcription portion of the first audio segment, wherein the revised-transcription portion comprises one or more segments of the first transcribed portion that have been revised (Paragraphs [0135]-[0139]); 
analyze the revised-transcription portion, using a textual analyzer, to determine a probability of the revised-transcription portion is correct (Paragraphs [0084], [0135]-[0139], [0153]-[0154], [0173], [0210]-[0215], [0315]-[0327] and Figure 39); 
in response to the probability being below a threshold, construct a phoneme sequence of an audio segment corresponding to the first transcribed portion based on at least on a reward function (Paragraphs [0084], [0153]-[0154], [0173], [0210]-[0215], [0252]-[0257], [0315]-[0327] and Figure 39 - "secondary recognizer may generate an ASR confidence score for its output, and may be operated in "n-best mode" to generate up to a given number n of distinct outputs, each of which may bear an associated ASR confidence score"; "Proper selection of the span extent may improve secondary decoding"; "This may make the indicated secondary recognition less sensitive to the nominal start and end of the full span, and the secondary recognition can then choose freely just where the proper name entity itself begins and ends within the full span. Moreover, by properly structuring the grammar, it can be arranged that the audio corresponding to the prefix words immediately preceding the putative proper name entity words, and likewise the suffix words immediately following, can be absorbed into well-matching words within the active grammar");
create a new audio waveform based at least on the constructed phoneme sequence (Paragraphs [0084], [0153]-[0154], [0173], [0210]-[0215], [0252]-[0257], [0315]-[0327] and Figure 39 - "secondary recognizer may generate an ASR confidence score for its output, and may be operated in "n-best mode" to generate up to a given number n of distinct outputs, each of which may bear an associated ASR confidence score"; "Proper selection of the span extent may improve secondary decoding"; "This may make the indicated secondary recognition less sensitive to the nominal start and end of the full span, and the secondary recognition can then choose freely just where the proper name entity itself begins and ends within the full span. Moreover, by properly structuring the grammar, it can be arranged that the audio corresponding to the prefix words immediately preceding the putative proper name entity words, and likewise the suffix words immediately following, can be absorbed into well-matching words within the active grammar"); and 
generate a new transcription using a transcription engine based on the new audio waveform (Paragraphs [0135]-[0139]).

As per claim 11, Printz discloses all of the limitations of claim 10 above. Printz further discloses:
wherein the one or more processors, after identifying the first transcribed portion and before requesting analysis on the first transcribed portion, are further configured to: 
send the first audio segment to a plurality of transcription engines; 
receive successive transcribed portions from the plurality of transcription engines; and 
(Paragraphs [0084], [0135]-[0139], [0153]-[0154], [0173], [0210]-[0215], [0315]-[0327] and Figure 39).

As per claim 12, Printz discloses all of the limitations of claim 11 above. Printz further discloses:
wherein the one or more processors are further configured to: 
train a machine learning model using a training data set from the low-confidence database; 
identify, by a transcription analyzer, a second transcribed portion having a third confidence value below a second predetermined threshold from the one or more transcribed portions; and 
using the trained machine learning model, re-transcribe a second audio segment of the media file that corresponds with the second transcribed portion (Paragraphs [0116]-[0124], [0135]-[0139], [0200], [0206], [0240]-[0242], [0315]-[0327]).

As per claim 13, Printz discloses all of the limitations of claim 11 above. Printz further discloses:
a ground truth engine configured to: 
identify one or more transcription errors in one or more parts of the first transcribed portion; 
correct the identified one or more transcription errors in the one or more parts; and 
label the one or more corrected transcription errors in the one or more parts (Paragraphs [0125], [0271]-[0275], [0297], [0301]-[0310], [0315]-[0327]).

As per claim 14, Printz discloses all of the limitations of claim 10 above. Printz further discloses:
wherein request analysis on the first transcribed portion further comprises instructions that cause the one or more processor to: 
construct a phoneme sequence of an audio segment corresponding to the first transcribed portion based on at least on a reward function; 
create a new audio waveform based at least on the constructed phoneme sequence; and 
generate a new transcription using a transcription engine based on the new audio waveform (Paragraphs [0084], [0153]-[0154], [0173], [0210]-[0215], [0252]-[0257], [0315]-[0327] and Figure 39).

As per claim 15, Printz discloses all of the limitations of claim 10 above. Printz further discloses:
wherein the one or more processors are further configured to: 
generate a string of cumulants comprising of one or more transcription portions preceding and following the low confidence of accuracy portion, wherein the constructed phoneme sequence is based at least on the string of cumulants; and 
generate a reward function based at least on one or more characteristics of the transcription engine (Paragraphs [0084], [0153]-[0154], [0173], [0200], [0206], [0210]-[0215], [0236]-[0239], [0252]-[0257], [0315]-[0327] and Figure 39).

As per claim 16, Printz discloses all of the limitations of claim 15 above. Printz further discloses:
wherein generate the reward function comprises learning characteristics of the transcription engine by computing a Shannon entropy (Paragraphs [0079], [0084], [0124]-[0126]).

As per claim 19, Printz discloses: 
a method for transcription, the method comprising: 
receiving one or more transcribed portions of a media file (Paragraphs [0116]-[0124]); 
determining a confidence of accuracy value for each of the one or more transcribed portions (Paragraphs [0116]-[0124], [0200], [0206]); 
identifying a first transcribed portion that has a first confidence value below a predetermined threshold (Paragraphs [0116]-[0124], [0135]-[0139], [0200], [0206]); 
constructing a phoneme sequence of an audio segment corresponding to the first transcribed portion based on at least on a reward function (Paragraphs [0084], [0153]-[0154], [0173], [0210]-[0215], [0252]-[0257], [0315]-[0327] and Figure 39 - "secondary recognizer may generate an ASR confidence score for its output, and may be operated in "n-best mode" to generate up to a given number n of distinct outputs, each of which may bear an associated ASR confidence score"; "secondary recognitions may be performed for each of the acoustic spans identified by the prior decoding stages, until a final transcription is obtained for the whole of the original utterance. If no competing alternative meaning hypotheses were proposed by the prior processing steps, then the decoding is complete. However, this may not always be the case. More likely, several alternative transcriptions, each with one or more associated meaning hypotheses, may have been generated, each hypothesis having NLU and ASR confidence scores. It remains to select the final preferred decoding, or at a minimum, assign a confidence score to each whole decoding, and provide a ranked list of alternatives"; "confidence values may be lower than a threshold, e.g., 300, indicating an incorrect association. In this example, "goose" mismatches "Guddu" and "karate" mismatches "Karahi" as the words are superficially similar. Accordingly, the 110 and 150 confidence levels reflect an unlikely match (e.g., because the spectral character of the waveform doesn't agree with the expected character of the phonemes in these words). However, if no better proper name match is found for the proposals 1115b-c, the system may accept this interpretation by default"; "By virtue of the phoneme loop within the left shim, as shown in FIG. 32, the same audio segment may be matched against the phoneme sequence"); 
creating a new audio waveform based at least on the constructed phoneme sequence (Paragraphs [0210]-[0215], [0315]-[0327] - "If the words adjacent to the placeholder are decoded with notably low confidence scores-or if an initial decoding of a given audio segment by the secondary recognizer yields an anomalously low confidence score-some embodiments perturb the nominal start and end times of the extracted audio segment, thereby producing multiple candidate segments for decoding. All of these may then be passed as variants to "Secondary Recognition" 915, which can decode them all and select the decoding with the highest confidence score as the nominal answer"; "each of these arcs bridges a portion of the waveform"); 
generating a new transcription using a transcription engine based on the new audio waveform; and 
replacing the first transcribed portion with the new transcription (Paragraphs [0315]-[0327] and Figure 33 - "traversing the phoneme loop yields a decoding that comprises a sequence of phonemes, rather than conventional words in the target language. In the example of FIG. 33, concatenating the tokens on the decoding path yields the final transcription"; "One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment "ER you coming tonight" yields "are you coming tonight." Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence").
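For illustration only, the boundary perturbation described in the passage of Printz quoted above (perturbing the nominal start and end times of the extracted audio segment, decoding every candidate, and selecting the decoding with the highest confidence score) can be sketched in Python as follows. The sketch is not Printz's implementation: candidate_segments, best_decoding and stub_decode are hypothetical names, the offsets are arbitrary, and stub_decode is a stand-in that merely scores segments by closeness to an assumed true span so that the example runs without a recognizer.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[float, float]      # (start_sec, end_sec)
Decoding = Tuple[str, float]       # (text, confidence)


def candidate_segments(start: float, end: float,
                       offsets: Sequence[float]) -> List[Segment]:
    """Perturb the nominal start and end times by every pair of offsets."""
    candidates = []
    for d_start, d_end in product(offsets, repeat=2):
        s, e = max(0.0, start + d_start), end + d_end
        if s < e:  # discard empty or inverted segments
            candidates.append((s, e))
    return candidates


def best_decoding(segments: Sequence[Segment],
                  decode: Callable[[Segment], Decoding]) -> Tuple[Segment, str, float]:
    """Decode every candidate and keep the one with the highest confidence."""
    scored = [(seg, *decode(seg)) for seg in segments]
    return max(scored, key=lambda item: item[2])


def stub_decode(segment: Segment) -> Decoding:
    """Hypothetical recognizer: confidence falls off with distance from (1.2, 2.3)."""
    start, end = segment
    return "Guddu de Karahi", 1.0 - (abs(start - 1.2) + abs(end - 2.3))


candidates = candidate_segments(1.0, 2.5, offsets=(-0.2, 0.0, 0.2))
chosen_segment, chosen_text, chosen_confidence = best_decoding(candidates, stub_decode)
```

With three offsets the sketch produces nine candidate segments, and the selection step is a plain maximum over the confidence values returned by whatever decoder is supplied.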

As per claim 20, Printz discloses all of the limitations of claim 19 above. Printz further discloses:
wherein generating the new transcription further comprises: 
sending the new audio waveform to a plurality of transcription engines; 
receiving transcription results from the plurality of transcription engines; and 
replacing the first transcribed portion with one of the transcription results based on a second confidence value, wherein each transcription result includes a confidence value (Paragraphs [0116]-[0124], [0135]-[0139], [0205]-[0215], [0315]-[0327]).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 8, 9, 17 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Printz (U.S. Patent Application Publication 2018/0068661) in view of Lakshmanan et al. (U.S. Patent Application Publication 2012/0323827).
As per claims 8 and 17, Printz discloses all of the limitations of claims 6 and 15 above. Printz fails to disclose:
generating the reward function comprises solving a Bellman equation using backward induction.
However, Lakshmanan et al. in the same field of endeavor teaches:
generating the reward function comprises solving a Bellman equation using backward induction (Paragraphs [0093]-[0098], [0101]-[0107] - "Consider the extended state space Markov Chain (eMC). Let "S" be set of reachable states and T be the transition matrix"; "There are a total of n=|S| equations and "n" variables (g(s)), thus it is possible to compute g(s) exactly by solving following system of linear equation"; "expressed as a Bellman equation, recursive step-by-step form with a value function in one period and a value function in the next period").
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to generate the reward function by solving a Bellman equation using backward induction, as taught by Lakshmanan et al., so that the systems and methods for automatic recognition and understanding of fluent, natural human speech taught by Printz are capable of improved calculation of probabilistic distributions, optimal decisioning and prediction values. Printz and Lakshmanan et al. are both directed to probabilistic content processing in conjunction with machine learning.
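For reference, backward induction on a finite-horizon Bellman equation starts from the terminal period and works toward the first, computing V_t(s) = max over a of [ R(a, s) + sum over s' of P(a, s, s') * V_(t+1)(s') ] with V_T(s) = 0. The Python sketch below is a generic dynamic-programming illustration of that recursion. It does not reproduce the extended state space Markov chain of Lakshmanan et al. or the claimed reward function, and the function name backward_induction and the two-state transition and reward tables are made up for the example.

```python
from typing import List, Tuple


def backward_induction(P: List[List[List[float]]],
                       R: List[List[float]],
                       horizon: int) -> Tuple[List[float], List[List[int]]]:
    """Solve a finite-horizon Bellman equation by backward induction.

    P[a][s][t] is the probability of moving from state s to state t under
    action a, and R[a][s] is the immediate reward for taking a in s.
    Returns the period-0 values and the optimal action per period and state.
    """
    n_actions, n_states = len(P), len(P[0])
    value = [0.0] * n_states              # terminal condition V_T(s) = 0
    policy: List[List[int]] = []
    for _ in range(horizon):              # walk from the last period to the first
        new_value, step_policy = [], []
        for s in range(n_states):
            q = [R[a][s] + sum(P[a][s][t] * value[t] for t in range(n_states))
                 for a in range(n_actions)]
            best = max(range(n_actions), key=q.__getitem__)
            new_value.append(q[best])
            step_policy.append(best)
        value = new_value
        policy.insert(0, step_policy)     # earlier periods go to the front
    return value, policy


# Hypothetical toy problem: action 0 stays put, action 1 switches state,
# and only staying in state 1 pays a reward of 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0],
     [0.0, 0.0]]
values, plan = backward_induction(P, R, horizon=3)
```

In the toy problem the computed plan switches out of state 0 while periods remain and stays in state 1 throughout, which is the behavior one would expect from the reward table.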

As per claims 9 and 18, the combination of Printz and Lakshmanan et al. discloses all of the limitations of claims 8 and 17 above. Lakshmanan et al. further discloses:
the Bellman equation comprises a Dempster Shafer possibility transition matrix (Paragraphs [0093]-[0098], [0101]-[0107]).

Response to Arguments
Applicant’s arguments, see page 7, filed 1/22/2021, with respect to the rejection of claims 6 and 9-12 under 35 U.S.C. 112(b) have been fully considered and are persuasive.  The rejection of claims 6 and 9-12 under 35 U.S.C. 112(b) has been withdrawn. 
Applicant’s arguments, see pages 7-8, filed 1/22/2021, with respect to the rejection(s) of claim(s) 1-7, 10-16 and 19-20 under 35 U.S.C. 102 have been fully considered and are not persuasive.  Since Printz varies the boundaries of the audio segment based on the confidence value of the transcription, it creates a new audio waveform based at least on the constructed phoneme sequence.
 
Examiner Notes
The Examiner cites particular columns and line numbers in the references as applied to the claims above for the convenience of the Applicant.  Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well.  It is respectfully requested that, in preparing responses, the Applicant fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or as disclosed by the Examiner. 
Communications via Internet e-mail are at the discretion of the applicant and require written authorization. Should the Applicant wish to communicate via e-mail, including the following paragraph in their response will allow the Examiner to do so:
“Recognizing that Internet communications are not secure, I hereby authorize the USPTO to communicate with me concerning any subject matter of this application by electronic mail. I understand that a copy of these communications will be made of record in the application file.”
Should e-mail communication be desired, the Examiner can be reached at Edwin.Leland@USPTO.gov

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EDWIN S LELAND III whose telephone number is (571)270-5678.  The examiner can normally be reached on 8:00 - 5:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tammy Goddard can be reached on (571) 272-7773.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 






/EDWIN S LELAND III/
Primary Examiner, Art Unit 2677