DETAILED ACTION


Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.


Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 2, 4-7, and 10 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Hazen (Automatic Alignment and Error Correction of Human Generated Transcripts for Long Speech Recordings, 2006).


See the citations to Hazen:


identifying a plurality of non-aligned text fragments (section 3.2 ¶ 2), each located between respective two non-continuous aligned text-fragments of the automatically generated transcription, each aligned text-fragment matching a plurality of words of the manually generated transcription (section 3.2 ¶ 2); 
for each respective non-aligned text fragment:
mapping a target keyword of the manually generated transcription to a plurality of phonemes (anchor points, section 3.2, ¶ 2); 
mapping the respective non-aligned text fragment to a corresponding audio-fragment of the audio recording (After obtaining anchor points from the first stage recognition, the second stage produces a pseudo-forced alignment of the manual transcript across the speech segments spanning between the first stage anchor points, Section 3.3); 
mapping the audio-fragment to a plurality of phonemes (alignment using a phonetic word filler model, Section 3.3, the alignment is on a phonetic model as taught Section 3.3); 
identifying at least some of the plurality of phonemes of the audio-fragment that correspond to the plurality of phonemes of the target keyword (the transcript is fully aligned against the speech, Section 3.4); and 

2. The method of claim 1, wherein the at least some of the plurality of phonemes of the audio-fragment are identified as corresponding to the plurality of phonemes of the target keyword according to a closest matched computed based on shortest phoneme weighted distance (The rates of the insertions, substitutions and deletions can be controlled using penalty weights to insure that correctly transcribed words are rarely replaced or deleted, Section 3.3 – here the penalty weights of the phonetic-based alignment are interpreted as equivalent to a distance, in that the lowest penalty corresponds to the shortest distance).
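
As a non-authoritative illustration of the interpretation above (penalty weights treated as a distance, with the closest match being the candidate of shortest distance), the following Python sketch computes a weighted edit distance over phoneme sequences. It is not taken from Hazen or from the application; the function names, the ARPAbet-style phoneme symbols, the similarity table, and the numeric penalties are all assumptions chosen for the example.

```python
# Hypothetical table of phoneme pairs treated as acoustically similar (assumption).
SIMILAR_PAIRS = {frozenset(("F", "TH")), frozenset(("S", "Z"))}


def weighted_cost(p, q):
    """Substitution penalty: 0 for identical phonemes, reduced for similar ones."""
    if p == q:
        return 0.0
    return 0.4 if frozenset((p, q)) in SIMILAR_PAIRS else 1.0


def weighted_phoneme_distance(source, target, sub_cost,
                              ins_penalty=1.0, del_penalty=1.0):
    """Weighted edit distance between two phoneme sequences (dynamic programming)."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * del_penalty
    for j in range(1, cols):
        dist[0][j] = j * ins_penalty
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + del_penalty,      # delete a source phoneme
                dist[i][j - 1] + ins_penalty,      # insert a target phoneme
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return dist[-1][-1]


def closest_candidate(keyword_phonemes, candidates, sub_cost):
    """Return the candidate phoneme sequence with the shortest weighted distance."""
    return min(
        candidates,
        key=lambda c: weighted_phoneme_distance(c, keyword_phonemes, sub_cost),
    )
```

Under these assumed penalties, a candidate that differs from the keyword by one similar-sounding substitution scores 0.4, while a candidate that is missing one phoneme scores 1.0, so the former is selected as the closest match.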

4. The method of claim 1, wherein the matching comprises selecting the at least some of the plurality of phonemes of the audio-fragment, according a lowest value of a phoneme distance to the plurality of phonemes of the keyword of the manually generated transcript ().
5. The method of claim 4, wherein the phoneme distance is selected from the group consisting of: a binary phoneme distance that assigns a binary value indicative of whether each respective phoneme is matched or is not matched (see Fig. 2 where the wA: wA and wB: wB are matched), and a weighted phoneme distance that assigns a non-binary value indicative of an amount of similarity between corresponding phonemes (see the penalty weights, Section 3.3).
6. The method of claim 1, further comprising feeding the target keyword of the manually generated transcription and the corresponding word of the automatically generated transcription for automatically updating a model that computes the automatically generated transcript (the ASR model is initially trained using the transcript, Section 3.2 ¶ 1).  
7. The method of claim 1, wherein the target keyword of the manually generated transcription and the corresponding word of the automatically generated transcription are used for adjusting the model for correctly automatically transcribing an audio-fragment corresponding to the audio-fragment of the audio recording to the target keyword of the manually generated transcription instead of to the corresponding word of the automatically generated transcript (see Utterance 1 in Table 4 for example, where the manual transcript is used to edit the ASR result).  
10. The method of claim 1, wherein each of the plurality of aligned text-fragments includes a sequence of at least 4 matching words (see utterance 1 and 2 in Table 4).
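
The identification of non-aligned text fragments located between aligned fragments of at least 4 matching words, as mapped above, can be pictured with the following Python sketch. It is offered only as an illustration under stated assumptions and does not reproduce Hazen's recognizer-based first stage: it substitutes the standard-library difflib.SequenceMatcher for the anchor search, and the function name find_non_aligned_fragments and the parameter min_anchor_len are hypothetical.

```python
from difflib import SequenceMatcher


def find_non_aligned_fragments(auto_words, manual_words, min_anchor_len=4):
    """Return (auto_fragment, manual_fragment) pairs lying between anchor runs.

    An anchor is a run of at least `min_anchor_len` consecutive words that is
    identical in the automatic and the manual transcription.
    """
    matcher = SequenceMatcher(a=auto_words, b=manual_words, autojunk=False)
    # Keep only matching runs long enough to serve as anchors; this also
    # drops the zero-length sentinel block that difflib appends.
    anchors = [m for m in matcher.get_matching_blocks() if m.size >= min_anchor_len]
    fragments = []
    for left, right in zip(anchors, anchors[1:]):
        auto_gap = auto_words[left.a + left.size:right.a]
        manual_gap = manual_words[left.b + left.size:right.b]
        if auto_gap or manual_gap:
            fragments.append((auto_gap, manual_gap))
    return fragments


auto = "we will now look at the fast furrier transform of the input signal today".split()
manual = "we will now look at the fast fourier transform of the input signal today".split()
print(find_non_aligned_fragments(auto, manual))
# prints [(['furrier'], ['fourier'])]
```

In this toy example the seven-word and six-word matching runs act as anchors, and the single mismatched word between them is returned as the non-aligned fragment together with the manual-transcript word it should be compared against.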


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
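The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.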


Claims 3, 8, 9, 11, 13, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Hazen (Automatic Alignment and Error Correction of Human Generated Transcripts for Long Speech Recordings, 2006) in view of Park et al. (Automatic Processing Of Audio Lectures For Information Retrieval: Vocabulary Selection And Language Modeling, 2005).

Regarding claim 3, Hazen does not teach, but Park teaches, wherein words of the automatically generated transcription are associated with a timestamp indicating a mapping to the audio recording, and wherein the respective non-aligned text fragment is mapped to the corresponding audio-fragment according to the timestamp (Section 2, ¶ 2).
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies.



Regarding claim 8, Hazen does not teach, but Park teaches, further comprising computing a value for precision and/or recall of transcription of the target keyword of the manually generated transcription in the automatically generated transcript (section 5, ¶ 3).
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies.
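
For reference, the precision and recall recited in claim 8 can be computed for a single target keyword as in the following Python sketch. This is a minimal illustration under assumptions and is not drawn from Park: it presumes the two transcripts have already been aligned word by word into (manual, automatic) pairs, and the function name keyword_precision_recall is hypothetical.

```python
def keyword_precision_recall(aligned_pairs, keyword):
    """Precision and recall of `keyword` over aligned (manual, automatic) word pairs."""
    # True positive: keyword in the manual transcript and transcribed correctly.
    tp = sum(1 for ref, hyp in aligned_pairs if ref == keyword and hyp == keyword)
    # False positive: keyword output automatically where the manual word differs.
    fp = sum(1 for ref, hyp in aligned_pairs if ref != keyword and hyp == keyword)
    # False negative: keyword in the manual transcript but transcribed as another word.
    fn = sum(1 for ref, hyp in aligned_pairs if ref == keyword and hyp != keyword)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


pairs = [
    ("fourier", "fourier"),
    ("fourier", "furrier"),
    ("the", "the"),
    ("courier", "fourier"),
    ("fourier", "fourier"),
]
precision, recall = keyword_precision_recall(pairs, "fourier")
```

With the sample pairs above there are two true positives, one false positive, and one false negative, so both precision and recall equal 2/3.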


Regarding claim 9, Hazen does not teach, but Park teaches, wherein the automatically generated transcription is created by an acoustic model that extracts phonemes from the audio recording and assigns a probability value to each phoneme denoting likelihood of accurate extraction (section 5.1), and a language model that receives the extracted phonemes and outputs the automatically generated transcription by mapping phonemes to words and determines a word sequence probability (section 5.2).
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies.

Regarding claim 11, Hazen teaches a computer implemented method of evaluating quality of an automatically generated transcription of an audio recording (the automatic method steps taught in Section 3 inherently require a computer), comprising:
receiving an audio recording (Section 3.2 ¶ 1);
for each respective word selected from the lexicon, computing a phoneme distance between phonemes extracted from a portion of the acoustic recording corresponding to the respective selected word and phonemes of the respective selected word (see the phonetic-based penalty weights in Section 3.3, which are interpreted as equivalent to a distance measure); and 
generating an indication of likelihood of an error of the respective selected word when the computed phoneme distance is above a threshold (the penalty weight will determine an insertion, deletion, or substitution and indicate it to the next step of the FST as taught in Section 3.3 and Fig. 1), the error indicative of at least one of: no correct word corresponding to the phonemes extracted from the portion 
Hazen does not teach but Park teaches computing the automatically generated transcription of the audio recording by an acoustic model that extracts phonemes from the audio recording and a language model that receives the extracted phonemes and outputs the automatically generated transcription by mapping phonemes to words selected from a lexicon, wherein each respective word is assigned a respective confidence value (sections 5.1 and 5.2).
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies.


Regarding claim 13, Hazen does not teach but Park teaches wherein the respective confidence value of the respective word selected from the lexicon indicative of likelihood of error denotes the most likely match within the lexicon (best path, Section 5, ¶ 2).
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies.
 
Regarding claim 14, Hazen teaches a computer implemented method of post-processing an automatically generated transcription of an audio recording to correct transcription errors (the automatic method steps taught in Section 3 inherently require a computer), comprising: 

Hazen does not teach but Park teaches computing the automatically generated transcription of the audio recording by an acoustic model that extracts phonemes from the audio recording, and a language model that receives the extracted phonemes and outputs the automatically generated transcription by mapping phonemes to words selected from a lexicon (sections 5.1 and 5.2); 
receiving a plurality of target words (sections 5.1 and 5.2);
computing a respective weighted phoneme distance that assigns a non-binary value indicative of an amount of similarity between corresponding phonemes, from an automatically transcribed word of the automatically generated transcription to each of the plurality of target words, and when the respective phoneme distance is according to a requirement, switching the respective automatically transcribed word to a certain target word of the plurality of target words corresponding to a lowest value of the respective phoneme distance (see the function of the diphone acoustic model as in section 5.1).
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies.
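
The switching step recited in claim 14 can be pictured with the short Python sketch below. It is an illustration only and is not taken from Hazen or Park: it assumes the weighted phoneme distances from the transcribed word to each target word have already been computed, and it assumes, purely for the example, that the "requirement" is a distance greater than zero and no greater than a chosen bound. The names switch_word and max_distance are hypothetical.

```python
def switch_word(transcribed_word, distances, max_distance=0.5):
    """Return the target word of lowest phoneme distance if the requirement is met.

    `distances` maps each target word to its precomputed phoneme distance from
    `transcribed_word`. The requirement used here (0 < distance <= max_distance)
    is an assumption made for illustration.
    """
    if not distances:
        return transcribed_word
    best_target = min(distances, key=distances.get)
    best_distance = distances[best_target]
    if 0.0 < best_distance <= max_distance:
        return best_target       # similar but not identical: switch
    return transcribed_word      # identical or too dissimilar: keep as transcribed
```

For example, switch_word("furrier", {"fourier": 0.4, "courier": 0.8}) returns "fourier", whereas a word whose nearest target lies beyond the bound is left unchanged.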


Claims 12 and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Hazen (Automatic Alignment and Error Correction of Human Generated Transcripts for Long Speech Recordings, 2006) in view of Park et al. (Automatic Processing Of Audio Lectures For Information Retrieval: Vocabulary Selection And Language Modeling, 2005) and further in view of Gurbani et al. (US 2021/0142789 – see the filing date of provisional application 62/932949).



It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies with Gurbani’s word disambiguation resolution to improve prediction of low confidence areas of transcription.


Regarding claim 16, Hazen and Park do not teach but Gurbani teaches wherein the requirement denotes that the automatically transcribed word is similar to but not identical to the plurality of target words ([0076]).  
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies with Gurbani’s word disambiguation resolution to improve prediction of low confidence areas of transcription.

Regarding claim 17, Hazen and Park do not teach but Gurbani teaches wherein the requirement is a range having an upper threshold value of the phoneme distance denoting identical words and a lower threshold value of the phoneme distance denoting similar but different words (see the threshold values with respect to phonetic distance, [0078]).
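It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies with Gurbani’s word disambiguation resolution to improve prediction of low confidence areas of transcription.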

Regarding claim 18, Hazen and Park do not teach but Gurbani teaches wherein the respective automatically transcribed word and an indication of a switch to the certain target word are used to update the language model for improved accuracy in mapping phonemes to the certain target word (updating of the verification dataset to update the parametric values of the model [0097-0098]).  
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies with Gurbani’s word disambiguation resolution to improve prediction of low confidence areas of transcription.

Regarding claim 19, Hazen and Park do not teach but Gurbani teaches wherein the automatically transcribed word is selected for inclusion in the automatically generated transcription when the automatically transcribed word is assigned a confidence value by the language model above a threshold (see the combined threshold requirements as in [0078]).
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies with Gurbani’s word disambiguation resolution to improve prediction of low confidence areas of transcription.

Regarding claim 20, Hazen and Park do not teach but Gurbani teaches further comprising confirming the switching when a phoneme distance computed between phonemes extracted 
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies with Gurbani’s word disambiguation resolution to improve prediction of low confidence areas of transcription.


Allowable Subject Matter
Claim 15 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. See PTO-892. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Matthew H Baker whose telephone number is (571) 270-1856. The examiner can normally be reached Monday-Friday, 9 am to 5 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MATTHEW H BAKER/               Primary Examiner, Art Unit 2655