DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments filed with respect to claim 14 are convincing in view of the added amendment to the claim and the rejection is withdrawn.
Applicant’s argument that the target keyword as claimed is not the same as the anchor point disclosed by Hazen has been fully considered; however, clarification is necessary. Hazen discloses that the matching word sequences can be the anchor points (Section 3.2, see the matching sequences at the beginning or end of the utterances in Table 4). Thus, any word between the matching sequences is considered the target keyword (for example w1:w1 in Fig. 2). The words between the anchor points are considered the “non-aligned” text fragment. Applicant argues that Hazen does not teach mapping the target word to a plurality of phones; however, this is taught by the out of vocabulary (OOV) model in Section 3.3.
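For illustration only, the relationship between anchor points and non-aligned text fragments discussed above can be sketched as follows. This is not Hazen's implementation (Hazen obtains anchors from a first recognition pass); the sketch substitutes Python's standard `difflib.SequenceMatcher` for that step. The function name `find_anchors_and_gaps`, the parameter `min_anchor_len`, and the sample sentences are hypothetical.

```python
from difflib import SequenceMatcher


def find_anchors_and_gaps(auto_words, manual_words, min_anchor_len=4):
    """Return (anchors, gaps) for two word lists.

    anchors: (auto_start, manual_start, length) for each run of matching
             words that is at least min_anchor_len words long.
    gaps:    (auto_fragment, manual_fragment) word lists lying between
             consecutive anchors, i.e. the non-aligned text fragments.
    """
    matcher = SequenceMatcher(a=auto_words, b=manual_words, autojunk=False)
    anchors = [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_anchor_len  # also drops the zero-length sentinel block
    ]
    gaps = []
    for (a0, b0, n0), (a1, b1, _) in zip(anchors, anchors[1:]):
        auto_gap = auto_words[a0 + n0:a1]
        manual_gap = manual_words[b0 + n0:b1]
        if auto_gap or manual_gap:
            gaps.append((auto_gap, manual_gap))
    return anchors, gaps


# Hypothetical transcripts: one misrecognized word between two matching runs.
auto = "we will now talk about the fast furrier transform in this lecture today".split()
manual = "we will now talk about the fast fourier transform in this lecture today".split()
anchors, gaps = find_anchors_and_gaps(auto, manual)
```

In this example the two matching runs act as anchors and the single differing word pair is the non-aligned fragment whose manual-side word would be treated as the target keyword.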
Applicant argues that Hazen does not teach the “identifying…” step because identification at the phoneme level is not taught by Hazen. Hazen teaches phoneme level alignment with the OOV model in stage 2 (Section 3.3) and later passes the result to stage 3 (Section 3.4).
Applicant argues that Hazen does not teach the “mapping…” step because Hazen does not teach or suggest mapping at a phoneme level. The examiner disagrees because Hazen teaches, for the word sequence, “Instead, we assume that errors in the transcript are possible and we allow insertions of new words and substitutions for existing words through the use of a phonetic-based out-of-vocabulary (OOV) word filler model [3].” (Section 3.2).
Applicant argues that claim 11 as amended is not taught by the combination of references; however, the previous citation of Hazen still applies. The examiner disagrees that the penalty weights have “nothing to do with similarity between a spoken word in the recording and a transcribed word.” The rates of the insertions, substitutions, and deletions are applied to the transcription of the audio recording.
The examiner disagrees that Gurbani is inadmissible as prior art.  See especially Section IV step 4 of the provisional Specification for support of the cited portions of Gurbani.



Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 2, 4-7, and 10 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Hazen (Automatic Alignment and Error Correction of Human Generated Transcripts for Long Speech Recordings, 2006).


See the citations to Hazen:

1. A computer implemented method of aligning an automatically generated transcription of an audio recording to a manually generated transcription of the audio recording comprising: 
identifying a plurality of non-aligned text fragments (section 3.2 ¶ 2), each located between respective two non-continuous aligned text-fragments of the automatically generated transcription, each aligned text-fragment matching a plurality of words of the manually generated transcription (section 3.2 ¶ 2); 
for each respective non-aligned text fragment:

selecting a target keyword of the non-aligned text fragment of the manually generated transcription (see the transcript across the speech segments spanning between the first stage anchor points, section 3.3 – in this case each word is a target keyword, for example the first word such as w1 in Fig. 2); 
breaking down the target keyword of the manually generated transcription to a plurality of phonemes (anchor points, section 3.2, ¶ 2, also see Section 3.3 which allows for phonetic based out of vocabulary comparison); 
mapping the respective non-aligned text fragment that includes the target keyword and at least one non-target keyword to a corresponding audio-fragment of the audio recording (After obtaining anchor points from the first stage recognition, the second stage produces a pseudo-forced alignment of the manual transcript across the speech segments spanning between the first stage anchor points, Section 3.3, note that an out of vocabulary word can be aligned, this is interpreted as a non-target keyword); 
breaking down the audio-fragment to a plurality of phonemes (alignment using a phonetic word filler model, Section 3.3, which allows for phonetic based out of vocabulary comparison); 
identifying at least some of the plurality of phonemes of the audio-fragment that map to the plurality of phonemes of the target keyword (the transcript is fully aligned against the speech, Section 3.4); and 
mapping the identified at least some of the plurality of phonemes of the audio-fragment to a corresponding word of the automatically generated transcript, wherein the corresponding word is an incorrect automated transcription of the target word appearing in the manually generated transcription (After the second stage is complete, the transcript is fully aligned against the speech, and regions containing potential substitutions, deletions and insertions are marked, Section 3.4).  
2. The method of claim 1, wherein the at least some of the plurality of phonemes of the audio-fragment are identified as corresponding to the plurality of phonemes of the target keyword according to a closest matched computed based on shortest phoneme weighted distance (The rates of the insertions, substitutions and deletions can be controlled using penalty weights to insure that correctly transcribed words are rarely replaced or deleted, Section 3.3 – here the penalty weights of the phonetic based alignment are interpreted as a distance, where the lowest penalty corresponds to the shortest distance).  

4. The method of claim 1, wherein the matching comprises selecting the at least some of the plurality of phonemes of the audio-fragment, according a lowest value of a phoneme distance to the plurality of phonemes of the keyword of the manually generated transcript ().
5. The method of claim 4, wherein the phoneme distance is selected from the group consisting of: a binary phoneme distance that assigns a binary value indicative of whether each respective phoneme is matched or is not matched (see Fig. 2 where the wA:wA and wB:wB are matched), and a weighted phoneme distance that assigns a non-binary value indicative of an amount of similarity between corresponding phonemes (see the penalty weights, Section 3.3).  
6. The method of claim 1, further comprising feeding the target keyword of the manually generated transcription and the corresponding word of the automatically generated transcription for automatically updating a model that computes the automatically generated transcript (the ASR model is initially trained using the transcript, Section 3.2 ¶ 1).  
7. The method of claim 1, wherein the target keyword of the manually generated transcription and the corresponding word of the automatically generated transcription are used for adjusting the model for correctly automatically transcribing an audio-fragment corresponding to the audio-fragment of the audio recording to the target keyword of the manually generated transcription instead of to the corresponding word of the automatically generated transcript (see Utterance 1 in Table 4 for example, where the manual transcript is used to edit the ASR result).  
10. The method of claim 1, wherein each of the plurality of aligned text-fragments includes a sequence of at least 4 matching words (see utterance 1 and 2 in Table 4).  
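As a hedged illustration of the interpretation applied to claims 2, 4 and 5 above (penalty weights read as a distance, with the lowest penalty corresponding to the closest match), the sketch below computes a weighted edit distance between phoneme sequences and selects the audio span closest to the target keyword. It is a generic dynamic-programming sketch, not the FST-based method of Hazen and not the applicant's implementation. The function names, the cost values, and the ARPAbet-style phoneme labels are assumptions chosen for the example.

```python
def phoneme_distance(ref, hyp, sub_cost=None, ins_cost=1.0, del_cost=1.0):
    """Weighted edit distance between two phoneme sequences.

    With the default sub_cost this is a binary distance (match = 0,
    mismatch = 1); a custom sub_cost gives a weighted distance.
    """
    if sub_cost is None:
        sub_cost = lambda p, q: 0.0 if p == q else 1.0
    prev = [j * ins_cost for j in range(len(hyp) + 1)]
    for i, p in enumerate(ref, start=1):
        cur = [i * del_cost]
        for j, q in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + del_cost,           # phoneme of ref missing in hyp
                cur[j - 1] + ins_cost,        # extra phoneme in hyp
                prev[j - 1] + sub_cost(p, q)  # match or substitution
            ))
        prev = cur
    return prev[-1]


def closest_window(keyword_phones, audio_phones, **costs):
    """Return (start, end, distance) of the audio span nearest the keyword.

    Spans one phoneme shorter or longer than the keyword are also tried.
    Returns None when no span fits.
    """
    n = len(keyword_phones)
    best = None
    for width in range(max(1, n - 1), n + 2):
        for start in range(0, len(audio_phones) - width + 1):
            span = audio_phones[start:start + width]
            d = phoneme_distance(keyword_phones, span, **costs)
            if best is None or d < best[2]:
                best = (start, start + width, d)
    return best


# Hypothetical weighting: AO and ER are treated as similar (cost 0.5).
def similar(p, q):
    if p == q:
        return 0.0
    return 0.5 if {p, q} == {"AO", "ER"} else 1.0
```

Under these assumptions, `phoneme_distance(["F", "AO", "R"], ["F", "ER", "R"])` gives the binary distance, and passing `sub_cost=similar` gives the weighted distance for the same pair.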


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 3, 8, 9, 11, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Hazen (Automatic Alignment and Error Correction of Human Generated Transcripts for Long Speech Recordings, 2006) in view of Park et al. (Automatic Processing Of Audio Lectures For Information Retrieval: Vocabulary Selection And Language Modeling, 2005).

Regarding claim 3, Hazen does not teach but Park teaches wherein words of the automatically generated transcription are associated with a timestamp indicating a mapping to the audio recording, and wherein the respective non-aligned text fragment is mapped to the corresponding audio-fragment according to the timestamp (Section 2, ¶ 2).
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies.



Regarding claim 8, Hazen does not teach but Park teaches further comprising computing a value for precision and/or recall of transcription of the target keyword of the manually generated transcription in the automatically generated transcript (Section 5, ¶ 3).  
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies.
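For illustration only, a per-keyword precision and recall value of the kind recited in claim 8 can be sketched as follows. The sketch assumes a word-level alignment is already available as (manual word, automatic word) pairs, with `None` standing for an insertion or deletion. It uses the standard definitions of precision and recall and is not taken from Park or from the application; the function name and sample data are hypothetical.

```python
def keyword_precision_recall(aligned_pairs, keyword):
    """Precision and recall of one keyword over aligned word pairs.

    aligned_pairs: iterable of (manual_word, auto_word); either element
                   may be None for an unmatched position.
    Returns (precision, recall); a value is None when undefined.
    """
    tp = fp = fn = 0
    for manual_word, auto_word in aligned_pairs:
        in_manual = manual_word == keyword
        in_auto = auto_word == keyword
        if in_manual and in_auto:
            tp += 1   # keyword spoken and recognized
        elif in_auto:
            fp += 1   # keyword output where it was not spoken
        elif in_manual:
            fn += 1   # keyword spoken but missed
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


# Hypothetical aligned pairs (manual transcript word, automatic transcript word).
pairs = [
    ("fourier", "fourier"),
    ("fourier", "furrier"),
    ("transform", "transform"),
    ("for", "fourier"),
    ("fourier", "fourier"),
]
```

With these pairs, calling `keyword_precision_recall(pairs, "fourier")` counts two correct recognitions, one miss, and one false output of the keyword.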


Regarding claim 9, Hazen does not teach but Park teaches wherein the automatically generated transcription is created by an acoustic model that extracts phonemes from the audio recording and assigns a probability value to each phoneme denoting likelihood of accurate extraction (Section 5.1), and a language model that receives the extracted phonemes and outputs the automatically generated transcription by mapping phonemes to words and determines a word sequence probability (Section 5.2).  
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies.

Regarding claim 11, Hazen teaches a computer implemented method of evaluating quality of an automatically generated transcription of an audio recording (the automatic method steps taught in Section 3 inherently require a computer), comprising:
receiving an audio recording (Section 3.2 ¶ 1);
for each respective word selected from the lexicon, computing a phoneme distance between phonemes extracted from a portion of the acoustic recording corresponding to the respective selected word and phonemes of the respective selected word, wherein the phoneme distance quantifies similarity between at least one spoken word in the audio recording and the respective automatically transcribed word (see the phonetic based penalty weights in Section 3.3, which are interpreted as a distance measure); and 
generating an indication of likelihood of an error of the respective selected word when the computed phoneme distance is above a threshold (the penalty weight determines an insertion, deletion, or substitution and indicates it to the next step of the FST, as taught in Section 3.3 and Fig. 1), the error indicative of at least one of: no correct word corresponding to the phonemes extracted from the portion of the acoustic recording exists in the lexicon (deletion as in Section 3.3), and an error in the automated transcription of the phonemes extracted from the portion of the acoustic recording (substitution as in Section 3.3).
Hazen does not teach but Park teaches computing the automatically generated transcription of the audio recording by an acoustic model that extracts phonemes from the audio recording and a language model that receives the extracted phonemes and outputs the automatically generated transcription by mapping phonemes to words selected from a lexicon, wherein each respective word is assigned a respective confidence value (sections 5.1 and 5.2).
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies.
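As a hedged sketch of the thresholding recited in claim 11, the example below flags a transcribed word as a likely error when the distance between its lexicon pronunciation and the phonemes extracted from the corresponding audio exceeds a threshold. It uses a plain, length-normalized edit distance rather than Hazen's penalty weights. The function names, the threshold value of 0.4, and the phoneme sequences are assumptions made for the example.

```python
def edit_distance(ref, hyp):
    """Unweighted edit distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, p in enumerate(ref, start=1):
        cur = [i]
        for j, q in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,            # deletion
                cur[j - 1] + 1,         # insertion
                prev[j - 1] + (p != q)  # match (0) or substitution (1)
            ))
        prev = cur
    return prev[-1]


def flag_likely_errors(words, threshold=0.4):
    """Return (word, normalized_distance) for words above the threshold.

    words: iterable of (word, lexicon_phones, audio_phones), where
           lexicon_phones is the pronunciation of the transcribed word and
           audio_phones are the phonemes extracted from the audio portion.
    """
    flagged = []
    for word, lexicon_phones, audio_phones in words:
        dist = edit_distance(lexicon_phones, audio_phones) / max(len(lexicon_phones), 1)
        if dist > threshold:
            flagged.append((word, dist))
    return flagged


# Hypothetical automatic transcript with per-word phoneme sequences.
transcript = [
    ("fast", ["F", "AE", "S", "T"], ["F", "AE", "S", "T"]),
    ("furrier", ["F", "ER", "IY", "ER"], ["F", "AO", "R", "IY", "EY"]),
]
flagged = flag_likely_errors(transcript)
```

Normalizing by pronunciation length is a design choice of this sketch so that one threshold can be applied to short and long words alike; the claim itself does not specify it.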


Regarding claim 13, Hazen does not teach but Park teaches wherein the respective confidence value of the respective word selected from the lexicon indicative of likelihood of error denotes the most likely match within the lexicon (best path, Section 5, ¶ 2). 
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model with Park’s speech recognition to increase accuracy of audio containing specific vocabularies.
 

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Hazen (Automatic Alignment and Error Correction of Human Generated Transcripts for Long Speech Recordings, 2006) in view of Park et al. (Automatic Processing Of Audio Lectures For Information Retrieval: Vocabulary Selection And Language Modeling, 2005) and further in view of Gurbani et al. (US 2021/0142789 – see provisional date of 62/932949).


Regarding claim 12, Hazen and Park do not teach but Gurbani teaches further comprising receiving a correction of the respective selected word, and updating the lexicon and the model with the correction (updating of the verification dataset to update the parametric values of the model [0097-0098]).  
It would have been obvious to one of ordinary skill in the art before the filing/effective filing date to combine Hazen’s alignment model, as modified by Park’s speech recognition to increase accuracy of audio containing specific vocabularies, with Gurbani’s word disambiguation resolution to improve prediction of low confidence areas of transcription.


Allowable Subject Matter
Claims 14 and 16-20 are allowed.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 


Any inquiry concerning this communication or earlier communications from the examiner should be directed to Matthew H Baker whose telephone number is (571)270-1856. The examiner can normally be reached Monday-Friday 9-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MATTHEW H BAKER/               Primary Examiner, Art Unit 2655