Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-22 are pending. Claims 1, 16 and 22 are independent.
This Application was published as U.S. 2021/0225389.
            Apparent priority: 17 January 2020.
The “computer storage medium” of Claim 22 is defined as “[0086] … The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.”  Including the “non-transitory” qualifier in the language of the Claim would remove the need to check the Specification to ascertain that transitory media are excluded.
Information Disclosure Statement
The listing of references in the PCT international search report is not considered to be an information disclosure statement (IDS) complying with 37 CFR 1.98. 37 CFR 1.98(a)(2) requires a legible copy of: (1) each foreign patent; (2) each publication or that portion which caused it to be listed; (3) for each cited pending U.S. application, the application specification including claims, and any drawing of the application, or that portion of the application which caused it to be listed including any claims directed to that portion, unless the cited pending U.S. application is stored in the Image File Wrapper (IFW) system; and (4) all other information, or that portion which caused it to be listed. In addition, each IDS must include a list of all patents, publications, applications, or other information submitted for consideration by the Office (see 37 CFR 1.98(a)(1) and (b)), and MPEP § 609.04(a), subsection I. states, “the list ... must be submitted on a separate paper.” Therefore, the references cited in the international search report have not been considered. Applicant is advised that the date of submission of any item of information or any missing element(s) will be the date of submission for purposes of determining compliance with the requirements based on the time of filing the IDS, including all “statement” requirements of 37 CFR 1.97(e). See MPEP § 609.05(a).
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 6-11 and 19-21 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claim 1 includes “wherein: (1) N is at least 2 …”
Claim 6, which depends from Claim 1 through Claim 2, includes:
6. The method of claim 2, wherein calculating the intelligibility score comprises: 
if N is 1, setting the intelligibility score to the conditional intelligibility value of the most intelligible recognition result; 
otherwise, if the most intelligible recognition result is the 1-best recognition result, setting the intelligibility score to the conditional intelligibility value of the most intelligible recognition result; 
otherwise, setting the intelligibility score to a combined value, 
wherein the combined value is based on the conditional intelligibility value of the most intelligible recognition result and the conditional intelligibility value of the 1-best recognition result. 
Once N is set in Claim 1 to be “at least 2,” Claim 6 cannot go back and ask “if N is 1.”  This creates indefiniteness.
Claims 7-11 depend from Claim 6 and inherit the indefiniteness.
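The conflict can be illustrated with a minimal Python sketch of the Claim 6 branching as recited. The function names, the convention that index 0 holds the 1-best result, and the use of an arithmetic mean for the "combined value" are assumptions made for illustration only; the Claim does not specify how the combination is computed.

```python
from typing import Sequence, Set, Tuple


def claim6_score(cond_values: Sequence[float]) -> Tuple[float, str]:
    """Follow the Claim 6 branches; return (score, branch taken).

    cond_values[i] is the conditional intelligibility value of the
    (i+1)-best recognition result, so index 0 is the 1-best result.
    """
    n = len(cond_values)
    if n < 1:
        raise ValueError("at least one recognition result is required")
    # Index of the most intelligible result (ties favor the 1-best result).
    best_idx = max(range(n), key=lambda i: cond_values[i])
    if n == 1:
        # Branch recited in Claim 6 but excluded by Claim 1 (N is at least 2).
        return cond_values[0], "N is 1"
    if best_idx == 0:
        return cond_values[0], "most intelligible is 1-best"
    # Assumption: the "combined value" is taken here as a simple mean.
    combined = (cond_values[best_idx] + cond_values[0]) / 2
    return combined, "combined"


def reachable_branches(min_n: int) -> Set[str]:
    """Branches hit by a few sample inputs whose length is at least min_n."""
    samples = [[0.9], [0.9, 0.4], [0.4, 0.9], [0.2, 0.8, 0.5]]
    return {claim6_score(s)[1] for s in samples if len(s) >= min_n}
```

With `min_n=2`, as Claim 1 requires, `reachable_branches` never reports the "N is 1" branch, which is the inconsistency noted above.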

Claim 19 has language similar to the language of Claim 6 and is rejected under similar rationale (independent Claim 16 asks that N is at least 2).  Claims 20-21 depend from Claim 19 and inherit the indefiniteness.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(d):
(d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

The following is a quotation of pre-AIA  35 U.S.C. 112, fourth paragraph:
Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA  35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

Claims 6-11 and 19-21 are rejected under 35 U.S.C. 112(d) or pre-AIA  35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which it depends, or for failing to include all the limitations of the claim upon which it depends.  Applicant may cancel the claim(s), amend the claim(s) to place the claim(s) in proper dependent form, rewrite the claim(s) in independent form, or present a sufficient showing that the dependent claim(s) complies with the statutory requirements.
See the 112(b) rejection above.
A dependent Claim has to further limit the claim from which it depends.
A claim saying N at least 3 would be further limiting.  
Claims 6 and 19, which address the case of N=1, depart from the scope of their independent Claims.  

(In this situation, for example, Loukina is sufficient to teach N=1 whereas a reference that teaches an N at least 2 had to be added for the rejection of Claims 1 and 16.) 
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-2, 6, 8-9, 12, 14, 16-20, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Loukina (U.S. 2015/0248898) in view of Kurata (U.S. 2014/0358533) (“Kurata2014”) and further in view of Kurata (U.S. 2008/0221890) (“Kurata2008”).
Regarding Claim 1, Loukina teaches:
1. A speech intelligibility scoring method, [Loukina, Figures 1 or 2, “speech intelligibility determination engine 102/202.”]
comprising: 
receiving an acoustic signal encoding an utterance of a user, [Loukina, Figures 1 or 2, “non-native speech recording 104/204.”  Figure 6, 602.] 
wherein the utterance comprises a verbalization of a sample text by the user; [Loukina does not teach but suggests that the input speech is from a “sample text”:   “[0016] … With reference to FIG. 1, a computer implemented speech intelligibility determination engine 102 receives a speech sample to analyze, such as a recording of speech 104 by a non-native speaker….”]
generating, by an automatic speech recognition (ASR) module, [Loukina, Figure 2, “Automatic Speech Recognizer 206.”] [Figure 6, 604.]
an N-best output corresponding to the utterance, wherein the N-best output comprises N recognition results generated by the ASR module for the utterance, wherein N is a positive integer; and [N-best result generation is well-known in speech recognition but not mentioned in Loukina.]
generating an intelligibility score representing an intelligibility of the utterance based, at least in part, on the N-best output and the sample text, [Loukina, Figure 3, “Speech intelligibility score 322” which takes into account the “word acoustic score 322” and “string of words transcript 306.”] [See equation 1 at [0026]:  AMbowl is the “word acoustic score 322 for the word bowl,” CSbowl is “the automatic speech recognizer 304 confidence score outputted as a word likelihood score 310,” and equation 1 sets ISbowl which is the “intelligibility score 328 for the word “bowl”,” as equal to a function of AMbowl and CSbowl.]   [Figure 6, 610, 612.]
wherein: 
(1) N is at least 2, [Loukina generates one result which corresponds to N=1.  It just takes the top result.]
(2) the generating of the intelligibility score is further based on a confidence score, wherein the confidence score indicates a probability that a particular one of the recognition results is a correct transcription of the utterance, and/or [Loukina, Figure 3, “word likelihood scores 310.”  “[0020] …Word likelihood scores 310 are provided based on a confidence of the automatic speech recognizer 304 that it has identified the correct word in the transcript 306.”  “[0017] … The automatic speech recognizer 206 further provides a confidence score for the word (e.g., a score indicating how confident the recognizer 206 is that "bowl" is the word spoken in the recording 204)…..”  “15. The method of claim 1, wherein the automated speech recognizer further provides an acoustic model likelihood score for each phone within each word, a language model likelihood score for each word, and a confidence score for each word.”] [Figure 6, 604]
(3) the generating of the intelligibility score is further based on a pronunciation score for the utterance, wherein the pronunciation score indicates an extent to which the utterance exhibits correct pronunciation of the sample text. [Loukina, Figure 2, “Acoustic Scoring Model 214.”  Figure 3, “Acoustic Likelihood Scores for Phones of Words 308.”  “[0020] … The automatic speech recognizer provides a transcript 306 of a string of words recognized in the recording, acoustic likelihood scores 308 indicating confidence or quality of pronunciation of individual phones or words in the recording 302, and word likelihood scores indicating a confidence of the automatic speech recognizer 304 that it has correctly recognized those words in the recording 302….”  “15. The method of claim 1, wherein the automated speech recognizer further provides an acoustic model likelihood score for each phone within each word, a language model likelihood score for each word, and a confidence score for each word.”] [Figure 6, 608.]

Loukina uses the best recognition result and does not mention the n-best list.
Loukina does not expressly teach that the input is obtained by reading a text.
Regarding Claim 1, Kurata2014 teaches:
generating, by an automatic speech recognition (ASR) module, [Kurata2014, Figure 6, “600: Retrieve N-best List in Speech Recognition Results.”] an N-best output corresponding to the utterance, wherein the N-best output comprises N recognition results generated by the ASR module for the utterance, wherein N is a positive integer; and [Kurata2014, Figure 2, “Audio data” is input and N-Best List L is output from the “speech recognition system 200.”  “[0033] … In the embodiment of the present invention described below, the search unit 210 outputs a list of the word strings W with the top N speech recognition scores as an N-best list L.”  The “speech recognition score” teaches the “confidence score” of the Claim.]
…
wherein: 
(1) N is at least 2, [Kurata2014’s N-Best implies that N is 2 or more.  Otherwise, there would be no mention of N-best.  Figure 3 shows N=4 results for the same utterance.]
Loukina and Kurata2014 both pertain to speech recognition, and generation of an N-Best list, which is taught in Kurata2014, is a common feature of speech recognition.  It would have been obvious to combine the N-Best generation of Kurata2014 with the system of Loukina, which discusses the use of only the top result in obtaining the intelligibility measure, in order to end with a set (N-Best List) of intelligibility values, one for each recognition result, so that other manipulations may be performed in the case that the top result is not the correct result.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Neither reference mentions reading from a text.
Kurata2008 teaches:
receiving an acoustic signal encoding an utterance of a user, wherein the utterance comprises a verbalization of a sample text by the user; [Kurata2008, Figure 3.  Input speech and Input Text both being provided to the system and the input speech being a reading of the Input Text which is a training text.  See Figure 7: “select candidate character string S700” and “generation pronunciation candidate S710.”  “[0031] FIG. 3 shows the configuration of the word acquisition system 30 and an entire periphery thereof according to this embodiment. A speech and a text are inputted to the word acquisition system 30. These text and speech are of the content of a common event of a predetermined field. As for the predetermined fields, it is desirable to select one of fields expected to contain certain words that are to be registered in the dictionary for speech recognition or the like. For example, a text and a speech in a chemical field are used in a case where words in the chemical field are wished to be registered. Hereinafter, a speech and a text which have been inputted will be referred to as an input speech and an input text.”]
Loukina/Kurata2014 and Kurata2008 pertain to speech recognition and generating accurate pronunciations, and it would have been obvious to combine the reading of a predetermined training text from Kurata2008 with the system of the combination in order to have a measure of accuracy of the outcome by comparing the recognition result to the originally intended text.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Regarding Claim 2, Loukina does not discuss the N-Best list.
Kurata2014 teaches:
2. The method of claim 1, wherein generating the intelligibility score comprises: 
calculating N conditional intelligibility values corresponding, respectively, to the N recognition results, wherein each of the conditional intelligibility values represents a conditional intelligibility of the utterance assuming that the recognition result corresponding to the respective conditional intelligibility value is the correct transcription of the utterance; [Kurata2014, Figures 2, 3 and 4, showing the N-Best list L of recognition candidates.  Figures 3 and 4 show the input of the N-best list L and output of scores for each member of the list.  Each output is conditioned on one of the members of the list L and is a “conditional value.”  “[0042] FIG. 4 is a function block diagram of the reading accuracy-improving system 400 according to an embodiment of the present invention. This reading accuracy-improving system 400 includes a reading conversion unit 402, a reading score calculating unit 404, and an outputted word string selecting unit 406.”  The “reading/pronunciation accuracy” and “reading/pronunciation score” of Kurata2014 teach the “intelligibility” and “intelligibility score” of the Claim.  The specification refers to “reading score … 404” and the figures use “pronunciation score ... 404.”] [Figure 6, 600:  “[0056] …The process begins in Step 600, and the reading conversion unit 402 retrieves an N-best list L as the speech recognition results.”]
determining, based on the conditional intelligibility values, which of the recognition results is most intelligible; and [Kurata2014, Figure 4, the table lists two “Pronunciation Score: PY” values each for a different set of “word strings.”  “[0043] The reading conversion unit 402 receives an N-best list L as speech recognition results from the speech recognition system 200, and determines the reading for each listed candidate word string W….  phoneme strings are used as readings ….  phoneme string is used to handle the reading, the reading conversion unit 402 can determine the reading from a dictionary using the word notation in the speech recognition results as the key. When each candidate word string W in an N-best list L is listed with both a reading and a speech recognition score, the reading conversion unit 402 can extract the reading of each candidate word string W from the N-best list L.”] [Figure 6, 602, 604, “[0056] … Next, the reading conversion unit 402 determines the reading for each candidate word string listed on the N-best list L (Step 602).”  “[0057] Then, the reading score calculating unit 404 receives the reading information for each candidate word string from the reading conversion unit 402, and calculates the reading score for each of one or more candidate word strings with the same reading on the basis of Equation (2) (Step 604)….”] [Kurata2014 additionally teaches a “Mixed Score: PM” which can teach the “intelligibility score” of the Claim as well.  Equation 3 in [0051].]
calculating the intelligibility score based, at least in part, on the conditional intelligibility value of the most intelligible recognition result. [Kurata2014, Figure 4, one of the “Pronunciation Scores: PY” is higher which indicates the most readable/intelligible results from the N-Best list L.  See equation 2 in [0044].] [Figure 6, 606: “[0057] …Next, the candidate word string selecting unit 406 receives the reading score for each candidate word string from the reading score calculating unit 404, and the final candidate word string is selected from among the candidate word strings listed in the N-best list L based on the speech recognition score and reading score (Step 606)….”] [When the mixed score PM is used the candidate word string with the highest mixed score PM is output.  [0053].]
Rationale as provided for Claim 1 because this Claim expands on the feature that was mapped to the secondary reference and the details of the feature come from the same reference under the same rationale.
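The Claim 2 steps mapped above (one conditional value per N-best entry, then selection of the most intelligible entry) can be sketched as follows. This is a minimal illustration, not the method of Kurata2014 or of the Applicant: `most_intelligible` and the stand-in scorer `word_overlap` are hypothetical names, and the word-overlap measure is an assumption chosen only to make the sketch runnable.

```python
from typing import Callable, List, Tuple


def word_overlap(hyp: str, ref: str) -> float:
    """Hypothetical stand-in scorer: fraction of reference words found in hyp."""
    ref_words = ref.lower().split()
    hyp_words = set(hyp.lower().split())
    if not ref_words:
        return 0.0
    return sum(w in hyp_words for w in ref_words) / len(ref_words)


def most_intelligible(
    n_best: List[str],
    sample_text: str,
    score_fn: Callable[[str, str], float] = word_overlap,
) -> Tuple[int, List[float]]:
    """Return (index of the most intelligible result, all conditional values).

    Each value is computed as if the corresponding N-best entry were the
    correct transcription, then compared against the sample text.
    """
    if not n_best:
        raise ValueError("the N-best output must contain at least one result")
    values = [score_fn(hyp, sample_text) for hyp in n_best]
    best_idx = max(range(len(values)), key=lambda i: values[i])
    return best_idx, values
```

For example, with the N-best list `["the bowl is fool", "the bowl is full"]` and the sample text `"the bowl is full"`, the second entry is selected even though it is not the 1-best result.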

Regarding Claim 6, Loukina teaches:
6. The method of claim 2, wherein calculating the intelligibility score comprises: 
if N is 1, setting the intelligibility score to the conditional intelligibility value of the most intelligible recognition result; [Loukina generates one result which corresponds to N=1.  It just takes the top result.]
otherwise, if the most intelligible recognition result is the 1-best recognition result, setting the intelligibility score to the conditional intelligibility value of the most intelligible recognition result; [Loukina does not teach having an N-best recognition list all evaluated for intelligibility but selecting the highest score is a design choice.  (Also taught by Kurata2014.)]
otherwise, setting the intelligibility score to a combined value, 
wherein the combined value is based on the conditional intelligibility value of the most intelligible recognition result and the conditional intelligibility value of the 1-best recognition result. 
Loukina does not teach an N-Best list of results.
Kurata2014 teaches or suggests:
if N is 1, setting the intelligibility score to the conditional intelligibility value of the most intelligible recognition result; [Kurata2014 generates results for each of the N-Best list L.  If N=1 then there will be only a single result.]
otherwise, if the most intelligible recognition result is the 1-best recognition result, setting the intelligibility score to the conditional intelligibility value of the most intelligible recognition result; [Kurata2014 teaches having an N-best recognition list all evaluated for intelligibility and selects the member of the list with either the highest pronunciation score PY or the highest mixed score PM.]
otherwise, setting the intelligibility score to a combined value, [Kurata2014 teaches a mixed/combined intelligibility score: “[0051] In the first selection method, the outputted word string selecting unit 406 may weight and add together the speech recognition score PT (W, X) and corresponding reading score PY (Yomi(W), X) for each candidate word string W using weight a to obtain a new score (referred to below as the "mixed score") PM(W, X), and select the candidate word string W with the highest mixed score PM(W, X). ….”]
wherein the combined value is based on the conditional intelligibility value of the most intelligible recognition result and the conditional intelligibility value of the 1-best recognition result. [Kurata2014 suggests this option by teaching that its readability/pronunciation/intelligibility scores may be combinations of values for “two or more candidate word strings.”   “[0045] Instead of using this reading score PY (Y, X) calculating method, the reading score calculating unit 404 may calculate the reading score PY (Y, X) in Equation 2 using the same reading for two or more candidate word strings with partial tolerable differences in reading….”  The difference between the two candidate word strings is small and thus it is likely that if the “most intelligible” does not coincide with the first best ASR result, it is close to it and the reading/pronunciation/intelligibility score will also be close.]
Rationale for combination as provided for Claim 1.  N-Best list was from Kurata2014 and the particulars come from the same reference under the same rationale.
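The weighted "mixed score" selection quoted from Kurata2014 at [0051] can be sketched as below. Equation 3 of Kurata2014 is not reproduced in this action, so the convex-combination form `alpha * PT + (1 - alpha) * PY` is an assumption, as are the function names and the example values.

```python
from typing import List, Tuple


def mixed_score(pt: float, py: float, alpha: float) -> float:
    """Weight and add a speech recognition score PT and a reading score PY.

    Assumption: a convex combination with weight alpha in [0, 1]; the exact
    form of the reference's Equation 3 is not given in this document.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * pt + (1.0 - alpha) * py


def select_by_mixed_score(
    candidates: List[Tuple[str, float, float]], alpha: float
) -> Tuple[str, float]:
    """Pick the candidate word string with the highest mixed score.

    Each candidate is (word_string, PT, PY).
    """
    if not candidates:
        raise ValueError("no candidates supplied")
    scored = [(w, mixed_score(pt, py, alpha)) for w, pt, py in candidates]
    return max(scored, key=lambda item: item[1])
```

With equal weighting, a candidate that trails slightly on the recognition score can still be selected if its reading score is markedly higher, which is the behavior the mapping above relies on.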

Regarding Claim 8, Loukina teaches:
8. The method of claim 6, 
wherein the confidence score indicates the probability that the most intelligible recognition result is the correct transcription of the utterance, and [Loukina, Figure 2, this is the definition of “confidence score” and “[0017] …The automatic speech recognizer 206 further provides a confidence score for the word (e.g., a score indicating how confident the recognizer 206 is that "bowl" is the word spoken in the recording 204)….”]
wherein calculating the intelligibility score further comprises adjusting the intelligibility score based on the confidence score and/or the pronunciation score. [Loukina, equation 1 at [0026]:  AMbowl is the “word acoustic score 322 for the word bowl,” CSbowl is “the automatic speech recognizer 304 confidence score outputted as a word likelihood score 310,” and equation 1 sets ISbowl which is the “intelligibility score 328 for the word “bowl”,” as equal to a function of AMbowl and CSbowl.  Figure 6, 610, 612.]

Regarding Claim 9, Loukina teaches:
9. The method of claim 8, 
wherein adjusting the intelligibility score comprises changing the intelligibility score by a first penalty value, [Loukina, equation 1 at [0026]; see the images reproduced in the original action (media_image1.png, media_image2.png, media_image3.png), which are not rendered here.]
wherein the first penalty value is determined based on the larger of the confidence score and the pronunciation score. [Loukina, equation 1 at [0026], the two factors are the confidence score CS and the acoustic score AM and either of them which is larger can be considered the “first penalty value” of the Claim.  This limitation is merely presenting a definition.]
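For orientation, the Claim 9 limitation can be read as in the following minimal sketch. Neither the Claim nor Loukina's equation 1, as discussed here, fixes the form of the penalty, so the linear penalty, the `max_penalty` parameter, the assumption that scores lie in [0, 1], and the function names are all hypothetical.

```python
def first_penalty(confidence: float, pronunciation: float,
                  max_penalty: float = 0.5) -> float:
    """Penalty determined from the larger of the two scores.

    Assumption: both scores lie in [0, 1] and the penalty shrinks linearly
    as the larger score approaches 1.
    """
    larger = max(confidence, pronunciation)
    return max_penalty * (1.0 - larger)


def adjust_score(intelligibility: float, confidence: float,
                 pronunciation: float, max_penalty: float = 0.5) -> float:
    """Change the intelligibility score by the first penalty value."""
    penalty = first_penalty(confidence, pronunciation, max_penalty)
    # Clamp at zero so the adjusted score stays non-negative.
    return max(0.0, intelligibility - penalty)
```

Under these assumptions only the larger of the two scores influences the adjustment; the smaller score is ignored.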

Regarding Claim 12, Loukina teaches:
12. The method of claim 1, further comprising: 
generating, by the automatic speech recognition (ASR) module, respective confidence scores for each of the N recognition results, wherein each respective confidence score indicates a probability that the corresponding recognition result is a correct transcription of the utterance. [Loukina generates a confidence score for the recognition result: CS.  “[0017] … The automatic speech recognizer 206 further provides a confidence score for the word (e.g., a score indicating how confident the recognizer 206 is that "bowl" is the word spoken in the recording 204). …”  See also [0026] provided above:  “CSbowl is the automatic speech recognizer 304 confidence score outputted as a word likelihood score 310.”]
Loukina does not teach generation of an N-Best list.
Kurata2014 teaches:
generating, by the automatic speech recognition (ASR) module, respective confidence scores for each of the N recognition results, wherein each respective confidence score indicates a probability that the corresponding recognition result is a correct transcription of the utterance. [Kurata2014, Figures 3 and 4, each of the members of the N-Best list L has its own confidence score associated.  “[0033] The search unit 210 integrates these models to find a word string hypothesis with the greatest observed likelihood for the feature value X of a time series outputted by the feature extraction unit 202. More specifically, the search unit 210 determines the final word string W to be outputted on the basis of a final score (referred to below as the "speech recognition score") obtained from a comprehensive evaluation of the acoustic probability provided by the acoustic model and the linguistic probability provided by the linguistic model. In the embodiment of the present invention described below, the search unit 210 outputs a list of the word strings W with the top N speech recognition scores as an N-best list L.”]
Rationale for combination as provided for Claim 1.

Regarding Claim 14, Loukina teaches:
14. The method of claim 1, further comprising: 
generating, by a pronunciation assessment module, the pronunciation score for the utterance. [Loukina, Figure 3, “[0020] … In an example, the acoustic likelihood scores 308 are computed as average raw values of acoustic model likelihood scores for each phone or more complex measures such as goodness of pronunciation scores based on acoustic model likelihood scores and prior probabilities for each phone….”]

Claim 16 is a system claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale.  Additionally:
16. A system comprising: 
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: [Loukina, Figures 7A, 7B, and 7C, “[0034] FIG. 7C shows a block diagram of exemplary hardware for a standalone computer architecture 750, such as the architecture depicted in FIG. 7A that may be used to include and/or implement the program instructions of system embodiments of the present disclosure. A bus 752 may serve as the information highway interconnecting the other illustrated components of the hardware. A processing system 754 labeled CPU (central processing unit) (e.g., one or more computer processors at a given computer or at multiple computers), may perform calculations and logic operations required to execute a program. A non-transitory processor-readable storage medium, such as read only memory (ROM) 758 and random access memory (RAM) 759, may be in communication with the processing system 754 and may include one or more programming instructions for performing the method of generating an intelligibility score for speech. Optionally, program instructions may be stored on a non-transitory computer-readable storage medium such as a magnetic disk, optical disk, recordable memory device, flash memory, or other physical storage medium.”]
…

Claim 17 is a system claim with limitations corresponding to the limitations of Claim 2 and is rejected under similar rationale.

Regarding Claim 18, Loukina teaches:
18. The system of claim 17, wherein calculating the intelligibility score comprises: initializing the intelligibility score based, at least in part, on the conditional intelligibility value of the most intelligible recognition result. [Loukina calculates an intelligibility score based on the single recognition score and this Claim does not define “initializing” and leaves it to Claim 19.  Claim 18 includes a single limitation of “initializing the intelligibility score” where this limitation is defined in Claim 19, which has limitations similar to the limitations of Claim 6.]

Claim 19 is a system claim with limitations corresponding to the limitations of Claim 6 and is rejected under similar rationale.
19. The system of claim 18, wherein initializing the intelligibility score comprises: 
…

Claim 20 is a system claim with limitations corresponding to the limitations of Claim 8 and is rejected under similar rationale.

Claim 22 is a computer storage medium claim with limitations corresponding to the limitations of Claim 1 and is rejected under similar rationale.  Additionally:
22. A computer storage medium having instructions stored thereon that, when executed by a data processing apparatus, cause the data processing apparatus to perform operations comprising: [Loukina, Figure 7C, “[0034] FIG. 7C shows a block diagram of exemplary hardware for a standalone computer architecture 750, such as the architecture depicted in FIG. 7A that may be used to include and/or implement the program instructions of system embodiments of the present disclosure. A bus 752 may serve as the information highway interconnecting the other illustrated components of the hardware. A processing system 754 labeled CPU (central processing unit) (e.g., one or more computer processors at a given computer or at multiple computers), may perform calculations and logic operations required to execute a program. A non-transitory processor-readable storage medium, such as read only memory (ROM) 758 and random access memory (RAM) 759, may be in communication with the processing system 754 and may include one or more programming instructions for performing the method of generating an intelligibility score for speech. Optionally, program instructions may be stored on a non-transitory computer-readable storage medium such as a magnetic disk, optical disk, recordable memory device, flash memory, or other physical storage medium.”]
…
Claims 3-5 are rejected under 35 U.S.C. 103 as being unpatentable over Loukina, Kurata2014, and Kurata2008 and further in view of Stoimenov (U.S. 2005/0114131).
Regarding Claim 3, Loukina, Kurata2014, and Kurata2008 do not mention normalizing or alignment.
Stoimenov teaches:
3. The method of claim 2, wherein calculating the N conditional intelligibility values comprises: 
normalizing the sample text and the N recognition results; and [Stoimenov, Text Normalization.  Figure 1, generation of “Sounds Like” text and Figure 4A, input of “50m” converted to “fifty meters” as the normalized text. “[0021] As the user enters the alphanumeric input in the voice-tag field 52, the parser operates automatically on the alphanumeric input and returns normalized text in a parsed text field 62. A "sounds-like" field 64 is initially automatically filled in with text identical to the alphanumeric input entered in the voice-tag field 52. The user may view the normalized text to determine if the parser correctly parsed the alphanumeric input and select a desired entry from the parsed text field 62. In other words, the user may wish that the voice tag "50 m" be associated with the spoken input "fifty meters." Therefore, the user selects "fifty meters" from the parsed text field 62. ...”] (Note the definition of the “normalization” from the instant Application:  “[0052] … Any suitable normalization techniques may be used, including but not limited to converting numeric text (representing numbers) into alphabetic text (representing the same numbers), converting text representing dates and/or times to a uniform format, converting text representing currency amounts to a uniform format, etc.”)
for each of the N normalized recognition results: 
aligning the respective normalized recognition result to the normalized sample text on a word-for-word basis, [Stoimenov mentions <AlignType> in the programming appendix A which presents the program for measuring the distance between two transcriptions.  “[0029] … Source code for an exemplary measure distance method is provided at Appendix A….”  Also implies word by word alignment in the following paragraph:  “[0028] One known problem in speech recognition is confusable speech entries. In the context of voice-tagging, confusable speech entries are phrases in the lexicon that are very close in pronunciation. In one scenario, one or more isolated words such as "car" and "card" may have confusingly similar pronunciations. Similarly, certain combinations of words may have confusingly similar pronunciations. Another problem of speech recognition is unbalanced phrase lengths. Unbalanced phrase lengths can occur when there are some phrases in the lexicon that are very short and some phrases that are very long….”  “[0025] …For example, as shown in FIG. 4B, the word "individual" may have more than one possible pronunciation….”  ]
calculating an error rate of the aligned recognition result based on a number of errors in the aligned recognition result relative to the normalized sample text, and [Stoimenov measures the error based on “edit distance.”  “[0029] In order to compensate for confusingly similar entries, the present invention may incorporate technology to measure the similarity of two or more transcriptions. For example, a measure distance may be generated that indicates the similarity of two or more transcriptions. A measure distance of zero indicates that two confusingly similar entries are identical. In other words, measure distance increases as similarity decreases….”]
calculating the respective conditional intelligibility value based on the error rate of the aligned recognition result. [Stoimenov, Figure 5, “Problem? 124” occurs when the pronunciation is ambiguous (intelligibility score is low) and the method that Stoimenov uses for determining ambiguity of pronunciation is edit distance.  “[0033] An exemplary disambiguating process 120 for a voice-tag editor is shown in FIG. 5. The user selects a voice-tag at step 122. The voice-tag editor determines whether the selected voice-tag is problematic at step 124. For example, the voice-tag editor may determine if the selected voice-tag is confusingly similar with another voice-tag, has an unbalanced phrase length, or is hard-to-pronounce as described above. If the selected voice-tag is not problematic, the user may proceed to add the selected voice-tag to the lexicon at step 126. If the selected voice-tag is problematic, the voice-tag editor proceeds to step 128. At step 128, the voice-tag editor notifies the user of the problem with the selected voice-tag. For example, the disambiguate button 112 of FIG. 4C may be initially unavailable to the user.”]

Loukina/Kurata2014/Kurata2008 and Stoimenov pertain to the field of evaluation of pronunciation and intelligibility, and it would have been obvious to combine the evaluation method of Stoimenov with the system of the combination as one known method of evaluating the accuracy of speech recognition.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
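For illustration only, the sequence of steps recited in Claim 3 (normalize, align word for word, compute an error rate, derive a conditional intelligibility value) can be sketched as follows. This is a minimal sketch and is not taken from the instant Application or from Stoimenov. The number table, the function names, and the mapping of the value to one minus the error rate are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny number table used only for this illustration.
_NUMBER_WORDS = {"2": "two", "5": "five", "50": "fifty"}


def normalize(text):
    """Lowercase, drop punctuation, and spell out the few digits listed above."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [_NUMBER_WORDS.get(tok, tok) for tok in tokens]


def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, insertions, and deletions
    found by a word-for-word dynamic-programming alignment."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def conditional_intelligibility(sample_text, recognition_result):
    """Assumed mapping: 1 minus the word error rate, floored at zero."""
    ref = normalize(sample_text)
    hyp = normalize(recognition_result)
    if not ref:
        return 0.0
    error_rate = word_errors(ref, hyp) / len(ref)
    return max(0.0, 1.0 - error_rate)
```

Under these assumptions, a recognition result that matches the normalized sample text exactly yields a value of 1.0, and each word-level error lowers the value in proportion to the length of the sample text.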

Regarding Claim 4, Loukina, Kurata2014, and Kurata2008 do not mention alignment or edit distance.
Stoimenov teaches: 
4. The method of claim 3, wherein aligning the respective normalized recognition result to the normalized sample text on a word-for-word basis comprises: calculating a distance between the normalized sample text and the respective normalized recognition result. [Stoimenov, aligns the transcriptions in order to find the edit distance between the two.  “[0029] In order to compensate for confusingly similar entries, the present invention may incorporate technology to measure the similarity of two or more transcriptions. For example, a measure distance may be generated that indicates the similarity of two or more transcriptions. A measure distance of zero indicates that two confusingly similar entries are identical. In other words, measure distance increases as similarity decreases. The measure distance may be calculated using a variety of suitable methods. Source code for an exemplary measure distance method is provided at Appendix A….”]
Rationale as provided for Claim 3.  The Claim expands upon a feature mapped to Stoimenov in the preceding Claim.

Regarding Claim 5, Loukina, Kurata2014, and Kurata2008 do not mention edit distance.
Stoimenov teaches:
5. The method of claim 4, wherein the distance is an edit distance. [Stoimenov, “[0029] …One method measures the number of edits that would be necessary to make a first transcription identical to a second transcription. "Edits" refers to insert, delete, and replace operations. Each particular edit may have a corresponding penalty….”]
Rationale as provided for Claim 3.  The Claim expands upon a feature mapped to Stoimenov in the preceding Claim.
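As an illustration of the edit distance discussed for Claims 4 and 5, the following minimal sketch counts insert, delete, and replace operations, each with its own penalty, as described in general terms in Stoimenov [0029]. It is not the source code of Stoimenov's Appendix A, which is not reproduced in this Action. The function name and the default penalty values are assumptions.

```python
def weighted_edit_distance(first, second, insert_penalty=1.0,
                           delete_penalty=1.0, replace_penalty=1.0):
    """Cost of the cheapest sequence of edits turning `first` into `second`.

    `first` and `second` may be strings (compared character by character)
    or lists of tokens such as words or phonemes.
    """
    rows, cols = len(first) + 1, len(second) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * delete_penalty   # delete everything in `first`
    for j in range(1, cols):
        dist[0][j] = j * insert_penalty   # insert everything in `second`
    for i in range(1, rows):
        for j in range(1, cols):
            same = first[i - 1] == second[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + delete_penalty,
                dist[i][j - 1] + insert_penalty,
                dist[i - 1][j - 1] + (0.0 if same else replace_penalty),
            )
    return dist[-1][-1]
```

With unit penalties, the "car" and "card" example quoted from Stoimenov [0028] is one insert apart, and identical inputs have a distance of zero.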

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Loukina, Kurata2014, and Kurata2008 further in view of Kobal (U.S. 2012/0166193).

Regarding Claim 7, Loukina, Kurata2014, and Kurata2008 do not teach combining N-Best values.  Kurata2014 teaches, in Figure 4, a Mixed Score PM, shown by Equation 3 in [0051], which is a weighted sum of the speech recognition score and the readability/pronunciation score for a candidate word string.  It does not teach combining the scores of the N-best results or combining the top intelligibility/pronunciation/readability score with the intelligibility/pronunciation/readability score of the top speech recognition result (the 1-best out of the N-best).
Kobal teaches:
7. The method of claim 6, 
wherein the combined value is a weighted sum of the conditional intelligibility value of the most intelligible recognition result and the conditional intelligibility value of the 1-best recognition result. [Kobal, Figure 1 showing generation of N-Best results.  Figure 3, 302, “accuracy rating” as a weighted sum of the speech recognition confidence values of the N-Best results.  “[0012] … The method can also include determining an accuracy rating for determining a transcription priority. The accuracy rating, more particularly, can provide a weighting of a confidence score by confidence measures of closest matching neighbor recognition results….”  “[0034] At step 206, a transcription of the information can be prioritized based on a category. Notably, the prioritizing identifies spoken utterances having a transcription priority based on recognition results. For example, referring to FIG. 1, the prioritizing is based on a category such as an accuracy rating, wherein the accuracy rating is a weighting of the confidence score by the N-best matches….”  See also [0036]-[0037].]
Loukina/Kurata2014/Kurata2008 and Kobal pertain to or include speech recognition, and it would have been obvious to combine the method of Kobal, which combines its speech recognition confidence values as a weighted sum, with the system of the combination in order to hedge against other factors that may affect the confidence value.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
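For illustration only, the weighted sum recited in Claim 7 can be sketched as below. The function name, the convention that the first list entry belongs to the 1-best recognition result, and the default weight of 0.5 are assumptions made for the example and are not drawn from the instant Application or from Kobal.

```python
def combined_intelligibility(conditional_values, weight_best=0.5):
    """Weighted sum of the highest conditional intelligibility value and
    the value of the 1-best result (assumed to be conditional_values[0])."""
    if not conditional_values:
        raise ValueError("at least one conditional intelligibility value is required")
    if not 0.0 <= weight_best <= 1.0:
        raise ValueError("weight_best must lie in [0, 1]")
    most_intelligible = max(conditional_values)
    one_best = conditional_values[0]
    return weight_best * most_intelligible + (1.0 - weight_best) * one_best
```

When the 1-best result is also the most intelligible result, the combined value equals that result's own value regardless of the weight.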

Claims 10-11, 13, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Loukina, Kurata2014, and Kurata2008 further in view of Strope (U.S. 2010/0004930).
Regarding Claim 10, as provided in the rejection of Claim 9, the intelligibility score of Loukina is a product of several factors (equation 1 at [0026]), each of which could be mapped to a “penalty value,” and because there is more than one multiplier factor, both first and second penalty values are taught by Loukina.  Loukina, however, does not teach conditioning the multiplication on the pronunciation score being less than a threshold.  Neither does Kurata2014 or Kurata2008.
Strope teaches:
10. The method of claim 9, 
wherein adjusting the intelligibility score further comprises changing the intelligibility score by a second penalty value if the pronunciation score is less than a threshold value, [Strope combines recognition results from different SRSs (speech recognition systems), which may be parallel processes within the same speech recognizer, such that, while the plurality of results is not an N-Best list from the same recognition process, the results resemble an N-Best list from a single recognizer.  Figure 9 of Strope shows a graph of confidence scores vs. a weight associated with the confidence score and indicates that the results of the recognizers that have a high error rate (Low Confidence) are further discounted by a negative/discounting weight.  A vertical line on the graph indicates a “Threshold” between values that are discounted and those that are boosted:  “[0126] The graph 900 illustrates an example function, or algorithm, for assigning weights to particular SRS. The y-axis of the graph 900 indicates the error rates associated with SRS's, and the x-axis indicates the weight associated with the SRS's. In this example, a discounting weight (e.g., 0.9, 0.95, 0.8) is used to weight SRS's (e.g., SRSA, SRSE, SRSC) that have an error rate above a determined threshold. A boost weight (e.g., 1.01, 1.04, 1.1) is used to weight SRS's (e.g., SRSB) that have an error rate below the threshold. In this example, a neutral weight (e.g., 1) is used to weight SRS's that fall on the error threshold (e.g., SRSD).”]
wherein the second penalty value is determined based on the pronunciation score. [Strope, Figure 9 shows that the “Negative Weight (Discount)” changes as a function of the confidence score.] [See also: for dynamic varying of the weight:  “[0127] In some implementations, the error rate associated with each SRS may be updated based on confirmation that the recognition result is incorrect (e.g., the result is selected as the final recognition result and is rejected by a user, a first result is selected as the final recognition result and is determined to be correct based on a user's acceptance so the unselected results are recorded as erroneous results, etc.). The selection module 113 can dynamically vary the weight based on the updated error rate associated with each SRS.”]
Loukina/Kurata2014/Kurata2008 and Strope pertain to or include speech recognition, and it would have been obvious to combine the method of Strope, which is directed to various schemes for combining confidence scores from a plurality of speech recognition processes, with the system of the combination, which includes a plurality of confidence scores from an N-Best list.  The situations, while not identical, are quite similar and the same considerations apply.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
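For illustration only, the limitation of Claim 10 (a penalty applied only when the pronunciation score is below a threshold, with the size of the penalty depending on the pronunciation score) can be sketched as below. The function name, the linear scaling of the penalty, the subtraction form, and the default values are assumptions made for the example. They are not drawn from the instant Application, Loukina, or Strope.

```python
def apply_pronunciation_penalty(intelligibility, pronunciation,
                                threshold=0.5, max_penalty=0.2):
    """Lower the intelligibility score only when the pronunciation score
    falls below `threshold`; the penalty grows with the shortfall."""
    if threshold <= 0.0:
        raise ValueError("threshold must be positive")
    if pronunciation >= threshold:
        return intelligibility  # at or above threshold: no penalty
    # Assumed linear rule: shortfall of 0 gives no penalty, shortfall of 1
    # (pronunciation score of 0) gives the full max_penalty.
    shortfall = (threshold - pronunciation) / threshold
    penalty = max_penalty * shortfall
    return max(0.0, intelligibility - penalty)
```

A multiplicative discount, in the manner of the discounting weights quoted from Strope [0126], would be an equally plausible choice for the penalty rule.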

Regarding Claim 11, this Claim has scope similar to Claim 10 except that Claim 11 depends from Claim 8 as opposed to Claim 10 which depends from Claim 9.
11. The method of claim 8, 
wherein adjusting the intelligibility score comprises changing the intelligibility score by a penalty value if the pronunciation score is less than a threshold value, 
wherein the penalty value is determined based on the pronunciation score. 

Regarding Claim 13, Loukina does not teach an N-Best list, and Kurata2014 and Kurata2008 do not teach raising the confidence level of a recognition result to a threshold level if the recognition result falls in the N-Best list.
Strope teaches and therefore suggests:
13. The method of claim 12, further comprising: 
for each of the confidence scores, determining whether the respective confidence score is less than a threshold value, and if so, setting the respective confidence score to the threshold value. [Strope, Figure 5B generates a running average of confidence values for each of the results.  This means that it does not apply a threshold below which a result is discarded.  Thus, the embodiment of Figure 5B is similar to the situation of the Claim, where a result would not be discarded even if its associated confidence is below the threshold.  Another way of looking at this Claim is that it removes the threshold requirement and sets the threshold to whatever the confidence happens to be.  Either way, no result is excluded and all of the N-Best results are retained regardless of how low their confidence is.]
Rationale similar to that provided for Claim 10.  Strope is directed to various methods of combining the confidence scores of a plurality of results of a plurality of speech recognizers as well as methods of keeping some confidence values in and discarding others.  Claims 10-11 and 13 are directed to various methods of keeping some confidence values in and others out as well as combining the confidence values in a meaningful way.
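For illustration only, the operation recited in Claim 13 amounts to applying a floor to each confidence score. A minimal sketch follows, with a hypothetical function name and example values.

```python
def floor_confidences(confidence_scores, threshold):
    """Raise every confidence score that is below `threshold` up to
    `threshold`; scores at or above the threshold are left unchanged."""
    return [max(score, threshold) for score in confidence_scores]
```

Under this sketch, every one of the N recognition results is retained, which is the property the mapping to Strope relies on.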

Claim 21 is a system claim with limitations corresponding to the limitations of Claims 9 and 10 and is rejected under similar rationale.

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Loukina, Kurata2014, and Kurata2008 further in view of Evermann (U.S. 2014/0365209).
Regarding Claim 15, Loukina does not teach generation of an N-Best list, and Kurata2014, which was cited for the teaching of this feature, does not teach N=5.
Evermann teaches:
15. The method of claim 1, wherein N is greater than 1 and less than or equal to 5. [Evermann, Figure 4, “[0157] … In some implementations, the plurality of candidate text strings correspond to the 5 best, 10 best, or any other appropriate number of text strings.”]
Loukina/Kurata2014/Kurata2008 and Evermann pertain to or include speech recognition and generation of an N-Best list, and it would have been obvious, for completeness, to combine with the system of the combination the express teaching of N=5 from Evermann, which indicates that 5 is a number used in generating this list.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Evermann (U.S. 2014/0365209): N-Best
generating, by an automatic speech recognition (ASR) module, [Evermann, Figure 4, “speech input 400” is converted to text by the “speech to text processing.”  “[0122] FIG. 4 illustrates a speech input 400, corresponding to the speech input "show times for Argo," undergoing speech-to-text (STT) processing. (In some implementations, the STT processing is performed by the STT processing module 330.) The results 404 of the STT processing include a plurality of candidate text strings 406 each associated with a speech recognition confidence score 408. The speech recognition score represents a confidence that the candidate text string is a correct transcription of the speech input….”] an N-best output corresponding to the utterance, wherein the N-best output comprises N recognition results generated by the ASR module for the utterance, wherein N is a positive integer; and [Evermann, Figure 4, an n-best list of “candidate text strings 406” is generated each having its corresponding “recognition score.”  “[0157] In some implementations, the plurality of candidate text strings correspond to the n-best text strings generated by the speech recognition process, as determined by the speech recognition confidence scores of the text strings generated by the speech recognition process. ….”]
…
wherein: 
(1) N is at least 2, [Evermann, Figure 4, “[0157] … In some implementations, the plurality of candidate text strings correspond to the 5 best, 10 best, or any other appropriate number of text strings.”]

Peng (U.S. 2015/0161985).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499. The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir can be reached on 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659