DETAILED ACTION
The present application is being examined under the pre-AIA first to invent provisions.

Specification
The disclosure is objected to because of the following informalities:
In ¶[0001], Application Serial No. 16/158,900 should be updated as “now U.S. Patent No. 10,748,527 issued 18 August 2020”.  
Appropriate correction is required.

Information Disclosure Statement
The Information Disclosure Statement filed on 07 August 2020 is being edited for citations of Non-Patent Literature NPL 12, NPL 17, and NPL 19.  Here, publication dates and number of pages are being added to the citations.  Generally, any citation of non-patent literature must properly include at least a year of publication in order for it to be considered.

Claim Rejections - 35 USC § 103
The following is a quotation of pre-AIA  35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:
(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains.  Patentability shall not be negatived by the manner in which the invention was made.

Claims 1 to 3, 10 to 14, and 19 to 20 are rejected under pre-AIA  35 U.S.C. 103(a) as being unpatentable over Aronowitz (U.S. Patent Publication 2005/0010412) in view of Bickley et al. (U.S. Patent Publication 2003/0069729).
Concerning independent claims 1, 12, and 20, Aronowitz discloses a method and system for speech recognition comprising:
“generating an empirically derived acoustic confusability measure by processing example utterances and iterating from an initial estimate of the acoustic confusability measure to improve the measure” – phoneme confusion matrix training mechanism 800 includes a confusion matrix initializer 810 to initialize diagonal elements of the confusion matrix to a positive value close to but less than one and other elements with a small positive value; phoneme lattice constructor 830 may accept speech signals from a training database 820 (“example utterances”) and construct phoneme lattices for those speech signals; training database 820 may comprise speech signals (“example utterances”) and actual phoneme sequences for those signals; phoneme lattice search mechanism 840 may search a phoneme lattice to produce a phoneme sequence hypothesis for a speech signal in a forced alignment manner; confusion matrix updater 860 may comprise a confusion probability estimator to estimate confusion probabilities between phonemes (“an empirically derived acoustic confusability measure”) based on statistics obtained from force-aligned comparisons between actual and hypothetical phoneme sequences; these estimated confusion probabilities replace initial elements of the confusion matrix so that the confusion matrix may be updated (“iterating from an initial estimate of the acoustic confusability measure to improve the measure”); a 
Concerning independent claims 1, 12, and 20, Aronowitz discloses generating a confusion matrix comprising confusion probabilities (“confusability measure”) from a training database of speech signals.  These speech signals in a training database are “example utterances”, i.e., speech is an utterance.  A phoneme confusion matrix training mechanism 800 iterates from an initial estimate of a confusion matrix obtained from confusion matrix initializer 810 using confusion matrix updater 860 so as to meet a limitation of “iterating from an initial estimate of the acoustic confusability measure to improve the measure”.  Broadly, speech signals in a training database provide that a confusion matrix is “empirically derived” because training data is ‘empirical’.  However, Aronowitz omits a limitation of “using the acoustic confusability measure to selectively limit phrases to make recognizable by a speech recognition algorithm.”  Generally, Aronowitz subsequently uses this phoneme confusion matrix to recognize speech, but does not disclose applying a phoneme confusion matrix to recognize ‘phrases’, or to “selectively limit phrases to make recognizable”.
Concerning independent claims 1, 12, and 20, Bickley et al. teaches this limitation of “using the acoustic confusability measure to selectively limit phrases to make recognizable by a speech recognition algorithm.”  Specifically, Bickley et al. teaches that spoken phrases are deemed confusable if they sound alike.  Typical voice application software uses information on when the speech recognizer will confuse spoken phrases, also referred to as acoustic confusability by a speech recognizer.  A capability of predicting acoustic confusability can be used by a voice application to alert a user to choose a different name when voice enrolling names in an address book, thereby reducing a risk of inefficient or inaccurate voice command processing by a voice application.  (¶[0007])  Avoiding adding phrases to a list that may be confusable with each other reduces the possibility that a speech recognizer will erroneously recognize a given spoken phrase, i.e., one that appears on a list, as a different spoken phrase on the list, a mistake known as a substitution error.  (¶[0009])  A voice user interface, i.e., a call flow, is developed using text phrases representing voice commands.  Acoustic confusability predictability information can be used when developing voice applications to avoid using a voice command in a call flow that may be confusable with other voice commands, for efficient, accurate, and reliable call flow processing by a voice application speech recognizer.  (¶[0014])  Bickley et al., then, is directed to determining acoustic confusability of ‘phrases’ to “selectively limit phrases to make recognizable by a speech recognition application” because a voice application alerts a user to choose a Aronowitz to selectively limit phrases to make recognizable by a speech recognition application as taught by Bickley et al. for a purpose of providing efficient, accurate, and reliable voice commands to avoid errors.

Concerning claims 2 and 13, Bickley et al. teaches an embodiment where text representations of utterances can be directly used for predicting confusability.  (Abstract)  Predicting when a speech recognizer will confuse spoken phrases can include any combination of text form (text phrase/spelled form) and an audio file (audio data/phrase).  A ‘text form’ refers to a textual form of an utterance (text phrases) to be recognized by a speech recognizer.  A pair of spoken phrase representations is received, where any combination of text form of a spoken phrase and an audio file of a spoken phrase to be recognized by a speech recognizer is used for representing a pair of spoken phrase representations.  (¶[0019])  The invention can predict when a speech recognizer will confuse spoken phrases by directly using at least a text form of a spoken phrase.  (¶[0020])  The invention uses a confusability threshold value to determine when a speech recognizer will confuse two spoken text phrases.  (¶[0030])  A confusability of two text phrases is defined as a cost of transforming a string of phonemes corresponding to a first text phrase into another string of phonemes corresponding to a second text phrase.  (¶[0046])  Bickley et al., then, can determine a e.g., English.
Concerning claims 3 and 14, Aronowitz discloses that a phoneme lattice is constructed using speech recognition.  (¶[0018])  A phoneme lattice based speech processing system 100 may comprise a phoneme lattice constructor 110.  A system may transform an input speech signal into text by speech recognition.  (¶[0020]: Figure 1)  A phoneme lattice may be searched to produce at least one candidate textual representation of the input speech signal by speech recognition.  (¶[0021]: Figure 2)  Aronowitz, then, provides for “using the speech recognition application” to derive “the acoustic confusability measure”.
Concerning claims 10 to 11 and 19, Aronowitz discloses that a phoneme confusion matrix may comprise elements representing probabilities of one phoneme being confused with another.  (¶[0023]: Figure 3(b))  A phoneme confusion matrix may comprise a plurality of elements, which represent probabilities of each phoneme being confused with another.  (¶[0031]: Figure 7)  Confusion matrix updater 860 may comprise a confusion probability estimator to estimate confusion probabilities between phonemes based on statistics obtained by force-aligned comparisons.  (¶[0035]: Figure 8)  Broadly, a confusion matrix represents a set (“a family”) of probabilities that a phoneme, ph1, is confused with a phoneme, ph2.  A confusion matrix, then, represents “a family of probability models” that ‘models’ confusability between ‘a family’ of phonemes in a given language.  Applicants’ “model π = {p(d|t)}” for phonemes d and t, is simply an abstract mathematical way of expressing this confusion matrix, where d and t are any two phonemes, e.g., ph1 and ph2, p(d|t) is a probability of confusion of a specific phoneme d with a specific phoneme t, and {p(d|t)} represents a set of all of these values of confusability for all of the phonemes d and t in a given language.
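For illustration only, the following minimal sketch shows how a confusion matrix of the kind described above can be read as a set of values {p(d|t)}.  It is not code from Aronowitz or from Applicants’ disclosure.  The function names, the smoothing constant, the initial diagonal value of 0.9, and the reading of d as a decoded phoneme and t as a true phoneme are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

def init_confusion(phonemes, diag=0.9):
    """Initial estimate: diagonal close to but below one, small positive elsewhere.

    The value 0.9 is an assumed placeholder, not a value taken from any reference.
    """
    off = (1.0 - diag) / (len(phonemes) - 1)
    return {t: {d: (diag if d == t else off) for d in phonemes} for t in phonemes}

def update_confusion(aligned_pairs, phonemes, smoothing=1e-3):
    """Re-estimate p(d|t) from (true, decoded) phoneme pairs.

    aligned_pairs is assumed to come from a forced alignment of true and
    hypothesized phoneme sequences; the additive smoothing is an assumption
    that keeps every entry strictly positive.
    """
    counts = defaultdict(Counter)
    for t, d in aligned_pairs:
        counts[t][d] += 1
    model = {}
    for t in phonemes:
        total = sum(counts[t].values()) + smoothing * len(phonemes)
        model[t] = {d: (counts[t][d] + smoothing) / total for d in phonemes}
    return model
```

Each row model[t] is a probability distribution over decoded phonemes d, so the whole table is one concrete instance of a set {p(d|t)} over all phoneme pairs.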

Claims 4 to 8 and 15 to 18 are rejected under pre-AIA  35 U.S.C. 103(a) as being unpatentable over Aronowitz (U.S. Patent Publication 2005/0010412) in view of Bickley et al. (U.S. Patent Publication 2003/0069729) as applied to claims 1 and 12 above, and further in view of He et al. (U.S. Patent Publication 2006/0287856).
Concerning claims 4 and 15, Aronowitz arguably discloses the limitations of these claims directed to “generating a recognized corpus comprising, for each recognized utterance, at least one decoded phoneme sequence and at least one true phoneme sequence” and “generating the confusability measure from the recognized corpus.”  Here, Applicants’ claim language defines “a recognized corpus” as “a decoded phoneme sequence” and “a true phoneme sequence”.  Specifically, Aronowitz discloses that a phoneme confusion matrix may be trained from a database that comprises both correct phoneme sequences and their corresponding phoneme sequences outputted from a phoneme lattice search mechanism.  (¶[0023]: Figure 3(b))  Phoneme lattice constructor 830 may accept speech signals from training database 820 and construct phoneme lattices for these speech signals.  The training database 820 may comprise speech signals and actual phoneme sequences for these signals.  (¶[0033]: Figure 8)  Confusion matrix updater 860 may comprise a confusion probability estimator to estimate confusion probabilities between phonemes based on statistics obtained from force-aligned comparisons between actual and hypothetical phoneme sequences.  Aronowitz, then, discloses a training database 820 that provides a “corpus” of utterances and actual or correct (“true”) phoneme sequences, and then a confusion matrix training mechanism calculates a confusion matrix by comparing hypothetical phoneme sequences obtained as phoneme lattices from a phoneme lattice constructor 830 and these actual or correct (“true”) phoneme sequences.  The only element of these claims that is not expressly disclosed by Aronowitz is that a hypothetical phoneme sequence obtained as a phoneme lattice is a “decoded” phoneme sequence.  Still, it is reasonable to conclude that a hypothetical phoneme sequence is a “decoded” phoneme sequence because a phoneme lattice is constructed from speech signals.  
That is, a phoneme lattice can be considered a “decoded” version of the speech signals obtained by techniques of speech recognition.
Concerning claims 4 and 15, even if this element of a “decoded” phoneme sequence is omitted by Aronowitz, it is taught by He et al.  Generally, He et al. teaches training speech models by identifying a distance between an actual recognition result and a recognition result that is known to be correct.  Given a correct phone/event transcription, a distance is minimized between the actual and known, correct recognition results.  (¶[0016])  A decision boundary is located between two models 304 and 306 for phones ‘a’ and ‘o’, so that training data representing a senone is far enough away from data representing other confusable senones.  (¶[0056])  Training data includes a set of utterances, and a transcript of an utterance is expanded to a senone sequence.  An initial training component 352 receives training data, where the training data can be a feature vector sequence representing an utterance 362 and a correct transcription 364 of those feature vectors.  (¶[0059] - ¶[0060]: Figure 4)  Specifically, a feature vector He et al., then, provides a training mechanism similar to Aronowitz, where a recognized transcription is aligned to a correct or “true” transcription to perform training, but teaches that this recognized transcription can be designated as a “decoded” transcription because it is obtained by recognizing speech using a speech recognition decoder 354.  An objective is to train models by minimizing a distance between an actual recognition result and a recognition result that is known to be correct to better align with actual training data.  (¶[0016])  It would have been obvious to one having ordinary skill in the art that a hypothetical phoneme sequence represented by a phoneme lattice of Aronowitz can be designated as a “decoded phoneme sequence” as taught by He et al. for a purpose of minimizing a distance between actual and correct recognition results to better align with actual training data.

Concerning claims 5 and 16, Aronowitz discloses “recognizing, from a corpus comprising a set of utterances with corresponding transcriptions, at least one utterance to yield a recognized utterance comprising at least one decoded frame sequence”.  An input signal is segmented into frames (“frame sequences”).  (¶[0020] - ¶[0021]: Figures 1 to 2)  A phoneme lattice constructor 830 may accept speech signals from training database 820 and construct phoneme lattices for these speech signals, where training database 820 may comprise actual phoneme sequences (“corresponding transcriptions”).  (¶[0033]: Figure 8)  Implicitly, a phoneme lattice is “a recognized utterance” and “a decoded frame sequence”.  That is, a phoneme lattice is obtained by performing speech recognition to ‘decode’ a speech signal and obtain a transcription of that speech signal.  Specifically, He et al. teaches that an utterance is segmented into short frames, a feature vector is generated for each frame, and a feature vector sequence 362 representing an utterance is input into speech recognition decoder 354 to output a recognized transcription (“recognizing, from a corpus comprising a set of utterances . . . at least one utterance comprising at least one decoded frame sequence”).  (¶[0059] and ¶[0062]: Figures 4 to 5)  Moreover, Aronowitz discloses “coalescing identical sequential phonemes” because repetition of phonemes may be allowed when searching a phoneme lattice.  (¶[0018])  A path score may be adjusted by allowing a repetition of phonemes, e.g., a phoneme sequence ‘d-d-ay-l-l’ may be interpreted as a word ‘dial’, even though a correct phoneme representation of ‘dial’ is ‘d-ay-l’.  Allowing repetition of phonemes may help solve a problem with a phoneme lattice, where a phoneme with a long duration is broken into repetitions of the same phoneme but with a shorter duration.  (¶[0032]: Figure 7: Step 740)  Here, taking into 
Concerning claims 6 to 7 and 17 to 18, Aronowitz discloses that a phoneme lattice may provide multiple phoneme sequence representations for an input speech signal.  A phoneme lattice may be searched to produce at least one candidate textual representation of an input speech signal to determine how likely that input speech signal contains targeted keywords.  A plurality of models may be used to help search among multiple phoneme sequences.  An output result may be a single best textual representation or a plurality of top best textual representations of the input speech signal.  (¶[0021]: Figure 2)  Phoneme path estimator 420 may estimate a plurality of phoneme paths ending at a frame.  Phoneme path estimator 420 may comprise a likelihood score evaluator to evaluate a likelihood score for each phoneme path ending at the frame.  Global score evaluator 430 may evaluate the K phoneme path hypotheses found by phoneme path estimator 420, globally.  The global score evaluator may comprise a score computing component to compute a global score for each of the K phoneme path hypotheses.  (¶[0025] - ¶[0026]: Figure 4)  Likelihood scores for all potential phoneme paths leading to a frame may be sorted and phoneme paths corresponding to the top K likelihood scores may be selected as the K-best phoneme paths for the frame.  Global scores may be computed for K-best phoneme paths leading to a frame.  (¶[0030]: Figure 6)  Aronowitz, then, discloses “producing a plurality of decodings for each recognized utterance”, where a phoneme lattice represents a plurality of phoneme paths, or “decodings”.  These K phoneme path hypotheses are a 
Concerning claim 8, Aronowitz discloses “a corpus comprising a set of utterances and corresponding transcriptions” and “including at least one true phoneme sequence” because training database 820 may comprise speech signals and actual phoneme sequences for these speech signals.  (¶[0033]: Figure 8)  These actual phoneme sequences are “at least one true phoneme sequence” and are “corresponding transcriptions” of the speech signals (“a set of utterances”).  Similarly, He et al. teaches a true or correct transcription 364.  (¶[0063] - ¶[0065])  This true transcription represents “at least one true phoneme sequence” corresponding to an utterance recognized by speech recognition decoder 354.  Implicitly, a transcription includes “determining . . .  at least one pronunciation”.  Specifically, Bickley et al. teaches that similarities of pronunciation of words, e.g., ‘Jill’ and ‘Phil’, reflect confusion of spoken phrases, and that similarities in pronunciation can be more apparent from comparisons between phonetic transcriptions of utterances.  (¶[0008] and ¶[0010])  Here, Applicants’ claim language only requires one pronunciation, and a pronunciation of words of a transcription is taught by Bickley et al.
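For illustration only, the ‘d-d-ay-l-l’ example noted above for “coalescing identical sequential phonemes” can be sketched as follows.  This is not code from Aronowitz.  The function name coalesce is hypothetical, and collapsing every run of adjacent duplicates is an assumed reading of the limitation.

```python
from itertools import groupby

def coalesce(phonemes):
    """Collapse each run of identical adjacent phonemes into one phoneme."""
    return [phoneme for phoneme, _run in groupby(phonemes)]

# A long-duration phoneme split into repeated short segments collapses back.
print(coalesce("d-d-ay-l-l".split("-")))  # ['d', 'ay', 'l']
```

Only adjacent repeats are merged, so a phoneme that legitimately recurs later in a sequence is kept.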

Claim 9 is rejected under pre-AIA  35 U.S.C. 103(a) as being unpatentable over Aronowitz (U.S. Patent Publication 2005/0010412) in view of Bickley et al. (U.S. Patent Publication 2003/0069729) and He et al. (U.S. Patent Publication 2006/0287856) as applied to claims 1, 4, and 8 above, and further in view of Schneider (U.S. Patent No. 7,974,843).
Bickley et al. teaches “at least one pronunciation”, and uses a minimum string edit distance algorithm to obtain a distance between strings of confusable phonemes.  (¶[0008] and ¶[0010]; ¶[0044] - ¶[0045])  However, Bickley et al. omits “wherein determining at least one pronunciation for each transcription comprises any of steps of: for each word of the transcription, utilizing the most popular pronunciation; for each word of the transcription, utilizing a pronunciation selected at random; for each word of the transcription, utilizing the pronunciation that is closest by string edit distance to the at least one decoded phoneme sequence for the respective word within the said at least one decoded phoneme sequence; or for each word of the transcription, utilizing each of the plurality of pronunciations from the set of all pronunciations of the word.”  
Still, Schneider teaches automatic speech recognition for recognition of words in different languages that determines phonetic transcripts for words in N various languages in order to obtain N phonetic sequences per word corresponding to N pronunciations corresponding to N pronunciation variants.  A language recognition vocabulary is created with the N phoneme sequences per word for a language recognizer.  (Column 2, Lines 16 to 32)  Suitable distances, particularly a Levenshtein distance, are used to classify and analyze pronunciation variants of the N phoneme sequences for each word.  The N phoneme sequences are reduced to a few, preferably two or three phoneme sequences so that the phoneme sequences that are least similar as pronunciation variants are omitted.  (Column 3, Line 60 to Column 4, Line 11)  Here, Schneider teaches at least an alternative directed to “for each word of each transcription, utilizing the pronunciation that is closest by string edit distance” because only pronunciation variants corresponding to closest distances are retained for a Bickley et al.  Additionally, Schneider teaches at least an alternative directed to “for each word of each transcription, utilizing each of a plurality of pronunciations from the set of all pronunciations of the word” because all of the N pronunciation variants are at least initially utilized by Schneider.  An objective is to enable phoneme sequences to be created from different languages in multilingual systems for speech recognition.  (Column 1, Lines 64 to 67)  It would have been obvious to one having ordinary skill in the art to determine pronunciations of transcriptions as taught by Schneider to generate a confusion matrix from phoneme sequences of Aronowitz for a purpose of providing phoneme sequences for different languages in multilingual systems for speech recognition.
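For illustration only, the alternative of “utilizing the pronunciation that is closest by string edit distance” can be sketched with a unit-cost Levenshtein distance over phoneme sequences.  This is not code from Schneider or Bickley et al.  The function names, the unit costs, and the example phoneme strings are assumptions made for this sketch, and neither reference is represented as using these exact costs.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (assumed unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]

def closest_pronunciation(pronunciations, decoded):
    """Pick the candidate pronunciation nearest to the decoded phoneme sequence."""
    return min(pronunciations, key=lambda p: edit_distance(p, decoded))

# Hypothetical pronunciation variants for one word and one decoded sequence.
variants = [["t", "ah", "m", "ey", "t", "ow"], ["t", "ah", "m", "aa", "t", "ow"]]
decoded = ["t", "ah", "m", "aa", "t", "ow"]
print(closest_pronunciation(variants, decoded))
```

Under these assumed costs, illustrative phoneme strings for ‘Jill’ and ‘Phil’ such as jh-ih-l and f-ih-l differ by a single substitution, which is consistent with treating such pairs as confusable.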

Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicants’ disclosure.
Printz et al. (U.S. Patent No. 10,748,527) is Applicants’ parent patent.
Holzapfel, Harengel et al., and Yao disclose related prior art.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER whose telephone number is (571) 272-7608.  The examiner can normally be reached on Monday-Thursday 8:30 AM-6:00 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached on (571) 272-5551.  The fax phone 
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/MARTIN LERNER/Primary Examiner
Art Unit 2657
October 21, 2021