DETAILED ACTION
This Office Action is in response to the correspondence filed by the applicant on 4/27/2022.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Receipt is acknowledged of certified copies of papers submitted under 35 U.S.C. 119(a)-(d), which papers have been placed of record in the file.

Response to Arguments
Applicant’s arguments with respect to the rejections of Claims 1-5, 8-19, and 21-22 under 35 U.S.C. 102(a)(1) have been fully considered, but they are not persuasive.
On page 9, the Applicant asserts, “Nevertheless, as established below, the outstanding rejection(s) should be withdrawn because Printz fails to disclose, at least, the feature of "disclose ‘obtaining, from the received speech data ... a phonetic symbol sequence associated with a pronunciation of a target word included in the received speech data, and an identifier pair indicating a start and an end of the phonetic symbol sequence ... the identifier pair indicating a category of the target word,’ as recited in amended independent claim 1.” However, the Examiner respectfully disagrees.
PRINTZ clearly discloses obtaining, from the received speech data ... a phonetic symbol sequence (PRINTZ Fig. 39; Par 359 – “It should be noted that as previously discussed, and as will be known to those skilled in the art, the phonemes used in the adaptation objects, which are derived from the primary recognizer output, may be the context-independent phonemes typically associated with baseforms, the specific context-dependent phonemes decoded by the primary recognizer, if these are present in the primary recognizer output, or some admixture of the two. Accordingly, all such embodiments, whether they use context-independent or context-dependent phonemes, are also included within the scope of the invention.”) associated with a pronunciation of a target word included in the received speech data (PRINTZ Fig. 39; Par 353 – “As before the line in FIG. 39 labeled primary recognizer output (baseforms) shows the sequence of baseforms for the whole utterance decoded by the primary recognizer; immediately beneath this the line primary recognizer output (phonemes) shows the actual phoneme sequence corresponding to each indicated baseform.”; Par 62 – “A “baseform” refers to a triple that associates: (1) a word as a lexical object (that is, a sequence of letters as a word is typically spelled); (2) an index that can be used to distinguish many baseforms for the same word from one another; and (3) a pronunciation for the word, comprising a sequence of phonemes. A given word may have several associated baseforms, distinguished by their pronunciation.”), and an identifier pair indicating a start and an end of the phonetic symbol sequence (PRINTZ Figs. 12 and 13 – “Acoustic span [1520, 1750]”; Fig. 16 – “C – detailsApnrCommand[<business name span>[2100,2600]]”; Par 89 – “A “span” is a contiguous section of the input utterance, identified by its start time and end time within the whole of the input utterance (hereafter called the “span extent”), hypothesized to comprise a proper name entity, and labeled with the putative type of this entity (hereafter called the “span type”).”; Par 146 – “This yields a transcription of the utterance (and possibly alternate transcriptions as well) in the vocabulary of the open dictation recognizer, plus nominal start and end times for each transcribed word.”; Par 148 – “The output of the primary recognition step, comprising (1) a nominal transcription, (2) the start time and end time within the waveform of each transcribed word at the granularity of a single frame and (3) possibly other information, described further below, of use in determining the extent and type of any acoustic spans, may then be passed to the understanding step.”; Par 216 – “FIG. 13 is an example of a second hypothesis breakdown based upon the third proposal 1115 c in the example of FIG. 11 as may occur in some embodiments. With regard to the third proposal 1115 c, the system may recognize that the “Guddu de Karahi” portion between 1220 ms and 1750 ms could not be recognized. Accordingly, a hypothesis 1205 having an acoustic span between 1220 ms and 1750 ms may be generated, and the prefix and suffix portions adjusted accordingly. The NLU may again identify the proper noun as a “Business Name” in the Putative Type but may instead consider the general inquiry as an “Address Book” query, limiting the search to only the address book contents.”; Par 353 – “Note the two nested structures of epsilon arcs—those labeled εls″, εls′ and εls within the left shim and εrs, εrs′ and εrs″ within the right shim. These yield the desired property of freeing the secondary recognizer to match or exclude from matching only portions of the waveform corresponding to contiguous sequences of left shim phonemes or right shim phonemes, thereby permitting the secondary recognizer to expand the now-reduced target acoustic span, at the granularity of an individual phoneme, to obtain the best possible match of the target section. Moreover, to find this best possible match, only the phonemes that appeared in the primary recognizer output, in the order in which they appeared, need be considered.”) ... the identifier pair indicating a category of the target word (PRINTZ Figs. 12 and 13 – “Putative type of span: “Business Name””; Fig. 16 – “C – detailsApnrCommand[<business name span>[2100,2600]]”; Par 89 – “A “span” is a contiguous section of the input utterance, identified by its start time and end time within the whole of the input utterance (hereafter called the “span extent”), hypothesized to comprise a proper name entity, and labeled with the putative type of this entity (hereafter called the “span type”).”; Par 92 – “A “span type” is the putative type of the proper name entity believed to be present within the span; thus a personal name, business name, numbered street address, etc.”).
For more details, see the rejections below.
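For illustration only, the "span" of PRINTZ Par 89 (a start time, an end time, and a putative type) can be pictured as a small data structure applied to the word timings described in Pars 146 and 148. The sketch below is not taken from PRINTZ or from the application: the class names, the helper `words_in_span`, and the word timings are hypothetical, and only the extent [2100, 2600] and the "business name" type echo the Fig. 16 example quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedWord:
    """A transcribed word with nominal start and end times (hypothetical values)."""
    text: str
    start_ms: int
    end_ms: int


@dataclass(frozen=True)
class Span:
    """A contiguous section of the utterance: extent (start, end) plus putative type."""
    start_ms: int
    end_ms: int
    span_type: str


def words_in_span(words, span):
    """Return the transcribed words that lie entirely inside the span extent."""
    return [w for w in words
            if w.start_ms >= span.start_ms and w.end_ms <= span.end_ms]


# Hypothetical primary-recognizer output with made-up timings.
words = [
    TimedWord("details", 1500, 1900),
    TimedWord("for", 1900, 2100),
    TimedWord("good", 2100, 2350),
    TimedWord("karahi", 2350, 2600),
]

# The start/end pair delimits the section; the type labels its category.
span = Span(start_ms=2100, end_ms=2600, span_type="business name")
covered = words_in_span(words, span)
```

In this picture the start and end values delimit the section of the utterance, and the type carried with them names its category.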

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees.  A nonstatutory double patenting rejection is appropriate where the claims at issue are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); and In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on a nonstatutory double patenting ground provided the reference application or patent either is shown to be commonly owned with this application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO internet Web site contains terminal disclaimer forms which may be used.  Please visit http://www.uspto.gov/forms/.  The filing date of the application will determine what form should be used.  A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission.  For more information about eTerminal Disclaimers, refer to http://www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.  

Claims 1-22 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over Claims 1-20 of co-pending Application No. 15/931,949. Although the claims at issue are not identical, they are not patentably distinct from each other because the claims of the instant application are rejected as being unpatentable over the claims of the co-pending application. Please see below for the mapping in the table, where the bolded limitations indicate the corresponding limitations between the co-pending application and instant application.
Instant application, Claim 1:
1. A speech recognition method comprising:
receiving speech data;
obtaining, from the received speech data, a candidate text including at least one word, a phonetic symbol sequence associated with a pronunciation of a target word included in the received speech data, and an identifier pair indicating a start and an end of the phonetic symbol sequence, using a speech recognition model, the identifier pair indicating a category of the target word;
replacing the phonetic symbol sequence with a replacement word corresponding to the phonetic symbol sequence and the category; and
determining a target text corresponding to the received speech data based on a result of the replacing.

Co-pending Application No. 15/931,949 (amended on 5/4/2022), Claim 1:
1. A processor-implemented method comprising:
performing, by a processor, speech recognition of a speech signal;
generating, by a processor, a plurality of first candidate sentences as a result of the performing of the speech recognition; identifying, by a processor, a respective named entity in each of the plurality of first candidate sentences; determining, by a processor, a standard expression corresponding to the identified respective named entity using phonemes of the corresponding named entity;
determining, by a processor, whether to replace the identified named entity in each of the plurality of first candidate sentences with the determined standard expression based on a similarity between the named entity and the standard expression corresponding to the named entity, and
determining a plurality of second candidate sentences based on a result; and outputting, by a processor, a final sentence selected from the plurality of second candidate sentences.
determining a standard expression corresponding to each respective named entity identified in the identifying of the named entity based on an inverted index search performed using phonemes included in the corresponding named entity, wherein the inverted index search is a term frequency-inverse document frequency (TF-IDF)-based search.

With respect to the other claims, each claim either maps to a corresponding one of Claims 1-20 of the co-pending application or falls within the scope of the independent claim.
This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.


Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-3, 5, 8-19, and 21-22 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by PRINTZ (US 2017/0133010 A1).

REGARDING CLAIM 1, PRINTZ discloses a speech recognition method comprising: 
receiving speech data (PRINTZ Fig. 10 – “Receive utterance waveform from user 1005”; Par 204 – “FIG. 10 is a flow diagram depicting various steps in a proper name recognition process as may occur in some embodiments. At block 1005, the system may receive an utterance waveform from a user.”); 
obtaining, from the received speech data, a candidate text (PRINTZ FIG. 39 – “PRIMARY RECOGNIZER OUTPUT (baseforms): send(01) a(02) message(01) tupac(01) shakur(01) you(03) coming(01) tonight(01)”) including at least one word (PRINTZ Fig. 10 – “Apply modified open dictation ASR to acquire textual representation and ASR confidence values to generate results D”; Par 205 – “At block 1010, a “standard” open dictation ASR may be applied to the waveform. This may produce a complete textual word for every aspect of the waveform, even when the confidence levels are exceptionally low. However, some embodiments further contemplate applying a modified version of the open dictation ASR to the waveform to achieve one or more textual readings that explicitly identify words that may reflect proper names (e.g., based on the highest possible confidence level for a word still failing to exceed a threshold). These modified systems may indicate placeholder words for the potential proper names (e.g., fna, lna, and sa designations as discussed herein). Block 1010 may roughly correspond to the “Primary Recognition” step 905. Block 1020 may roughly correspond to the “Understanding” step 910.”), a phonetic symbol sequence (PRINTZ Fig. 39; Par 359 – “It should be noted that as previously discussed, and as will be known to those skilled in the art, the phonemes used in the adaptation objects, which are derived from the primary recognizer output, may be the context-independent phonemes typically associated with baseforms, the specific context-dependent phonemes decoded by the primary recognizer, if these are present in the primary recognizer output, or some admixture of the two. Accordingly, all such embodiments, whether they use context-independent or context-dependent phonemes, are also included within the scope of the invention.”) associated with a pronunciation of a target word included in the received speech data (PRINTZ Fig. 39; Par 353 – “As before the line in FIG. 39 labeled primary recognizer output (baseforms) shows the sequence of baseforms for the whole utterance decoded by the primary recognizer; immediately beneath this the line primary recognizer output (phonemes) shows the actual phoneme sequence corresponding to each indicated baseform.”; Par 62 – “A “baseform” refers to a triple that associates: (1) a word as a lexical object (that is, a sequence of letters as a word is typically spelled); (2) an index that can be used to distinguish many baseforms for the same word from one another; and (3) a pronunciation for the word, comprising a sequence of phonemes. A given word may have several associated baseforms, distinguished by their pronunciation.”), and an identifier pair indicating a start and an end of the phonetic symbol sequence (PRINTZ Figs. 12 and 13 – “Acoustic span [1520, 1750]”; Fig. 16 – “C – detailsApnrCommand[<business name span>[2100,2600]]”; Par 89 – “A “span” is a contiguous section of the input utterance, identified by its start time and end time within the whole of the input utterance (hereafter called the “span extent”), hypothesized to comprise a proper name entity, and labeled with the putative type of this entity (hereafter called the “span type”).”; Par 146 – “This yields a transcription of the utterance (and possibly alternate transcriptions as well) in the vocabulary of the open dictation recognizer, plus nominal start and end times for each transcribed word.”; Par 148 – “The output of the primary recognition step, comprising (1) a nominal transcription, (2) the start time and end time within the waveform of each transcribed word at the granularity of a single frame and (3) possibly other information, described further below, of use in determining the extent and type of any acoustic spans, may then be passed to the understanding step.”; Par 216 – “FIG. 13 is an example of a second hypothesis breakdown based upon the third proposal 1115 c in the example of FIG. 11 as may occur in some embodiments. With regard to the third proposal 1115 c, the system may recognize that the “Guddu de Karahi” portion between 1220 ms and 1750 ms could not be recognized. Accordingly, a hypothesis 1205 having an acoustic span between 1220 ms and 1750 ms may be generated, and the prefix and suffix portions adjusted accordingly. The NLU may again identify the proper noun as a “Business Name” in the Putative Type but may instead consider the general inquiry as an “Address Book” query, limiting the search to only the address book contents.”; Par 353 – “Note the two nested structures of epsilon arcs—those labeled εls″, εls′ and εls within the left shim and εrs, εrs′ and εrs″ within the right shim. These yield the desired property of freeing the secondary recognizer to match or exclude from matching only portions of the waveform corresponding to contiguous sequences of left shim phonemes or right shim phonemes, thereby permitting the secondary recognizer to expand the now-reduced target acoustic span, at the granularity of an individual phoneme, to obtain the best possible match of the target section. Moreover, to find this best possible match, only the phonemes that appeared in the primary recognizer output, in the order in which they appeared, need be considered.”), using a speech recognition model (PRINTZ Fig. 39; Par 359 – “It should be noted that as previously discussed, and as will be known to those skilled in the art, the phonemes used in the adaptation objects, which are derived from the primary recognizer output, may be the context-independent phonemes typically associated with baseforms, the specific context-dependent phonemes decoded by the primary recognizer, if these are present in the primary recognizer output, or some admixture of the two. Accordingly, all such embodiments, whether they use context-independent or context-dependent phonemes, are also included within the scope of the invention.”), the identifier pair indicating a category of the target word (PRINTZ Figs. 12 and 13 – “Putative type of span: “Business Name””; Fig. 16 – “C – detailsApnrCommand[<business name span>[2100,2600]]”; Par 89 – “A “span” is a contiguous section of the input utterance, identified by its start time and end time within the whole of the input utterance (hereafter called the “span extent”), hypothesized to comprise a proper name entity, and labeled with the putative type of this entity (hereafter called the “span type”).”; Par 92 – “A “span type” is the putative type of the proper name entity believed to be present within the span; thus a personal name, business name, numbered street address, etc.”); 
replacing the phonetic symbol sequence (PRINTZ Fig. 39 – “target baseforms and phonemes TUWPAOK SHAAKER”) with a replacement word corresponding to the phonetic symbol sequence (PRINTZ Fig. 39 – “target section match pak shak [c_id=2]”; Par 327 – “One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.” Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence. It will be apparent to one skilled in the art that this language model score may itself be conditioned upon one or more of: the putative command type, the putative span type, the putative span decoding, the location of the phoneme sequence with respect to the target span (viz., immediately preceding or immediately following the target span), one or more adjacent decoded words, or other known or hypothesized characteristics of the utterance.”; Par 354 – “This functionality is illustrated in the FIG. 39 graphic labeled ww-3 -lsp3 -rsp3 -3-slotted-contact-name.g: (decoding path). Reading left to right, the exhibited decoding path first comprises a forced alignment of the ww prefix section against the ww prefix acoustic span; this of course is required by the grammar structure. Next, the secondary recognizer chooses the path T->UW->εla->pak shak [c_id=2]-> εrs′->ER.”) and the category (PRINTZ Par 234 – “Following “Primary Recognition” 905, the system has identified not only the start and end time of each such segment, but the likely type of the name in question—that is, a person's first name, a person's last name, a street name, and so on. A specialized grammar may be used for each such name type.”; Par 327 – “…. replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.” Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence. …It will be apparent to one skilled in the art that this language model score may itself be conditioned upon one or more of: the putative command type, the putative span type, the putative span decoding, the location of the phoneme sequence with respect to the target span (viz., immediately preceding or immediately following the target span), one or more adjacent decoded words, or other known or hypothesized characteristics of the utterance.”); and 
determining a target text corresponding to the received speech data based on a result of the replacing (PRINTZ Fig. 39 – “send(01) a(02) message(01) .. pak shak … ”; Par 316 – “This, in turn, frees the secondary recognizer to find the best acoustic match, permitted by the target section, to the now effectively reduced target acoustic span. As shown in the graphic labeled ww-3-ls2-rs1-3-slotted-contact-name.g: (decoding path) this best acoustic match is the literal sequence “pak shak.” The secondary recognizer traverses the target section via this arc, thereby yielding the final (correct) transcription “send a message to pak shak are you coming tonight SIL,” along with the semantic meaning variable assignment c_id=2.”).
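
For illustration only, the steps recited in Claim 1 can be pictured with the following minimal sketch. It is not the applicant's implementation and is not taken from PRINTZ. It assumes a candidate text in which a hypothetical tag pair such as `<contact>` and `</contact>` marks the start and end of a phonetic symbol sequence, with the tag name standing in for the category; the dictionaries, the phoneme strings, and the function name `replace_tagged_sequences` are likewise invented for the example.

```python
import re

# Hypothetical per-category dictionaries: phonetic symbol sequence -> replacement word.
DICTIONARIES = {
    "contact": {"P AE K SH AE K": "Pak Shak"},
    "business": {"G UH D UW": "Guddu"},
}

# Hypothetical tag format: the opening and closing tags mark the start and end
# of the phonetic symbol sequence, and the tag name indicates the category.
TAGGED = re.compile(r"<(?P<category>\w+)>(?P<phonemes>.*?)</(?P=category)>")


def replace_tagged_sequences(candidate_text, dictionaries):
    """Replace each tagged phonetic sequence using the dictionary of its category."""

    def substitute(match):
        category = match.group("category")
        phonemes = match.group("phonemes").strip()
        # Keep the raw phonemes when the category dictionary has no entry.
        return dictionaries.get(category, {}).get(phonemes, phonemes)

    return TAGGED.sub(substitute, candidate_text)


candidate_text = (
    "send a message to <contact>P AE K SH AE K</contact> are you coming tonight"
)
target_text = replace_tagged_sequences(candidate_text, DICTIONARIES)
print(target_text)
```

The tag pair plays the role of the identifier pair in this sketch: it delimits the sequence to be replaced and selects which dictionary is consulted.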

REGARDING CLAIM 2, PRINTZ discloses the speech recognition method of claim 1, wherein the at least one word comprises at least one subword (PRINTZ Fig. 39 – “target baseforms and phonemes TUWPAOK SHAAKER”; Par 354 – “This functionality is illustrated in the FIG. 39 graphic labeled ww-3 -lsp3 -rsp3 -3-slotted-contact-name.g: (decoding path). Reading left to right, the exhibited decoding path first comprises a forced alignment of the ww prefix section against the ww prefix acoustic span; this of course is required by the grammar structure. Next, the secondary recognizer chooses the path T->UW->εla->pak shak [c_id=2]-> εrs′-> ER.”; Par 359 – “But unlike previous embodiments, the left shim and right shim each now contain a linear sequence of three arcs, respectively labeled with the first three and last three phonemes output by the primary recognizer, for the putative target span. Thus, the arcs of the left shim are labeled “T” “UW” “P” and the arcs of the right shim are labeled “AA” “K” “ER”. The choice of three phonemes for each shim is arbitrary and reflects a design that is known to work well in practice. Designs with a larger or smaller number of phonemes are possible and also fall within the scope of the invention, as do designs with differing numbers of phonemes in the left and right shims.”), and the candidate text (PRINTZ Fig. 39 – “time [300 … [1145  [1272  [525 … […. 2900 …; primary recognizer output (baseforms) .. send(01) …..tonight SIL; primary recognizer output (phonemes) … target baseforms and phonemes: SEHND ….. TUWPAOK SHAAKER …”) comprises the at least one subword (PRINTZ Fig. 39 – “TUWPAOK SHAAKER”), and the phonetic symbol sequence (PRINTZ Fig. 39 – “TUWPAOK SHAAKER”).

REGARDING CLAIM 3, PRINTZ discloses the speech recognition method of claim 2, wherein the replacing comprises: replacing, with the replacement word, the phonetic symbol sequence identified by the identifier pair (PRINTZ Fig. 39 – “target section match pak shak [c_id=2]”; Par 327 – “One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.”; Par 354 – “This functionality is illustrated in the FIG. 39 graphic labeled ww-3 -lsp3 -rsp3 -3-slotted-contact-name.g: (decoding path). Reading left to right, the exhibited decoding path first comprises a forced alignment of the ww prefix section against the ww prefix acoustic span; this of course is required by the grammar structure. Next, the secondary recognizer chooses the path T->UW->εla->pak shak [c_id=2]->εrs′->ER.”; Par 89 – “A “span” is a contiguous section of the input utterance, identified by its start time and end time within the whole of the input utterance (hereafter called the “span extent”), hypothesized to comprise a proper name entity, and labeled with the putative type of this entity (hereafter called the “span type”).”).


REGARDING CLAIM 5, PRINTZ discloses the speech recognition method of claim 1, further comprising: 
determining the replacement word corresponding to the phonetic symbol sequence using dictionary data (PRINTZ Fig. 39 – “target section match pak shak [c_id=2]”; Par 327 – “One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.” Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence. It will be apparent to one skilled in the art that this language model score may itself be conditioned upon one or more of: the putative command type, the putative span type, the putative span decoding, the location of the phoneme sequence with respect to the target span (viz., immediately preceding or immediately following the target span), one or more adjacent decoded words, or other known or hypothesized characteristics of the utterance.”; Par 374 – “For example the command “look up José Altuve's stats” may be misrecognized “look up José I'll to the stats.” But this misrecognition may be easily corrected by a secondary recognizer for which the vocabulary is narrowed to only Major League Baseball™ players when decoding the acoustic span corresponding to the words “José I'll to the” as emitted by the primary recognizer”) including information associated with a plurality of words and phonetic symbol sequences respectively corresponding to the words (PRINTZ Par 97 – “A “vocabulary” is, informally, a list of the words with associated pronunciations, which forms part of the input to an ASR system, and which defines the words that could in principle be recognized by such a system. Formally, the term may refer to a list of baseforms. Also sometimes called a “lexicon.””; Par 62 – “(3) a pronunciation for the word, comprising a sequence of phonemes.”; Par 132 – “Compilation may involve (1) obtaining one or more pronunciations for each indicated word in the grammar (this may typically be done by first searching a vocabulary, but if this search fails any required pronunciations may be automatically generated by a “grapheme to phoneme” or “g2p” processing module, which applies the standard rules of English language pronunciation to the given word spelling to produce one or more plausible pronunciations),”; Par 349 – “Indeed, we will shortly consider another variant, wherein the grammar arcs are labeled with, or equivalently the slotted grammar slots are populated with, individual phonemes drawn from the secondary recognizer's phoneme alphabet. For economy of reference in the discussion we will sometimes refer to the different objects that may be used as arc labels or slot contents—which are words (specifically literals), baseforms and phonemes—as “grammar labels” or simply “labels.””).
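
For illustration only, replacing a phoneme sequence with "the closest matching word or words present in the lexicon" (PRINTZ Par 327) can be pictured as a nearest-neighbor lookup over a list of words and their pronunciations (Par 97). The sketch below is an assumption, not PRINTZ's method: PRINTZ does not state how closeness is measured, so Levenshtein edit distance over phoneme symbols is swapped in here, and the lexicon entries, pronunciations, and function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute x with y
        prev = curr
    return prev[-1]


# Hypothetical lexicon: word -> pronunciation as a list of phoneme symbols.
LEXICON = {
    "are": ["ER"],
    "you": ["Y", "UW"],
    "coming": ["K", "AH", "M", "IH", "NG"],
}


def closest_word(phonemes, lexicon):
    """Return the lexicon word whose pronunciation is nearest to the phonemes."""
    return min(lexicon, key=lambda word: edit_distance(phonemes, lexicon[word]))


exact = closest_word(["ER"], LEXICON)      # identical pronunciation in the lexicon
near = closest_word(["Y", "UH"], LEXICON)  # one substitution away from "you"
```

A category-specific dictionary, as discussed for the proper-name spans, would simply be a different `lexicon` argument passed to the same lookup.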


REGARDING CLAIM 8, PRINTZ discloses the speech recognition method of claim 4, further comprising: 
determining the replacement word corresponding to the phonetic symbol sequence included in the candidate text (PRINTZ Fig. 39 – “target section match pak shak [c_id=2]”; Par 327 – “One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.”; Par 354 – “This functionality is illustrated in the FIG. 39 graphic labeled ww-3 -lsp3 -rsp3 -3-slotted-contact-name.g: (decoding path). Reading left to right, the exhibited decoding path first comprises a forced alignment of the ww prefix section against the ww prefix acoustic span; this of course is required by the grammar structure. Next, the secondary recognizer chooses the path T->UW->εla->pak shak [c_id=2]-> εrs′->ER.”), using dictionary data (PRINTZ Par 327 – “lexicon”; Fig. 10 – “Apply grammar-based ASR with user-specific proper name dictionary 1050”; Fig. 16 – “businesses retrieved by previous query … -> business-names.g”; Par 374 – “For example the command “look up José Altuve's stats” may be misrecognized “look up José I'll to the stats.” But this misrecognition may be easily corrected by a secondary recognizer for which the vocabulary is narrowed to only Major League Baseball™ players when decoding the acoustic span corresponding to the words “José I'll to the” as emitted by the primary recognizer”; Par 132 – “This may be done by preparing a grammar, illustrated in graphical form in FIG. 2, that contains exactly these names, and compiling it into a binary form so that it is ready for use by the secondary recognizer. This operation, which may typically take a few hundred milliseconds, may be performed immediately upon receiving from Yelp® the list of names to be shown on the tablet display. Compilation may involve (1) obtaining one or more pronunciations for each indicated word in the grammar (this may typically be done by first searching a vocabulary, but if this search fails any required pronunciations may be automatically generated by a “grapheme to phoneme” or “g2p” processing module, which applies the standard rules of English language pronunciation to the given word spelling to produce one or more plausible pronunciations), (2) creating a computational structure that permits words to be decoded only in the order allowed by the grammar, (3) attaching to this structure operations to be performed on indicated meaning variables when a given decoding is obtained (which may typically comprise assigning values to these variables), and (4) emitting this structure in such form that it may be immediately loaded by a suitable grammar-based ASR system and used to guide its decoding of audio input. This compiled grammar, denoted “business-names.g” in FIG. 3, may be labeled with its type (in this case effectively business-names) and held for possible future use. In some embodiments, this comprises the adaptation object generation step.”) corresponding to the category indicated by the identifier pair (PRINTZ Figs. 12 and 13 – “Acoustic span [1520, 1750]”; Fig. 16 – “C – detailsApnrCommand[<business name span>[2100,2600]]”; Par 89 – “A “span” is a contiguous section of the input utterance, identified by its start time and end time within the whole of the input utterance (hereafter called the “span extent”), hypothesized to comprise a proper name entity, and labeled with the putative type of this entity (hereafter called the “span type”).”; Par 92 – “A “span type” is the putative type of the proper name entity believed to be present within the span; thus a personal name, business name, numbered street address, etc.”; Par 374 – “For example the command “look up José Altuve's stats” may be misrecognized “look up José I'll to the stats.” But this misrecognition may be easily corrected by a secondary recognizer for which the vocabulary is narrowed to only Major League Baseball™ players when decoding the acoustic span corresponding to the words “José I'll to the” as emitted by the primary recognizer”) among sets of dictionary data corresponding to different categories (Fig. 10 – “Apply grammar-based ASR with user-specific proper name dictionary 1050”; Par 194 – “Thus, the server may infer the presence of proper names in the text as described below and prepare one or more hypotheses 815 for their resolution. The hypotheses 815 may be submitted to the client. The client may then identify proper name entities from the various components 840 a-e of the user device. For example, a GPS 840 a component may provide relevant street names near the user's location, an address book 840 c may store the user's 805 contacts, a search cache 840 d may reflect recent inquiries and operations performed by the user 805, and a calendar 840 b may reflect meetings and events associated with user 805. The content from one or more of these components may be considered when identifying proper name entities as discussed herein.”; Par 207 – “At block 1050, the system, e.g., the client device, may decode each probable first name segment against its first name grammar. Block 1050 may generally correspond to the “Secondary Recognition” 915 step. In the some embodiments, the “Secondary Recognition” 915 step reduces to little more than inserting the most likely grammar decoding result in the appropriate location in the text output by “Primary Recognition” 905 and/or “Understanding” 910 operations.”; Par 275 – “Upon receipt of the output of the understanding step, the adaptation object preparation step uses this information to construct the adaptation object, comprising the grammar ww-contact-name.g, as shown within FIG. 20. This is done by assembling the indicated sections, respectively the whole waveform prefix section, the target section, and the whole waveform suffix section, each as depicted in FIG. 20. … Note that this target section may have been constructed separately, at the time of registration of the user contact names. That the user contact name target section is incorporated into the adaptation object, as opposed to some other kind of target section (for example, registered business names, numbered street addresses within Menlo Park, Calif., geographically proximate business names, or some other type appropriate to a different instance of adaptive proper name recognition), is a consequence of the putative target span type user-contact-name provided by the language understanding step.”); and 
replacing the phonetic symbol sequence included in the candidate text (PRINTZ Fig. 39 –“target baseforms and phonemes TUWPAOK SHAAKER”) with the determined replacement word (PRINTZ Fig. 39 – “target section match pak shak [c_id=2]”; Par 327 – “One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.” Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence. It will be apparent to one skilled in the art that this language model score may itself be conditioned upon one or more of: the putative command type, the putative span type, the putative span decoding, the location of the phoneme sequence with respect to the target span (viz., immediately preceding or immediately following the target span), one or more adjacent decoded words, or other known or hypothesized characteristics of the utterance.”; Par 354 – “This functionality is illustrated in the FIG. 39 graphic labeled ww-3 -lsp3 -rsp3 -3-slotted-contact-name.g: (decoding path). Reading left to right, the exhibited decoding path first comprises a forced alignment of the ww prefix section against the ww prefix acoustic span; this of course is required by the grammar structure. 
Next, the secondary recognizer chooses the path T->UW->εla->pak shak [c_id=2]-> εrs′->ER.”), wherein each of the sets of dictionary data corresponding to the different categories comprises information associated with a phonetic symbol sequence corresponding to each of words (PRINTZ Par 97 – “A “vocabulary” is, informally, a list of the words with associated pronunciations, which forms part of the input to an ASR system, and which defines the words that could in principle be recognized by such a system. Formally, the term may refer to a list of baseforms. Also sometimes called a “lexicon.””; Par 62 – “(3) a pronunciation for the word, comprising a sequence of phonemes.”; Par 132 – “This may be done by preparing a grammar, illustrated in graphical form in FIG. 2, that contains exactly these names, and compiling it into a binary form so that it is ready for use by the secondary recognizer. This operation, which may typically take a few hundred milliseconds, may be performed immediately upon receiving from Yelp® the list of names to be shown on the tablet display. Compilation may involve (1) obtaining one or more pronunciations for each indicated word in the grammar (this may typically be done by first searching a vocabulary, but if this search fails any required pronunciations may be automatically generated by a “grapheme to phoneme” or “g2p” processing module, which applies the standard rules of English language pronunciation to the given word spelling to produce one or more plausible pronunciations), (2) creating a computational structure that permits words to be decoded only in the order allowed by the grammar, (3) attaching to this structure operations to be performed on indicated meaning variables when a given decoding is obtained (which may typically comprise assigning values to these variables), and (4) emitting this structure in such form that it may be immediately loaded by a suitable grammar-based ASR system and used to guide its decoding of audio input. 
This compiled grammar, denoted “business-names.g” in FIG. 3, may be labeled with its type (in this case effectively business-names) and held for possible future use. In some embodiments, this comprises the adaptation object generation step.”) in each of the categories (Fig. 10 – “Apply grammar-based ASR with user-specific proper name dictionary 1050”; Par 374 – “For example the command “look up José Altuve's stats” may be misrecognized “look up José I'll to the stats.” But this misrecognition may be easily corrected by a secondary recognizer for which the vocabulary is narrowed to only Major League Baseball™ players when decoding the acoustic span corresponding to the words “José I'll to the” as emitted by the primary recognizer”; Par 194 – “Thus, the server may infer the presence of proper names in the text as described below and prepare one or more hypotheses 815 for their resolution. The hypotheses 815 may be submitted to the client. The client may then identify proper name entities from the various components 840 a-e of the user device. For example, a GPS 840 a component may provide relevant street names near the user's location, an address book 840 c may store the user's 805 contacts, a search cache 840 d may reflect recent inquiries and operations performed by the user 805, and a calendar 840 b may reflect meetings and events associated with user 805. The content from one or more of these components may be considered when identifying proper name entities as discussed herein.”).


REGARDING CLAIM 9, PRINTZ discloses the speech recognition method of claim 1, further comprising: 
obtaining a plurality of candidate target texts by replacing the phonetic symbol sequence with each of the words, in response to the phonetic symbol sequence corresponding to a plurality of words (PRINTZ Fig. 10 – “Confidence values indicate that one or more resolutions were identified? 1055”; Par 208 – “At block 1055, the system may determine which of the proposed proper entities for the acoustic spans (and/or the confidence levels associated with a hypothesis without acoustic spans) best corresponds to the utterance. For example, the system may identify the resolution with the highest cumulative confidence values. This determination may be made by considering one or more of the original, open dictation ASR confidence values, the original NLU confidence values, the ASR grammar-based confidence values determined at block 1050, and possibly a second NLU determination using the ASR grammar-based results, as part of a “Score Fusion” 920.”); 
calculating a score of each of the candidate target texts using a language model (PRINTZ Par 327 – “Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence. It will be apparent to one skilled in the art that this language model score may itself be conditioned upon one or more of: the putative command type, the putative span type, the putative span decoding, the location of the phoneme sequence with respect to the target span (viz., immediately preceding or immediately following the target span), one or more adjacent decoded words, or other known or hypothesized characteristics of the utterance.”); and 
determining, to be the target text, a candidate target text having a greatest score among calculated scores of the candidate target texts (PRINTZ Fig. 10 – “Confidence values indicate that one or more resolutions were identified? 1055”; Par 208 – “At block 1055, the system may determine which of the proposed proper entities for the acoustic spans (and/or the confidence levels associated with a hypothesis without acoustic spans) best corresponds to the utterance. For example, the system may identify the resolution with the highest cumulative confidence values. This determination may be made by considering one or more of the original, open dictation ASR confidence values, the original NLU confidence values, the ASR grammar-based confidence values determined at block 1050, and possibly a second NLU determination using the ASR grammar-based results, as part of a “Score Fusion” 920.”).


REGARDING CLAIM 10, PRINTZ discloses the speech recognition method of claim 1, wherein the phonetic symbol sequence is associated with a pronunciation of the target word (PRINTZ Fig. 39; Par 353 – “As before the line in FIG. 39 labeled primary recognizer output (baseforms) shows the sequence of baseforms for the whole utterance decoded by the primary recognizer; immediately beneath this the line primary recognizer output (phonemes) shows the actual phoneme sequence corresponding to each indicated baseform.”; Par 62 – “A “baseform” refers to a triple that associates: (1) a word as a lexical object (that is, a sequence of letters as a word is typically spelled); (2) an index that can be used to distinguish many baseforms for the same word from one another; and (3) a pronunciation for the word, comprising a sequence of phonemes. A given word may have several associated baseforms, distinguished by their pronunciation.”) corresponding to a proper noun (PRINTZ Par 92 – “A “span type” is the putative type of the proper name entity believed to be present within the span; thus a personal name, business name, numbered street address, etc.”; Par 208 – “At block 1055, the system may determine which of the proposed proper entities for the acoustic spans (and/or the confidence levels associated with a hypothesis without acoustic spans) best corresponds to the utterance.”).


REGARDING CLAIM 11, PRINTZ discloses the speech recognition method of claim 1, wherein the speech recognition model (PRINTZ Fig. 39; Par 359 – “It should be noted that as previously discussed, and as will be known to those skilled in the art, the phonemes used in the adaptation objects, which are derived from the primary recognizer output, may be the context-independent phonemes typically associated with baseforms, the specific context-dependent phonemes decoded by the primary recognizer, if these are present in the primary recognizer output, or some admixture of the two. Accordingly, all such embodiments, whether they use context-independent or context-dependent phonemes, are also included within the scope of the invention.”) comprises: 
an encoder configured to extract a vector value from the received speech data (PRINTZ Par 65 – “A “feature vector” is a multi-dimensional vector, with elements that are typically real numbers, comprising a processed representation of the audio in one frame of speech. A new feature vector may be computed for each 10 ms advance within the source utterance. See “frame.” A “frame” is the smallest individual element of a waveform that is matched by an ASR system's acoustic model, and may typically comprise approximately 200 ms of speech. For the purpose of computing feature vectors, successive frames of speech may overlap, with each new frame advancing, e.g., 10 ms within the source utterance.”; Par 366 – “While these two recognizers may operate on entirely different principles, they may equally well share significant internal operating details, notably including the so-called front end and the associated feature vectors or other intermediate representations of the speech signal that it produces, an acoustic model, neural network or other computational device for evaluating the quality of a given acoustic match, or some other internal device or mechanism.”); and 
a decoder configured to output the candidate text corresponding to the received speech data based on the vector value (PRINTZ Par 79 – “A “primary recognizer” or “primary decoder” is a conventional open dictation automatic speech recognition (ASR) system, in principle capable of transcribing an utterance comprised of an arbitrary sequence of words in the system's large but nominally fixed vocabulary.”; Par 65 – “A “feature vector” is a multi-dimensional vector, with elements that are typically real numbers, comprising a processed representation of the audio in one frame of speech. A new feature vector may be computed for each 10 ms advance within the source utterance. See “frame.” A “frame” is the smallest individual element of a waveform that is matched by an ASR system's acoustic model, and may typically comprise approximately 200 ms of speech. For the purpose of computing feature vectors, successive frames of speech may overlap, with each new frame advancing, e.g., 10 ms within the source utterance.”).


REGARDING CLAIM 12, PRINTZ discloses a non-transitory computer-readable storage medium storing instructions that, when executed by a processor (PRINTZ Fig. 42; Par 376 – “Processors”), cause the processor to perform the speech recognition method of claim 1; thus, it is rejected under the same rationale.


REGARDING CLAIM 13, PRINTZ discloses a speech recognition apparatus comprising: 
a processor (PRINTZ Fig. 42; Par 376 – “Processors”) configured to: perform the method of Claim 1; thus, it is rejected under the same rationale.

Claim 14 is similar to the method of Claim 2; thus it is rejected under the same rationale.

Claim 15 is similar to the method of Claim 3; thus it is rejected under the same rationale.

REGARDING CLAIM 16, PRINTZ discloses the speech recognition apparatus of claim 14, wherein the processor is further configured to: 
determine the replacement word corresponding to the phonetic symbol sequence (PRINTZ Fig. 39 – “target section match pak shak [c_id=2]”; Par 327 – “One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.”; Par 354 – “This functionality is illustrated in the FIG. 39 graphic labeled ww-3 -lsp3 -rsp3 -3-slotted-contact-name.g: (decoding path). Reading left to right, the exhibited decoding path first comprises a forced alignment of the ww prefix section against the ww prefix acoustic span; this of course is required by the grammar structure. Next, the secondary recognizer chooses the path T->UW->εla->pak shak [c_id=2]-> εrs′->ER.”) using dictionary data (PRINTZ Par 327 – “lexicon”; Fig. 10 – “Apply grammar-based ASR with user-specific proper name dictionary 1050”; Fig. 16 – “businesses retrieved by previous query … -> business-names.g”; Par 374 – “For example the command “look up José Altuve's stats” may be misrecognized “look up José I'll to the stats.” But this misrecognition may be easily corrected by a secondary recognizer for which the vocabulary is narrowed to only Major League Baseball™ players when decoding the acoustic span corresponding to the words “José I'll to the” as emitted by the primary recognizer”; Par 132 – “This may be done by preparing a grammar, illustrated in graphical form in FIG. 2, that contains exactly these names, and compiling it into a binary form so that it is ready for use by the secondary recognizer. 
This operation, which may typically take a few hundred milliseconds, may be performed immediately upon receiving from Yelp® the list of names to be shown on the tablet display. Compilation may involve (1) obtaining one or more pronunciations for each indicated word in the grammar (this may typically be done by first searching a vocabulary, but if this search fails any required pronunciations may be automatically generated by a “grapheme to phoneme” or “g2p” processing module, which applies the standard rules of English language pronunciation to the given word spelling to produce one or more plausible pronunciations), (2) creating a computational structure that permits words to be decoded only in the order allowed by the grammar, (3) attaching to this structure operations to be performed on indicated meaning variables when a given decoding is obtained (which may typically comprise assigning values to these variables), and (4) emitting this structure in such form that it may be immediately loaded by a suitable grammar-based ASR system and used to guide its decoding of audio input. This compiled grammar, denoted “business-names.g” in FIG. 3, may be labeled with its type (in this case effectively business-names) and held for possible future use. In some embodiments, this comprises the adaptation object generation step.”) corresponding to the category indicated by the identifier pair (PRINTZ Figs. 12 and 13 – “Acoustic span [1520, 1750]”; Fig. 
16 – “C – detailsApnrCommand[<business name span>[2100,2600]]”; Par 89 – “A “span” is a contiguous section of the input utterance, identified by its start time and end time within the whole of the input utterance (hereafter called the “span extent”), hypothesized to comprise a proper name entity, and labeled with the putative type of this entity (hereafter called the “span type”).”; Par 92 – “A “span type” is the putative type of the proper name entity believed to be present within the span; thus a personal name, business name, numbered street address, etc.”; Par 374 – “For example the command “look up José Altuve's stats” may be misrecognized “look up José I'll to the stats.” But this misrecognition may be easily corrected by a secondary recognizer for which the vocabulary is narrowed to only Major League Baseball™ players when decoding the acoustic span corresponding to the words “José I'll to the” as emitted by the primary recognizer”) among sets of dictionary data respectively corresponding to different categories (Fig. 10 – “Apply grammar-based ASR with user-specific proper name dictionary 1050”; Par 194 – “Thus, the server may infer the presence of proper names in the text as described below and prepare one or more hypotheses 815 for their resolution. The hypotheses 815 may be submitted to the client. The client may then identify proper name entities from the various components 840 a-e of the user device. For example, a GPS 840 a component may provide relevant street names near the user's location, an address book 840 c may store the user's 805 contacts, a search cache 840 d may reflect recent inquiries and operations performed by the user 805, and a calendar 840 b may reflect meetings and events associated with user 805. 
The content from one or more of these components may be considered when identifying proper name entities as discussed herein.”; Par 207 – “At block 1050, the system, e.g., the client device, may decode each probable first name segment against its first name grammar. Block 1050 may generally correspond to the “Secondary Recognition” 915 step. In the some embodiments, the “Secondary Recognition” 915 step reduces to little more than inserting the most likely grammar decoding result in the appropriate location in the text output by “Primary Recognition” 905 and/or “Understanding” 910 operations.”; Par 275 – “Upon receipt of the output of the understanding step, the adaptation object preparation step uses this information to construct the adaptation object, comprising the grammar ww-contact-name.g, as shown within FIG. 20. This is done by assembling the indicated sections, respectively the whole waveform prefix section, the target section, and the whole waveform suffix section, each as depicted in FIG. 20. … Note that this target section may have been constructed separately, at the time of registration of the user contact names. That the user contact name target section is incorporated into the adaptation object, as opposed to some other kind of target section (for example, registered business names, numbered street addresses within Menlo Park, Calif., geographically proximate business names, or some other type appropriate to a different instance of adaptive proper name recognition), is a consequence of the putative target span type user-contact-name provided by the language understanding step.”); and 
replace the phonetic symbol sequence included in the candidate text (PRINTZ Fig. 39 –“target baseforms and phonemes TUWPAOK SHAAKER”) with the determined replacement word  (PRINTZ Fig. 39 – “target section match pak shak [c_id=2]”; Par 327 – “One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.” Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence. It will be apparent to one skilled in the art that this language model score may itself be conditioned upon one or more of: the putative command type, the putative span type, the putative span decoding, the location of the phoneme sequence with respect to the target span (viz., immediately preceding or immediately following the target span), one or more adjacent decoded words, or other known or hypothesized characteristics of the utterance.”; Par 354 – “This functionality is illustrated in the FIG. 39 graphic labeled ww-3 -lsp3 -rsp3 -3-slotted-contact-name.g: (decoding path). Reading left to right, the exhibited decoding path first comprises a forced alignment of the ww prefix section against the ww prefix acoustic span; this of course is required by the grammar structure. Next, the secondary recognizer chooses the path T->UW->εla->pak shak [c_id=2]-> εrs′->ER.”), 
wherein each of the sets of dictionary data corresponding to the different categories comprises information associated with a phonetic symbol sequence corresponding to each of words (PRINTZ Par 97 – “A “vocabulary” is, informally, a list of the words with associated pronunciations, which forms part of the input to an ASR system, and which defines the words that could in principle be recognized by such a system. Formally, the term may refer to a list of baseforms. Also sometimes called a “lexicon.””; Par 62 – “(3) a pronunciation for the word, comprising a sequence of phonemes.”; Par 132 – “This may be done by preparing a grammar, illustrated in graphical form in FIG. 2, that contains exactly these names, and compiling it into a binary form so that it is ready for use by the secondary recognizer. This operation, which may typically take a few hundred milliseconds, may be performed immediately upon receiving from Yelp® the list of names to be shown on the tablet display. Compilation may involve (1) obtaining one or more pronunciations for each indicated word in the grammar (this may typically be done by first searching a vocabulary, but if this search fails any required pronunciations may be automatically generated by a “grapheme to phoneme” or “g2p” processing module, which applies the standard rules of English language pronunciation to the given word spelling to produce one or more plausible pronunciations), (2) creating a computational structure that permits words to be decoded only in the order allowed by the grammar, (3) attaching to this structure operations to be performed on indicated meaning variables when a given decoding is obtained (which may typically comprise assigning values to these variables), and (4) emitting this structure in such form that it may be immediately loaded by a suitable grammar-based ASR system and used to guide its decoding of audio input. This compiled grammar, denoted “business-names.g” in FIG. 
3, may be labeled with its type (in this case effectively business-names) and held for possible future use. In some embodiments, this comprises the adaptation object generation step.”) in each of the categories (Fig. 10 – “Apply grammar-based ASR with user-specific proper name dictionary 1050”; Par 374 – “For example the command “look up José Altuve's stats” may be misrecognized “look up José I'll to the stats.” But this misrecognition may be easily corrected by a secondary recognizer for which the vocabulary is narrowed to only Major League Baseball™ players when decoding the acoustic span corresponding to the words “José I'll to the” as emitted by the primary recognizer”; Par 194 – “Thus, the server may infer the presence of proper names in the text as described below and prepare one or more hypotheses 815 for their resolution. The hypotheses 815 may be submitted to the client. The client may then identify proper name entities from the various components 840 a-e of the user device. For example, a GPS 840 a component may provide relevant street names near the user's location, an address book 840 c may store the user's 805 contacts, a search cache 840 d may reflect recent inquiries and operations performed by the user 805, and a calendar 840 b may reflect meetings and events associated with user 805. The content from one or more of these components may be considered when identifying proper name entities as discussed herein.”).


REGARDING CLAIM 17, PRINTZ discloses the speech recognition apparatus of claim 13, wherein the processor is further configured to: 
determine the replacement word corresponding to the phonetic symbol sequence, using dictionary data (PRINTZ Fig. 39 – “target section match pak shak [c_id=2]”; Par 327 – “One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.” Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence. It will be apparent to one skilled in the art that this language model score may itself be conditioned upon one or more of: the putative command type, the putative span type, the putative span decoding, the location of the phoneme sequence with respect to the target span (viz., immediately preceding or immediately following the target span), one or more adjacent decoded words, or other known or hypothesized characteristics of the utterance.”; Par 374 – “For example the command “look up José Altuve's stats” may be misrecognized “look up José I'll to the stats.” But this misrecognition may be easily corrected by a secondary recognizer for which the vocabulary is narrowed to only Major League Baseball™ players when decoding the acoustic span corresponding to the words “José I'll to the” as emitted by the primary recognizer”) including information associated with a plurality of words and a phonetic symbol sequence corresponding to each of the words (PRINTZ Par 97 – “A “vocabulary” is, informally, a list of the words with associated pronunciations, which forms part of the input to an ASR system, and which defines the words that could in 
principle be recognized by such a system. Formally, the term may refer to a list of baseforms. Also sometimes called a “lexicon.””; Par 62 – “(3) a pronunciation for the word, comprising a sequence of phonemes.”; Par 132 – “Compilation may involve (1) obtaining one or more pronunciations for each indicated word in the grammar (this may typically be done by first searching a vocabulary, but if this search fails any required pronunciations may be automatically generated by a “grapheme to phoneme” or “g2p” processing module, which applies the standard rules of English language pronunciation to the given word spelling to produce one or more plausible pronunciations),”; Par 349 – “Indeed, we will shortly consider another variant, wherein the grammar arcs are labeled with, or equivalently the slotted grammar slots are populated with, individual phonemes drawn from the secondary recognizer's phoneme alphabet. For economy of reference in the discussion we will sometimes refer to the different objects that may be used as arc labels or slot contents—which are words (specifically literals), baseforms and phonemes—as “grammar labels” or simply “labels.””).


Claim 18 is similar to the method of Claim 9; thus it is rejected under the same rationale.

REGARDING CLAIM 19, PRINTZ discloses a speech recognition method comprising: 
receiving speech data (PRINTZ Fig. 10 – “Receive utterance waveform from user 1005”; Par 204 – “FIG. 10 is a flow diagram depicting various steps in a proper name recognition process as may occur in some embodiments. At block 1005, the system may receive an utterance waveform from a user.”); 
obtaining, from the received speech data, a target phonetic symbol sequence (PRINTZ Fig. 39 – “primary recognizer output (phonemes): SEHND AH MEHSAHJH TUWPAOK SHAAKER YUH KAHMIHNG TAHNAYT”; Par 359 – “It should be noted that as previously discussed, and as will be known to those skilled in the art, the phonemes used in the adaptation objects, which are derived from the primary recognizer output, may be the context-independent phonemes typically associated with baseforms, the specific context-dependent phonemes decoded by the primary recognizer, if these are present in the primary recognizer output, or some admixture of the two. Accordingly, all such embodiments, whether they use context-independent or context-dependent phonemes, are also included within the scope of the invention.”)  that represents a pronunciation of a target word included in the received speech data (PRINTZ Fig. 39; Par 353 – “As before the line in FIG. 39 labeled primary recognizer output (baseforms) shows the sequence of baseforms for the whole utterance decoded by the primary recognizer; immediately beneath this the line primary recognizer output (phonemes) shows the actual phoneme sequence corresponding to each indicated baseform.”; Par 62 – “A “baseform” refers to a triple that associates: (1) a word as a lexical object (that is, a sequence of letters as a word is typically spelled); (2) an index that can be used to distinguish many baseforms for the same word from one another; and (3) a pronunciation for the word, comprising a sequence of phonemes. A given word may have several associated baseforms, distinguished by their pronunciation.”) and an identifier pair that identifies a category of the target word (PRINTZ Figs. 12 and 13 – “Putative type of span: “Business Name”; Fig. 
16 – “C – detailsApnrCommand[<business name span>[2100,2600]]”; Par 89 – “A “span” is a contiguous section of the input utterance, identified by its start time and end time within the whole of the input utterance (hereafter called the “span extent”), hypothesized to comprise a proper name entity, and labeled with the putative type of this entity (hereafter called the “span type”).”; Par 92 – “A “span type” is the putative type of the proper name entity believed to be present within the span; thus a personal name, business name, numbered street address, etc.”);
comparing the target phonetic symbol sequence with one or more other phonetic symbol sequences (PRINTZ Fig. 39 –“target baseforms and phonemes TUWPAOK SHAAKER”; Par 254 – “This may be accomplished by comparing the nominal phoneme sequence of the primary decoding with the contents of the vocabulary, and using the language model to hypothesize plausible alternate word divisions, which can then be reflected in the associated grammar structure.”; Par 327 – “Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence. It will be apparent to one skilled in the art that this language model score may itself be conditioned upon one or more of: the putative command type, the putative span type, the putative span decoding, the location of the phoneme sequence with respect to the target span (viz., immediately preceding or immediately following the target span), one or more adjacent decoded words, or other known or hypothesized characteristics of the utterance.”), based on the category of the target word (PRINTZ Figs. 12 and 13 – “Putative type of span: “Business Name”; Fig. 
16 – “C – detailsApnrCommand[<business name span>[2100,2600]]”; Par 89 – “A “span” is a contiguous section of the input utterance, identified by its start time and end time within the whole of the input utterance (hereafter called the “span extent”), hypothesized to comprise a proper name entity, and labeled with the putative type of this entity (hereafter called the “span type”).”; Par 92 – “A “span type” is the putative type of the proper name entity believed to be present within the span; thus a personal name, business name, numbered street address, etc.”), each of the other phonetic symbol sequences (PRINTZ Par 254 – “This may be accomplished by comparing the nominal phoneme sequence of the primary decoding with the contents of the vocabulary, and using the language model to hypothesize plausible alternate word divisions, which can then be reflected in the associated grammar structure.”) being associated with a replacement word (PRINTZ Fig. 39 – “target section match pak shak [c_id=2]”; Par 327 – “One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.”; Par 354 – “This functionality is illustrated in the FIG. 39 graphic labeled ww-3 -lsp3 -rsp3 -3-slotted-contact-name.g: (decoding path). Reading left to right, the exhibited decoding path first comprises a forced alignment of the ww prefix section against the ww prefix acoustic span; this of course is required by the grammar structure. Next, the secondary recognizer chooses the path T->UW->εla->pak shak [c_id=2]-> εrs′->ER.”); and 
outputting a target text corresponding to the received speech data by replacing the target word with one of the replacement words, based on the comparing (PRINTZ Fig. 39 – “send(01) a(02) message(01) .. pak shak … ”; Par 316 – “This, in turn, frees the secondary recognizer to find the best acoustic match, permitted by the target section, to the now effectively reduced target acoustic span. As shown in the graphic labeled ww-3-ls2-rs1-3-slotted-contact-name.g: (decoding path) this best acoustic match is the literal sequence “pak shak.” The secondary recognizer traverses the target section via this arc, thereby yielding the final (correct) transcription “send a message to pak shak are you coming tonight SIL,” along with the semantic meaning variable assignment c_id=2.”; Par 206 – “At block 1040, the system may attempt fulfilment using the symbolic representation and return any results to the user.”).


REGARDING CLAIM 21, PRINTZ discloses the speech recognition method of claim 19, wherein the target phonetic symbol sequence is included in candidate text (PRINTZ FIG. 39 – “PRIMARY RECOGNIZER OUTPUT (baseforms): send(01) a(02) message(01) tupac(01) shakur(01) you(03) coming(01) tonight(01)”; Fig. 39; Par 359 – “It should be noted that as previously discussed, and as will be known to those skilled in the art, the phonemes used in the adaptation objects, which are derived from the primary recognizer output, may be the context-independent phonemes typically associated with baseforms, the specific context-dependent phonemes decoded by the primary recognizer, if these are present in the primary recognizer output, or some admixture of the two. Accordingly, all such embodiments, whether they use context-independent or context-dependent phonemes, are also included within the scope of the invention.”) comprising at least one subword that precedes or follows the target phonetic symbol sequence (PRINTZ Fig. 39 – “target baseforms and phonemes TUWPAOK SHAAKER”; Par 354 – “This functionality is illustrated in the FIG. 39 graphic labeled ww-3 -lsp3 -rsp3 -3-slotted-contact-name.g: (decoding path). Reading left to right, the exhibited decoding path first comprises a forced alignment of the ww prefix section against the ww prefix acoustic span; this of course is required by the grammar structure. Next, the secondary recognizer chooses the path T->UW->εla->pak shak [c_id=2]-> εrs′-> ER.”; Par 359 – “But unlike previous embodiments, the left shim and right shim each now contain a linear sequence of three arcs, respectively labeled with the first three and last three phonemes output by the primary recognizer, for the putative target span. Thus, the arcs of the left shim are labeled “T” “UW” “P” and the arcs of the right shim are labeled “AA” “K” “ER”. The choice of three phonemes for each shim is arbitrary and reflects a design that is known to work well in practice. 
Designs with a larger or smaller number of phonemes are possible and also fall within the scope of the invention, as do designs with differing numbers of phonemes in the left and right shims.”).


REGARDING CLAIM 22, PRINTZ discloses the speech recognition method of claim 21, wherein the identifier pair comprises a first character that indicates a start of the target phonetic symbol sequence and a second character that indicates an end of the target phonetic symbol sequence (PRINTZ Figs. 12 and 13 – “Acoustic span [1520, 1750]”; Fig. 16 – “C – detailsApnrCommand[<business name span>[2100,2600]]”; Par 89 – “A “span” is a contiguous section of the input utterance, identified by its start time and end time within the whole of the input utterance (hereafter called the “span extent”), hypothesized to comprise a proper name entity, and labeled with the putative type of this entity (hereafter called the “span type”).”; Par 146 – “This yields a transcription of the utterance (and possibly alternate transcriptions as well) in the vocabulary of the open dictation recognizer, plus nominal start and end times for each transcribed word.”; Par 148 – “The output of the primary recognition step, comprising (1) a nominal transcription, (2) the start time and end time within the waveform of each transcribed word at the granularity of a single frame and (3) possibly other information, described further below, of use in determining the extent and type of any acoustic spans, may then be passed to the understanding step.”; Par 216 – “FIG. 13 is an example of a second hypothesis breakdown based upon the third proposal 1115 c in the example of FIG. 11 as may occur in some embodiments. With regard to the third proposal 1115 c, the system may recognize that the “Guddu de Karahi” portion between 1220 ms and 1750 ms could not be recognized. Accordingly, a hypothesis 1205 having an acoustic span between 1220 ms and 1750 ms may be generated, and the prefix and suffix portions adjusted accordingly. 
The NLU may again identify the proper noun as a “Business Name” in the Putative Type but may instead consider the general inquiry as an “Address Book” query, limiting the search to only the address book contents.”; Par 353 – “Note the two nested structures of epsilon arcs—those labeled εls″, εls′ and εls within the left shim and εrs, εrs′ and εrs″ within the right shim. These yield the desired property of freeing the secondary recognizer to match or exclude from matching only portions of the waveform corresponding to contiguous sequences of left shim phonemes or right shim phonemes, thereby permitting the secondary recognizer to expand the now-reduced target acoustic span, at the granularity of an individual phoneme, to obtain the best possible match of the target section. Moreover, to find this best possible match, only the phonemes that appeared in the primary recognizer output, in the order in which they appeared, need be considered.”;), and the at least one subword is separated from the target phonetic symbol sequence in the candidate text by one or both of the first character and the second character (PRINTZ Par 354 – “This functionality is illustrated in the FIG. 39 graphic labeled ww-3 -lsp3 -rsp3 -3-slotted-contact-name.g: (decoding path). Reading left to right, the exhibited decoding path first comprises a forced alignment of the ww prefix section against the ww prefix acoustic span; this of course is required by the grammar structure. Next, the secondary recognizer chooses the path T->UW->εla->pak shak [c_id=2]-> εrs′->ER.”; Par 356 – “Likewise, by traversing the right shim arc εrs, the secondary recognizer matches the audio of the waveform corresponding to the phoneme sequence beneath it, AA K, within the target section, while matching the audio corresponding to the phoneme ER outside it. 
The remainder of the decoding path comprises a forced alignment of the ww suffix section against the ww suffix acoustic span.”; Par 93 – “A “target span” is the portion of the acoustic span, decoded by a secondary recognition step, that nominally contains the words of the proper name entity. Thus, the term refers to the acoustic span, exclusive of the acoustic prefix words and acoustic suffix words.”; Note that the boundaries of the target phonetic sequence corresponding to “pak shak” are indicated by εla and εrs′, and the subwords TUW and ER are separated from the target.).



Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 6 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over PRINTZ (US 2017/0133010 A1), and further in view of SKOBELTSYN (US 2015/0371632 A1).

REGARDING CLAIM 6, PRINTZ discloses the speech recognition method of claim 5, further comprising: 
calculating a similarity between the phonetic symbol sequence included in the candidate text and each of the phonetic symbol sequences included in the dictionary data (PRINTZ Par 97 – “A “vocabulary” is, informally, a list of the words with associated pronunciations, which forms part of the input to an ASR system, and which defines the words that could in principle be recognized by such a system. Formally, the term may refer to a list of baseforms. Also sometimes called a “lexicon.””; Par 62 – “(3) a pronunciation for the word, comprising a sequence of phonemes.”; Par 327 – “One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.” Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence.”; Par 254 – “This may be accomplished by comparing the nominal phoneme sequence of the primary decoding with the contents of the vocabulary, and using the language model to hypothesize plausible alternate word divisions, which can then be reflected in the associated grammar structure.”); and 
determining, as the replacement word corresponding to the phonetic symbol sequence included in the candidate text, a word corresponding to a phonetic symbol sequence having a [greatest] similarity among calculated similarities of the phonetic symbol sequences included in the dictionary data (PRINTZ Par 327 – “One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.” Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence.”; Par 254 – “This may be accomplished by comparing the nominal phoneme sequence of the primary decoding with the contents of the vocabulary, and using the language model to hypothesize plausible alternate word divisions, which can then be reflected in the associated grammar structure.”).
PRINTZ does not explicitly teach the [square-bracketed] limitation. In other words, PRINTZ teaches determining a word as the replacement word by comparing the phoneme sequence of a proper noun to phoneme sequences of words in the lexicon/vocabulary/dictionary/grammar. The match is found by “a similar search of the lexicon.” However, PRINTZ does not explicitly teach finding the word with the greatest similarity.

SKOBELTSYN discloses the [square-bracketed] limitations. SKOBELTSYN discloses a method/system for entity name recognition comprising:
calculating a similarity between the phonetic symbol sequence included in the candidate text and each of the phonetic symbol sequences (SKOBELTSYN Par 9 – “In some aspects, determining that the phonetic representation of the second term matches a particular phonetic representation of a particular canonical name of a set of canonical names associated with a particular entity includes determining a match score based on a distance between the phonetic representation of the second term and the particular phonetic representation and determining that the match score satisfies a predetermined threshold match score.”) included in the dictionary data (SKOBELTSYN Fig. 1 – “Entity Type Specific Geo-Localized Database 132”); and 
determining, as the replacement word corresponding to the phonetic symbol sequence included in the candidate text, a word corresponding to a phonetic symbol sequence having a [greatest] similarity among calculated similarities of the phonetic symbol sequences included in the dictionary data (SKOBELTSYN Par 37 – “The transcription verifier 180 may determine the match score based on one or more of (i) a phonetic distance between the phonetic representation of the name term and the particular phonetic representation for the entity, (ii) a popularity of the entity, and (iii) a proximity of the entity to the geographic location associated with the utterance 152. For example, a greater phonetic distance may result in a lower match score, a greater popularity of the entity may result in a greater match score, and a greater proximity of the entity to the geographic location associated with the utterance 152 may result in a greater match score.”; Par 39 – “The transcription verifier 180 may determine that a phonetic representation of the name term matches a particular phonetic representation for an entity stored in the entity type-specific, geo-localized entity database 132 determined to correspond to the candidate transcription based on a match score by determining an entity that is associated with the highest match score with the utterance 152 and determining a match if the match score for that entity satisfies a predetermined criteria. In some implementation, the predetermined criteria may be that the highest match score, e.g., 95%, is above a predetermined match score threshold, e.g., 90%.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of PRINTZ to include selecting the word with the highest score, as taught by SKOBELTSYN.
One of ordinary skill would have been motivated to include selecting the word with the highest score, in order to accurately transcribe an utterance (SKOBELTSYN Par 4).
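For illustration only, the combination relied upon above (a lexicon search of the kind described in PRINTZ, with selection of the highest-scoring entry and a threshold check of the kind described in SKOBELTSYN) may be sketched as follows. The sketch is not taken from either reference. The lexicon entries, the segmentation of the Fig. 39 phoneme string, the function names, and the use of a normalized edit distance over phoneme symbols as the similarity measure are all assumptions made solely for this example.

```python
# Illustrative sketch only; names, data, and the similarity measure are assumptions.

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Map distance to a score in [0, 1]; a greater distance gives a lower score."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def best_replacement(target, lexicon, threshold=0.0):
    """Return (word, score) for the most similar lexicon entry.

    The word is None when the best score does not satisfy the threshold.
    """
    best_word, best_score = None, -1.0
    for word, phonemes in lexicon.items():
        score = similarity(target, phonemes)
        if score > best_score:
            best_word, best_score = word, score
    if best_score < threshold:
        return None, best_score
    return best_word, best_score


# Hypothetical lexicon: word -> phoneme sequence.
LEXICON = {
    "pak shak": ["P", "AA", "K", "SH", "AA", "K"],
    "tom baker": ["T", "AA", "M", "B", "EY", "K", "ER"],
    "are": ["AA", "R"],
}

# Assumed segmentation of the primary recognizer phonemes for the target span.
TARGET = ["T", "UW", "P", "AO", "K", "SH", "AA", "K", "ER"]

if __name__ == "__main__":
    print(best_replacement(TARGET, LEXICON))
```

In this sketch the threshold parameter plays the role of the predetermined match score threshold: with a strict threshold, no replacement word is returned even though a best-scoring entry exists.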


REGARDING CLAIM 20, PRINTZ discloses the speech recognition method of claim 19, further comprising determining the replacement word to replace the target word as one of the respective replacements words that is associated with one of the other phonetic symbol sequences that is [most similar] to the target phonetic symbol sequence (PRINTZ Par 327 – “One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.” Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence.”; Par 254 – “This may be accomplished by comparing the nominal phoneme sequence of the primary decoding with the contents of the vocabulary, and using the language model to hypothesize plausible alternate word divisions, which can then be reflected in the associated grammar structure.”).
PRINTZ does not explicitly teach the [square-bracketed] limitation. In other words, PRINTZ teaches determining a word as the replacement word by comparing the phoneme sequence of a proper noun to phoneme sequences of words in the lexicon/vocabulary/dictionary/grammar. The match is found by “a similar search of the lexicon.” However, PRINTZ does not explicitly teach finding the word with the greatest similarity.

SKOBELTSYN discloses the [square-bracketed] limitations. SKOBELTSYN discloses a method/system for entity name recognition further comprising determining the replacement word to replace the target word as one of the respective replacements words (SKOBELTSYN Par 9 – “In some aspects, determining that the phonetic representation of the second term matches a particular phonetic representation of a particular canonical name of a set of canonical names associated with a particular entity includes determining a match score based on a distance between the phonetic representation of the second term and the particular phonetic representation and determining that the match score satisfies a predetermined threshold match score.”) that is associated with one of the other phonetic symbol sequences that is [most similar] to the target phonetic symbol sequence (SKOBELTSYN Par 37 – “The transcription verifier 180 may determine the match score based on one or more of (i) a phonetic distance between the phonetic representation of the name term and the particular phonetic representation for the entity, (ii) a popularity of the entity, and (iii) a proximity of the entity to the geographic location associated with the utterance 152. For example, a greater phonetic distance may result in a lower match score, a greater popularity of the entity may result in a greater match score, and a greater proximity of the entity to the geographic location associated with the utterance 152 may result in a greater match score.”; Par 39 – “The transcription verifier 180 may determine that a phonetic representation of the name term matches a particular phonetic representation for an entity stored in the entity type-specific, geo-localized entity database 132 determined to correspond to the candidate transcription based on a match score by determining an entity that is associated with the highest match score with the utterance 152 and determining a match if the match score for that entity satisfies a predetermined criteria. In some implementation, the predetermined criteria may be that the highest match score, e.g., 95%, is above a predetermined match score threshold, e.g., 90%.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of PRINTZ to include selecting the word with the highest score, as taught by SKOBELTSYN.
One of ordinary skill would have been motivated to include selecting the word with the highest score, in order to accurately transcribe an utterance (SKOBELTSYN Par 4).



Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over PRINTZ (US 2017/0133010 A1), and further in view of HAN (US 2014/0379335 A1).

REGARDING CLAIM 7, PRINTZ discloses the speech recognition method of claim 5, wherein [the dictionary data is of a trie or hashmap data structure], and the determining comprises: 
retrieving a phonetic symbol sequence corresponding to the phonetic symbol sequence included in the candidate text (PRINTZ Fig. 39 – “target section match pak shak [c_id=2]”; Par 327 – “One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.” Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence. It will be apparent to one skilled in the art that this language model score may itself be conditioned upon one or more of: the putative command type, the putative span type, the putative span decoding, the location of the phoneme sequence with respect to the target span (viz., immediately preceding or immediately following the target span), one or more adjacent decoded words, or other known or hypothesized characteristics of the utterance.”; Par 374 – “For example the command “look up José Altuve's stats” may be misrecognized “look up José I'll to the stats.” But this misrecognition may be easily corrected by a secondary recognizer for which the vocabulary is narrowed to only Major League Baseball™ players when decoding the acoustic span corresponding to the words “José I'll to the” as emitted by the primary recognizer”) from the phonetic symbol sequences included in the dictionary data, using the data structure (PRINTZ Par 97 – “A “vocabulary” is, informally, a list of the words with associated pronunciations, which forms part of the input to an ASR system, and which defines the words that could in principle be 
recognized by such a system. Formally, the term may refer to a list of baseforms. Also sometimes called a “lexicon.””; Par 62 – “(3) a pronunciation for the word, comprising a sequence of phonemes.”; Par 132 – “Compilation may involve (1) obtaining one or more pronunciations for each indicated word in the grammar (this may typically be done by first searching a vocabulary, but if this search fails any required pronunciations may be automatically generated by a “grapheme to phoneme” or “g2p” processing module, which applies the standard rules of English language pronunciation to the given word spelling to produce one or more plausible pronunciations),”; Par 349 – “Indeed, we will shortly consider another variant, wherein the grammar arcs are labeled with, or equivalently the slotted grammar slots are populated with, individual phonemes drawn from the secondary recognizer's phoneme alphabet. For economy of reference in the discussion we will sometimes refer to the different objects that may be used as arc labels or slot contents—which are words (specifically literals), baseforms and phonemes—as “grammar labels” or simply “labels.””); and 
determining a word (PRINTZ Fig. 39 – “send(01) a(02) message(01) .. pak shak … ”; Par 316 – “This, in turn, frees the secondary recognizer to find the best acoustic match, permitted by the target section, to the now effectively reduced target acoustic span. As shown in the graphic labeled ww-3-ls2-rs1-3-slotted-contact-name.g: (decoding path) this best acoustic match is the literal sequence “pak shak.” The secondary recognizer traverses the target section via this arc, thereby yielding the final (correct) transcription “send a message to pak shak are you coming tonight SIL,” along with the semantic meaning variable assignment c_id=2.”) corresponding to the retrieved phonetic symbol sequence to be the replacement word corresponding to the phonetic symbol sequence included in the candidate text (PRINTZ Fig. 39 – “target section match pak shak [c_id=2]”; Par 327 – “One means of compensating for this is to post-process any such user-visible transcription, by which is meant any portion of the secondary transcription that is to be shown to a human user of the system or consumer of its output, and replace phonemes or phoneme sequences with the closest matching word or words present in the lexicon. This strategy, applied to the secondary recognizer transcription fragment “ER you coming tonight” yields “are you coming tonight.” Other more elaborate methods might involve a similar search of the lexicon, and include a language model score as well, when selecting the ordinary-language word or words to replace a phoneme or phoneme sequence. 
It will be apparent to one skilled in the art that this language model score may itself be conditioned upon one or more of: the putative command type, the putative span type, the putative span decoding, the location of the phoneme sequence with respect to the target span (viz., immediately preceding or immediately following the target span), one or more adjacent decoded words, or other known or hypothesized characteristics of the utterance.”; Par 354 – “This functionality is illustrated in the FIG. 39 graphic labeled ww-3 -lsp3 -rsp3 -3-slotted-contact-name.g: (decoding path). Reading left to right, the exhibited decoding path first comprises a forced alignment of the ww prefix section against the ww prefix acoustic span; this of course is required by the grammar structure. Next, the secondary recognizer chooses the path T->UW->εla->pak shak [c_id=2]-> εrs′->ER.”).
PRINTZ does not explicitly teach the [square-bracketed] limitations.

HAN discloses the [square-bracketed] limitations.  HAN discloses a method/system for speech recognition, wherein [the dictionary data is of a trie or hashmap data structure] (HAN Par 37 – “In some embodiments, after Step 201, the method can also include: establishing a dictionary Trie for the respective pinyin and at least one approximate pinyin corresponding to the mentioned each word in the word database. Through establishing the Trie and searching the pinyin corresponding to the input word and the pinyin matching the corresponding approximate pinyin in the Trie, the matching efficiency of word input can be further improved.”; Par 38 – “In some embodiments, the method includes classifying the saved words in the word database in advance. For example, the word database can be divided into words that are contact names of a contact list and words that are song names for songs saved in the terminal device as two different types. Then, respective dictionary tries can be established for matching words with pinyin and approximate pinyin in each category of words. This can further improve the matching efficiency of word input. Further, in some embodiments, the last letter of a pinyin for a word can be tagged with the similarity grade for the pinyin (e.g., full match, 1st grade approximation, 2nd grade approximation, etc.).”; Par 51 – “In some embodiments, Step 204 can include: first, according to the degrees of pronunciation similarity in a descending order, the device successively determines whether the pinyin corresponding to the word and at least one similar pinyin corresponding to the word has a pinyin match in the mentioned dictionary Trie. If there exists one or more pinyin in the mentioned Trie to match the pinyin and approximate pinyin of the word, the device obtains the additional words corresponding to the mentioned matching pinyin according to the mentioned mapping relationship table. 
Through conducting pinyin matching and searching in the Trie, it can also further improve the matching efficiency of word input.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method/system of PRINTZ to include a trie, as taught by HAN.
One of ordinary skill in the art would have been motivated to include a trie in order to improve the matching efficiency of word input (HAN Par 37).
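For illustration only, the trie lookup described in HAN Par 37, 38, and 51 can be sketched as follows. This is a minimal sketch under stated assumptions, not HAN's implementation: the names `TrieNode`, `PinyinTrie`, and `match_words`, the integer grade encoding (0 for a full match, 1 for a first-grade approximation), and the sample contact names are hypothetical and do not appear in either reference.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    # Child nodes keyed by the next letter of the pinyin string.
    children: dict = field(default_factory=dict)
    # (word, grade) pairs tagged on the last letter of a pinyin.
    # Assumption: grade 0 = full match, 1 = first-grade approximation, etc.
    entries: list = field(default_factory=list)


class PinyinTrie:
    """Hypothetical dictionary trie keyed by pinyin, one per word category."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, pinyin, word, grade):
        """Store a word under its pinyin (or an approximate pinyin)."""
        node = self.root
        for letter in pinyin:
            node = node.children.setdefault(letter, TrieNode())
        node.entries.append((word, grade))

    def lookup(self, pinyin):
        """Return the (word, grade) pairs stored for an exact pinyin key."""
        node = self.root
        for letter in pinyin:
            node = node.children.get(letter)
            if node is None:
                return []
        return list(node.entries)


def match_words(trie, candidate_pinyins):
    """Collect matching words, trying candidates in the order given.

    candidate_pinyins is assumed to be sorted in descending order of
    pronunciation similarity to the input word.
    """
    matches = []
    for pinyin in candidate_pinyins:
        # Within one key, prefer lower grades (closer pronunciations).
        for word, _grade in sorted(trie.lookup(pinyin), key=lambda e: e[1]):
            if word not in matches:
                matches.append(word)
    return matches


# Example: a trie for the hypothetical "contact names" category.
contacts = PinyinTrie()
contacts.insert("zhangsan", "Zhang San", 0)    # exact pinyin
contacts.insert("zangsan", "Zhang San", 1)     # approximate pinyin
contacts.insert("zhangshan", "Zhang Shan", 0)  # exact pinyin
contacts.insert("zhangsan", "Zhang Shan", 1)   # approximate pinyin
print(match_words(contacts, ["zhangsan", "zangsan"]))
```

Each lookup walks at most one node per letter of the queried pinyin, regardless of how many words the database holds, which is consistent with the matching-efficiency rationale stated in HAN Par 37.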

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN C KIM, whose telephone number is (571) 272-3327. The examiner can normally be reached Monday through Friday, 8:00 AM to 4:00 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew C Flanders, can be reached at 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JONATHAN C KIM/
Primary Examiner, Art Unit 2655