Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 2, 10, 11 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Maergner (US 20170287474 A1) in view of Bromand (US 20200357390 A1).

With respect to claim 1 Maergner teaches A natural language method extracting a first phoneme string corresponding to one named entity (NE) from a grapheme-based text corpus including texts of different accents or languages for the one NE ([0034] The recognition dictionary 210 may be implemented as a data store, database [corpus], RAM, ROM, and/or other computer-readable media accessible by one or more components in system 200. The recognition dictionary 210 may store a plurality of named entities and a pronunciation for each of the named entities. A pronunciation may comprise a sequence of phonemes [phoneme string], wherein a phoneme represents the smallest distinctive unit of a spoken language… In some embodiments, the recognition dictionary 210 may include multiple pronunciations for the same named entity. For example, the recognition dictionary 210 may store a native pronunciation and one or more foreign pronunciations (e.g., in different languages) of the same named entity. The ASR engine 204 may access the recognition dictionary 210 to obtain a pronunciation for a named entity in a particular language); 
Maergner fails to explicitly disclose the following limitation; however, Bromand teaches generating a phoneme-based training data set by labeling at least one of the NE or speech intention in the first phoneme string ([0041] As one example, if the band name “Bl!nk” [named entity] was included in the training data, then the machine-learning model 107 would learn that an exclamation mark character can be pronounced with the “ih” sound (e.g., the machine-learning model 107 would have modified weights tending to correspond “!” with the pronunciation “ih”). As a result, when the machine-learning model 107 receives input of other terms that include an exclamation mark (e.g., “s!nk”), the machine-learning model 107 would be more likely to produce an output indicating that the pronunciation of the term includes the “ih” sound. As a specific example, the text “bl!nk” can be used as training data and a phonetic representation of the utterance can be used as the label to be applied to the data. This results in the data-label pair: {“bl!nk”, [blink].sup.P}, where [bl!nk].sup.P is a phonetic representation [phoneme-based] of the word “bl!nk”.); and
generating an artificial neural network-based learning model (LM) using the phoneme-based training data set ([0011] In accordance with various embodiments, in an example, there is a computer-implemented method comprising: receiving text data including at least one character that is a non-letter character; providing the text data as input into a trained machine-learning model; and receiving as output from the trained machine-learning model, output data indicative of a pronunciation of the text data; and [0012] In an example, the trained machine-learning model comprises a neural network. In an example, the method includes providing the output data to a text-to-speech system for producing speech output based on the output data. In an example, there is a system comprising a memory storing instructions that, when executed by one or more processors, cause the one or more processors to perform the method. In an example, the system further includes media streaming application instructions stored in a non-transitory memory of a voice-interactive device executable to cause operation of a media streaming application on the voice-interactive device).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Maergner in view of Bromand to generate a phoneme-based training data set by labeling at least one of the NE or speech intention in the first phoneme string, in order to provide a standard way of identifying sounds (phonemes) ([0041], Bromand).
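For illustration only, and not as part of either cited disclosure, the combination relied upon above can be sketched as follows: a dictionary that stores several pronunciations per named entity (as described in Maergner [0034]) and a data-label pair whose label is a phonetic representation (as described in Bromand [0041]). The class name, function name, language codes and phoneme symbols below are hypothetical placeholders chosen for this sketch; they do not appear in the references.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RecognitionDictionary:
    """Hypothetical stand-in for a dictionary of named entities and pronunciations."""
    # entity -> language code -> phoneme sequence
    entries: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)

    def add(self, entity: str, language: str, phonemes: List[str]) -> None:
        # Store one pronunciation (a phoneme sequence) per language for the entity.
        self.entries.setdefault(entity, {})[language] = list(phonemes)

    def pronunciation(self, entity: str, language: str) -> List[str]:
        # Look up the phoneme sequence for the entity in the requested language.
        return self.entries[entity][language]


def make_training_pair(text: str, phonemes: List[str]) -> Dict[str, str]:
    # The text is the data; its phonetic representation is the label.
    return {"data": text, "label": " ".join(phonemes)}


# Illustrative phoneme symbols only (assumed, not taken from the references).
dictionary = RecognitionDictionary()
dictionary.add("Paris", "en", ["P", "AE", "R", "IH", "S"])
dictionary.add("Paris", "fr", ["P", "A", "R", "I"])
pair = make_training_pair("bl!nk", ["B", "L", "IH", "NG", "K"])
```

A neural network trained on such pairs would correspond to the learning model of the final limitation; no training step is shown here because neither cited passage specifies one.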

With respect to claims 2 and 11, Maergner further teaches wherein the text corpus includes at least two languages ([0034] For example, the recognition dictionary 210 may store a native pronunciation and one or more foreign pronunciations (e.g., in different languages) of the same named entity. The ASR engine 204 may access the recognition dictionary 210 to obtain a pronunciation for a named entity in a particular language).

With respect to claim 10 Maergner teaches A natural language processing apparatus, comprising: 
a memory configured to store a grapheme-based text corpus including texts of different accents or languages for one named entity (NE) ([0034] The recognition dictionary 210 may be implemented as a data store, database [corpus], RAM, ROM, and/or other computer-readable media accessible by one or more components in system 200.) and 
a processor configured to ([0030] Each component 103, 105, 107, 109 may be any type of known computer, server, or data processing device. Data server 103, e.g., may include a processor 111 controlling overall operation of the data server 103.): 
extract a first phoneme string corresponding to the one NE from the grapheme-based text corpus ([0034] The recognition dictionary 210 may be implemented as a data store, database [corpus], RAM, ROM, and/or other computer-readable media accessible by one or more components in system 200. The recognition dictionary 210 may store a plurality of named entities and a pronunciation for each of the named entities. A pronunciation may comprise a sequence of phonemes [phoneme string], wherein a phoneme represents the smallest distinctive unit of a spoken language… In some embodiments, the recognition dictionary 210 may include multiple pronunciations for the same named entity. For example, the recognition dictionary 210 may store a native pronunciation and one or more foreign pronunciations (e.g., in different languages) of the same named entity. The ASR engine 204 may access the recognition dictionary 210 to obtain a pronunciation for a named entity in a particular language); 
Maergner fails to explicitly disclose the following limitation; however, Bromand teaches generate a phoneme-based training data set by labeling at least one of the NE or speech intention in the first phoneme string ([0041] As one example, if the band name “Bl!nk” [named entity] was included in the training data, then the machine-learning model 107 would learn that an exclamation mark character can be pronounced with the “ih” sound (e.g., the machine-learning model 107 would have modified weights tending to correspond “!” with the pronunciation “ih”). As a result, when the machine-learning model 107 receives input of other terms that include an exclamation mark (e.g., “s!nk”), the machine-learning model 107 would be more likely to produce an output indicating that the pronunciation of the term includes the “ih” sound. As a specific example, the text “bl!nk” can be used as training data and a phonetic representation of the utterance can be used as the label to be applied to the data. This results in the data-label pair: {“bl!nk”, [blink].sup.P}, where [bl!nk].sup.P is a phonetic representation [phoneme-based] of the word “bl!nk”.); and
generate an artificial neural network-based learning model (LM) using the phoneme-based training data set ([0011] In accordance with various embodiments, in an example, there is a computer-implemented method comprising: receiving text data including at least one character that is a non-letter character; providing the text data as input into a trained machine-learning model; and receiving as output from the trained machine-learning model, output data indicative of a pronunciation of the text data; and [0012] In an example, the trained machine-learning model comprises a neural network. In an example, the method includes providing the output data to a text-to-speech system for producing speech output based on the output data. In an example, there is a system comprising a memory storing instructions that, when executed by one or more processors, cause the one or more processors to perform the method. In an example, the system further includes media streaming application instructions stored in a non-transitory memory of a voice-interactive device executable to cause operation of a media streaming application on the voice-interactive device).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Maergner in view of Bromand to generate a phoneme-based training data set by labeling at least one of the NE or speech intention in the first phoneme string, in order to provide a standard way of identifying sounds (phonemes) ([0041], Bromand).

With respect to claim 16, Maergner further teaches a computer-readable recording medium on which a program for implementing the method according to claim 1 is recorded ([0059] In other embodiments, process 800 illustrated in FIG. 8 and/or one or more steps thereof may be embodied in computer-executable instructions that are stored in a computer-readable medium, such as a non-transitory computer-readable memory).


Claims 3 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Maergner and Bromand as applied to claims 1 and 10 respectively, and in further view of Podmajersky (US 20180210872 A1).

With respect to claims 3 and 12, Maergner and Bromand fail to explicitly disclose the following limitation; however, Podmajersky teaches wherein the text corpus includes at least one dialect ([0047] In an example, the language corpus data 302 is organized around certain kinds of text data, such as dialects associated with particular geographic, social, or other groups.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Maergner and Bromand in view of Podmajersky so that the text corpus includes at least one dialect, in order to prompt the user to provide specific information regarding a style of speech ([0051], Podmajersky).


Claims 4 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Maergner and Bromand as applied to claims 1 and 10 respectively, and in further view of Arel (US 10559299 B1).
With respect to claims 4 and 13, Maergner and Bromand fail to explicitly disclose the following limitations; however, Arel teaches generating an output by extracting a first feature from the text corpus, and applying the first feature to a first model for generating a phoneme (Col 7 ll 1-13: In one embodiment, the acoustic model 110 is a combination of a neural network (e.g., an RNN) and a hidden markov model. In one embodiment, the acoustic model has two main parts, including a Hidden Markov Model (HMM) and a Long Short Term Memory (LSTM) inside the HMM which models feature statistics. Alternatively, the AM may be based on a combination of a Gaussian Mixture Model (GMM) and an HMM (e.g., a GMM-HMM). In one embodiment, the acoustic model 110 is an implementation based on the Kaldi® framework to output phonemes (and optionally non-phonemic or prosodic features) rather than text. Other machine learning models may also be used for the acoustic model 110 [Fig. 1B shows acoustic feature input to the model]); and
generating a phoneme corresponding to each syllable included in the text corpus based on the output (Col 8 ll 40-41: The sequences of symbols (y) may be sequences of words, sequences of syllables, sequences of phonemes, and so on [Fig. 1B shows symbols feeding into the model 110]).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Maergner and Bromand in view of Arel to generate an output by extracting a first feature from the text corpus and applying the first feature to a first model for generating a phoneme, in order to improve the performance of the NLU ([Col 10 ll 66-67], Arel).

Claims 5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Maergner, Bromand and Arel as applied to claims 4 and 13 respectively, and in further view of Lee (US 20020087315 A1).
With respect to claims 5 and 14, Maergner, Bromand and Arel fail to explicitly disclose the following limitation; however, Lee teaches wherein when the texts of different accents or languages for the one NE exist among texts included in the text corpus, the first model is an artificial neural network-based LM trained to generate an output representing the same phoneme string when the texts of different accents or languages are applied to the first model ([0023] The multi-language model creation unit 132 has access to the application dictionary 134, containing the corpus of the domain in use, the application corpora 136, and the web summary information database 138 containing the corpus from web sites, and [0029] FIG. 6 depicts the phonetic knowledge unit 240 that forms one of the recognition assisting databases 48. The phonetic knowledge unit 240 encompasses the degree of similarity 242 between pronunciations for distinct terms 244 and 246. The phonetic knowledge unit 240 understands basic units of sound for the pronunciation of words and sound to letter conversion rules. If, for example, a user requested information on the weather in Tahoma, the phonetic knowledge unit 240 is used to generate a subset of names with similar pronunciation to Tahoma. Thus, Tahoma, Sonoma, and Pomona may be grouped together in a node specific language model for terms with similar sounds [In the BRI sense, two similar sounding words are being considered different-accented words], and [0034] FIG. 9 depicts an embodiment of the present invention for selecting language models. This embodiment utilizes a combination of statistical modeling and conceptual pattern matching with both semantic and phonetic information. The multi-scan control unit 32 receives an initially recognized utterance 40 from the user as a word sequence. The output is first normalized to a standard format. Next semantic and phonetic features are extracted from the normalized word sequence.
Then the acoustic features of the input utterance, in the form of Mel-Frequency Cepstral Coefficients (mfcc) 49, of each frame of the input utterance is mapped against the code book models 50 of each of the phonetic segment of the recognized words to calculate their confidence levels. The semantic feature of the recognized words is represented as attribute-and-value matrices. These include semantic category, syntactic category, application-relevancy, topic-indicator, etc. This representation is then fed into a multi-layer perceptron-based neural network decision layer 51, which has been trained by the learning module 52 to map feature structures to sub-language models 36).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Maergner, Bromand and Arel in view of Lee so that, when the texts of different accents or languages for the one NE exist among texts included in the text corpus, the first model is an artificial neural network-based LM trained to generate an output representing the same phoneme string when the texts of different accents or languages are applied to the first model, in order to increase the accuracy of word recognition ([0024], Lee).

Claims 6 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Maergner and Bromand as applied to claims 1 and 10 respectively, and in further view of Golipour (US 20160125872 A1).
With respect to claims 6 and 15, Maergner and Bromand fail to explicitly disclose the following limitations; however, Golipour teaches wherein the generating the phoneme-based training data set includes: generating an output by extracting a second feature from the first phoneme string, and applying the second feature to a second model for labeling at least one of the NE or the speech intention ([0010] A system operating per this disclosure defines simple “atomic” tokens that are processed by a MaxEnt-based classifier [model] trained on labeled text data. The labels correspond to pronunciations rather than any predefined Named Entity categories [Fig. 1 shows ’letters’. Individual letters are considered to be Named Entities]. The annotation of the training data is a relatively simple task for non-experts. For each class, the system uses a distinct text conversion process to provide normalized text that can be spoken by a synthesizer or used for ASR text normalization purposes.); and
tagging at least one of the NE or the speech intention in the first phoneme string based on the output ([0027] These feature extraction module 212 can compute and extract features either from the token or the word from which the token originates…The feature extraction module 212 generates training data 214 which can be used to train an automatic labeler [labeler tags the letters which are the NE], tokenizer, or other component of a text processing system.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Maergner and Bromand in view of Golipour so that generating the phoneme-based training data set includes generating an output by extracting a second feature from the first phoneme string and applying the second feature to a second model for labeling at least one of the NE or the speech intention, in order to enrich the training data with examples intended to reduce inter-class confusion ([0040], Golipour).

Claims 7 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Maergner and Bromand as applied to claims 1 and 7 respectively, and in further view of Aryal (US 11276389 B1) and Torres (US 20210149993 A1).
With respect to claim 7, Maergner and Bromand fail to explicitly disclose the following limitations; however, Aryal teaches receiving a speech voice (Col 2 ll 39-49: Database 210 comprises utterance recorded from a plurality of speakers along with the corresponding text.);
transcribing a text from the received speech voice (Col 2 ll 39-49: The voice actor is referred to herein as the base speaker. For each target voice, the database 210 contains a few minutes of speech from one or more target speakers along with the transcription of that speech.); 
extracting a second phoneme string from the transcribed text, and extracting a third feature from the second phoneme string (Col 2 ll 61-63: The linguistic feature extraction module 222 is configured to extract phoneme level linguistic feature vectors from a given transcription.); 
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Maergner and Bromand in view of Aryal to extract a second phoneme string from the transcribed text and to extract a third feature from the second phoneme string, in order to generate phoneme level linguistic feature vectors corresponding to a given input sentence ([Col 3 ll 14-17], Aryal).

Maergner, Bromand and Aryal fail to explicitly disclose the following limitation; however, Torres teaches generating an output for determining the NE or the speech intention by applying the third feature to the LM ([0023] The machine learning models in the example embodiments employ multiple layers of processing based on deep learning models. The deep learning models may be based on neural network (NN) models, and [0052] FIG. 7 shows example data classification and confidence modeling processes in accordance with several aspects of example embodiments in this disclosure. At 710, one or more processors may obtain a document comprising a plurality of text tokens. The one or more processors may determine word embeddings corresponding to the text tokens [third feature] based on a pretrained language model. At 730, based on the word embeddings, the one or more processor may determine named entities corresponding to the tokens and may determine accuracy predictions corresponding to the named entities. At 740, the one or more processors may compare the one or more accuracy predictions with one or more thresholds. At 750, the one or more processors may associate the named entities with one or more confidence levels. At 760, the one or more processors may deliver the named entities and the one or more confidence levels.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Maergner, Bromand and Aryal in view of Torres to generate an output for determining the NE or the speech intention by applying the third feature to the LM, because machine learning models, which are based on sample data, are used to make predictions or decisions ([0016], Torres).
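For illustration only, the comparison and association steps quoted from Torres [0052] (steps 740 and 750) can be sketched as below. The function name, the single threshold value and the "high"/"low" level names are assumptions made for this sketch; Torres does not specify them.

```python
from typing import List, Tuple


def assign_confidence(
    predictions: List[Tuple[str, float]], threshold: float = 0.8
) -> List[Tuple[str, str]]:
    """Map each (named entity, accuracy prediction) pair to a confidence level.

    The threshold and the two level names are hypothetical placeholders.
    """
    results = []
    for entity, accuracy in predictions:
        # Compare the accuracy prediction with the threshold (cf. step 740).
        level = "high" if accuracy >= threshold else "low"
        # Associate the named entity with a confidence level (cf. step 750).
        results.append((entity, level))
    return results


# Entity labels borrowed from the receipt example of Torres [0051]; scores are made up.
levels = assign_confidence([("Vendor name", 0.95), ("Date", 0.40)])
```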
With respect to claim 8, Torres further teaches generating a response including the NE or the speech intention based on the output ([0051] FIG. 6 shows example outputs of a data classifier model and a confidence model for example input text tokens. The input text tokens may be, for example, from a receipt issued by a vendor. The text tokens may be input to the BERT model and the output word embeddings may be fed to a data classifier decoder and a confidence modeling decoder. The data classifier may associate each token with one of a plurality of labels/named entities (in this example, Vendor name, Total money value, credit card (CC) number, and Date).).

It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Maergner, Bromand and Aryal in view of Torres to generate an output for determining the NE or the speech intention by applying the third feature to the LM, because machine learning models, which are based on sample data, are used to make predictions or decisions ([0016], Torres).


Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Maergner and Bromand as applied to claim 2, and in further view of Choi (US 20160196257 A1).
With respect to claim 9, Maergner and Bromand fail to explicitly disclose the following limitation; however, Choi teaches wherein the LM includes an acoustic model for predicting a confidence score of the NE or a language model for predicting the speech intention ([0040] The decoder 123 generates, based on the language model 132, a speech recognized sentence of which an intention is semantically appropriate. The language model 132 may include an n-gram language model, a bidirectional recurrent neural network language model, and the like. In an example, the decoder 123 generates a speech recognized sentence by appropriately combining, based on the language model 132, words generated based on the acoustic model 131.).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify Maergner and Bromand in view of Choi so that the LM includes an acoustic model for predicting a confidence score of the NE or a language model for predicting the speech intention, in order to increase the degree to which the corresponding word is appropriate in the sentence ([0043], Choi).



Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ATHAR N PASHA whose telephone number is (408)918-7675. The examiner can normally be reached Monday-Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ATHAR N PASHA/Examiner, Art Unit 2657

/LAMONT M SPOONER/Primary Examiner, Art Unit 2657
7/6/2022