Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
This office action is in response to application 16/891,593, which was filed 06/03/20. Claims 1-20 are pending in the application and have been considered.

Foreign Priority
Receipt is acknowledged of certified copies of papers submitted under 35 U.S.C. 119(a)-(d), which papers have been placed of record in the file.

Specification
On page 1, paragraph [0003], line 3, should “encounter” be “encounters”?

Claim Objections
Claim 5 is objected to because of the following informalities: In lines 2-3, “a fourth correspondence” lacks proper antecedent basis because the claim depends on claim 1 rather than claim 4, which recites “first”, “second” and “third” correspondences. For examining purposes, the examiner will assume Applicant intended this claim to depend on claim 4 rather than claim 1.

Claim Rejections - 35 USC § 103
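In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.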
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


Claims 1-3, 8-11, 13, and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Prabhavalkar et al. (2020/0043483) in view of Liu et al. (2015/0134332).

Consider claim 1, Prabhavalkar discloses a method for speech recognition (speech recognition, [0003]), comprising: 
determining speech features of speech data by feature extraction on the speech data (the feature extraction module processes the audio data to extract a set of feature values indicative of acoustic characteristics of the utterance, [0064]); 
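determining syllable data corresponding to each of the speech features based on a plurality of feature extraction layers and a softmax function layer included in an acoustic model (the speech recognition model is implemented using neural network layers, one being a softmax layer, [0066]), wherein the acoustic model is configured to convert the speech feature into the syllable data (the speech recognition model receives extracted features and provides output indicative of likelihoods of language units, e.g., phonetic units, considered “syllable data”, [0065]); 
determining text data corresponding to the speech data based on a language model, a pronunciation model and the syllable data (end-to-end model includes functions of, and there is considered to be, an acoustic model, language model, and pronunciation model, the acoustic model producing the phonetic units, or syllable data, [0065], generating a set of output labels, such as words, [0073]) wherein the pronunciation model is configured to convert the syllable data into the text data, and the language model is configured to evaluate the text data (the decoder network, which is analogous to the pronunciation and language modeling components in a traditional ASR system, [0039], where each output step i represents the prediction of a different output element of an utterance being recognized, where the output elements are graphemes (e.g., characters), wordpieces, and/or whole words [0068], thus the attention context vector represents a weighted summary of the current and previous encodings and is considered “evaluate the text data” with a “language model”, [0068]); and 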
outputting the text data (displaying the transcription, [0075]). 
Prabhavalkar does not specifically mention a pronouncing dictionary.
Bangalore discloses a pronouncing dictionary (pronunciation dictionary, [0036]).
It would have been obvious to one of ordinary skill in the art to modify the invention of Prabhavalkar by utilizing a pronouncing dictionary such as taught by Bangalore, in addition to or in place of the pronunciation model disclosed by Prabhavalkar, in order to improve automatic speech recognition of difficult non-native words, as suggested by Bangalore ([0006]-[0007]).

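Consider claim 9, Prabhavalkar discloses an electronic device for speech recognition (speech recognition, [0003]), comprising a processor and a memory, the memory storing at least one instruction, and the instruction being loaded and executed by the processor (processor and memory storing instructions, [0083]) to perform: determining speech features of speech data by feature extraction on the speech data (the feature extraction module processes the audio data to extract a set of feature values indicative of acoustic characteristics of the utterance, [0064]); 
determining syllable data corresponding to each of the speech features based on a plurality of feature extraction layers and a softmax function layer included in an acoustic model (the speech recognition model is implemented using neural network layers, one being a softmax layer, [0066]), wherein the acoustic model is configured to convert the speech feature into the syllable data (the speech recognition model receives extracted features and provides output indicative of likelihoods of language units, e.g., phonetic units, considered “syllable data”, [0065]); 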
determining text data corresponding to the speech data based on a language model, a pronunciation model and the syllable data (end-to-end model includes functions of, and there is considered to be, an acoustic model, language model, and pronunciation model, the acoustic model producing the phonetic units, or syllable data, [0065], generating a set of output labels, such as words, [0073]) wherein the pronunciation model is configured to convert the syllable data into the text data, and the language model is configured to evaluate the text data (the decoder network, which is analogous to the pronunciation and language modeling components in a traditional ASR system, [0039], where each output step i represents the prediction of a different output element of an utterance being recognized, where the output elements are graphemes (e.g., characters), wordpieces, and/or whole words [0068], thus the attention context vector represents a weighted summary of the current and previous encodings and is considered “evaluate the text data” with a “language model”, [0068]); and 
outputting the text data (displaying the transcription, [0075]). 
Prabhavalkar does not specifically mention a pronouncing dictionary.
Bangalore discloses a pronouncing dictionary (pronunciation dictionary, [0036]).
It would have been obvious to one of ordinary skill in the art to modify the invention of Prabhavalkar by utilizing a pronouncing dictionary such as taught by Bangalore, in addition to or in place of the pronunciation model disclosed by Prabhavalkar, for reasons similar to those for claim 1.


determining syllable data corresponding to each of the speech features based on a plurality of feature extraction layers and a softmax function layer included in an acoustic model (the speech recognition model is implemented using neural network layers, one being a softmax layer, [0066]), wherein the acoustic model is configured to convert the speech feature into the syllable data (the speech recognition model receives extracted features and provides output indicative of likelihoods of language units, e.g., phonetic units, considered “syllable data”, [0065]); 
determining text data corresponding to the speech data based on a language model, a pronunciation model and the syllable data (end-to-end model includes functions of, and there is considered to be, an acoustic model, language model, and pronunciation model, the acoustic model producing the phonetic units, or syllable data, [0065], generating a set of output labels, such as words, [0073]) wherein the pronunciation model is configured to convert the syllable data into the text data, and the language model is configured to evaluate the text data (the decoder network, which is analogous to the pronunciation and language modeling components in a traditional ASR system, [0039], where each output step i represents the prediction of a different output element of an utterance being recognized, where the output elements are graphemes (e.g., characters), wordpieces, and/or whole words [0068], thus the attention context vector represents a weighted summary of the current and previous encodings and is considered “evaluate the text data” with a “language model”, [0068]); and 
outputting the text data (displaying the transcription, [0075]). 

Consider claim 2, Prabhavalkar discloses determining the syllable data comprises: inputting each of the speech features to the acoustic model (extracted feature values are provided as inputs to the encoder, [0067]); determining an intermediate speech feature extracted from each of the speech features based on the feature extraction layers (mapping the features to a higher level feature representation, [0067]); determining, based on the softmax function, a probability that the intermediate speech feature corresponds to each piece of syllable data in the acoustic model (output indicative of likelihood of language units such as phones, [0065], using output of a softmax layer for the immediately previous time step, [0070]); and determining syllable data with a maximum probability as the syllable data (examining the probabilities and selecting orthographic elements using a beam search, [0073-0074]). 
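
For illustration only, the softmax and maximum-probability selection recited in claim 2 can be sketched as follows. This is a minimal sketch, not the implementation of Prabhavalkar or of the application; the function names, the syllable inventory, and the score values are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def most_likely_syllable(scores, syllables):
    """Return the syllable with the maximum softmax probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return syllables[best], probs[best]

# Hypothetical syllable inventory and scores for one intermediate speech feature.
SYLLABLES = ["ni", "hao", "hello"]
syllable, prob = most_likely_syllable([0.5, 2.0, 1.0], SYLLABLES)
print(syllable, round(prob, 3))
```

The sketch selects a single maximum; the beam search cited at [0073-0074] keeps several candidate hypotheses rather than only the top one.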

Consider claim 3, Prabhavalkar discloses: acquiring at least one piece of sample data (a set of speech utterances, [0037]), wherein each piece of sample data includes a sample speech feature and truth syllable data corresponding to the sample speech feature (the ground-truth label sequence is used as input during training, [0044]); determining the acoustic model by training an initial acoustic model based on predicted syllable data and the truth syllable data, wherein the predicted syllable data are obtained by inputting the sample speech feature into the initial acoustic model (training using a set of 15M hand-transcribed anonymized utterances [0052], for the two attention models for encoder network 110, [0054]). 

It would have been obvious to one of ordinary skill in the art to modify the invention of Prabhavalkar such that the pronouncing dictionary comprises a pronouncing dictionary of a first language and a pronouncing dictionary of a second language for reasons similar to those for claim 1.

Consider claim 10, Prabhavalkar discloses determining the syllable data comprises: inputting each of the speech features to the acoustic model (extracted feature values are provided as inputs to the encoder, [0067]); determining an intermediate speech feature extracted from each of the speech features based on the feature extraction layers (mapping the features to a higher level feature representation, [0067]); determining, based on the softmax function, a probability that the intermediate speech feature corresponds to each piece of syllable data in the acoustic model (output indicative of likelihood of language units such as phones, [0065], using output of a softmax layer for the immediately previous time step, [0070]); and determining syllable data with a maximum probability as the syllable data (examining the probabilities and selecting orthographic elements using a beam search, [0073-0074]).

Consider claim 11, Prabhavalkar discloses acquiring at least one piece of sample data (a set of speech utterances, [0037]), wherein each piece of sample data includes a sample speech feature and truth syllable data corresponding to the sample speech feature (the ground-truth label sequence is used as input during training, [0044]); determining the acoustic model by training an initial acoustic model based on predicted syllable data and the truth syllable data, wherein the predicted syllable data are obtained by inputting the sample speech feature into the initial acoustic model (training using a set of 15M hand-transcribed anonymized utterances [0052], for the two attention models for encoder network 110, [0054]).

Consider claim 13, Prabhavalkar discloses determining the text data comprises: determining preset text data corresponding to the syllable data based on a correspondence between the syllable data and the text data in the pronouncing model (the decoder and associated softmax layer may be trained to output a set of values indicative of the likelihood of occurrence of each of a predetermined set of output labels, [0073]); determining a probability of each piece of preset text data based on the language model (e.g. 100 different probability values, [0073]); and determining the preset text data with a maximum probability as the text data (examining the probabilities and selecting orthographic elements using a beam search, [0073-0074]). 
Prabhavalkar does not specifically mention a pronouncing dictionary.
Bangalore discloses a pronouncing dictionary (pronunciation dictionary, [0036]).
It would have been obvious to one of ordinary skill in the art to modify the invention of Prabhavalkar by utilizing a pronouncing dictionary such as taught by Bangalore, in addition to or in place of the pronunciation model disclosed by Prabhavalkar, for reasons similar to those for claim 1.

Consider claim 16, Prabhavalkar does not, but Bangalore discloses the pronouncing dictionary comprises a pronouncing dictionary of a first language and a pronouncing dictionary of a second language (a three-part second language-to-phoneme to first language spelling database, [0036]). 
It would have been obvious to one of ordinary skill in the art to modify the invention of Prabhavalkar such that the pronouncing dictionary comprises a pronouncing dictionary of a first language and a pronouncing dictionary of a second language for reasons similar to those for claim 1.


Consider claim 19, Prabhavalkar discloses: acquiring at least one piece of sample data (a set of speech utterances, [0037]), wherein each piece of sample data includes a sample speech feature and truth syllable data corresponding to the sample speech feature (the ground-truth label sequence is used as input during training, [0044]); determining the acoustic model by training an initial acoustic model based on predicted syllable data and the truth syllable data, wherein the predicted syllable data are obtained by inputting the sample speech feature into the initial acoustic model (training using a set of 15M hand-transcribed anonymized utterances [0052], for the two attention models for encoder network 110, [0054]).

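Consider claim 20, Prabhavalkar discloses determining the text data comprises: determining preset text data corresponding to the syllable data based on a correspondence between the syllable data and the text data in the pronouncing model (the decoder and associated softmax layer may be trained to output a set of values indicative of the likelihood of occurrence of each of a predetermined set of output labels, [0073]); determining a probability of each piece of preset text data based on the language model (e.g. 100 different probability values, [0073]); and determining the preset text data with a maximum probability as the text data (examining the probabilities and selecting orthographic elements using a beam search, [0073-0074]). 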
Prabhavalkar does not specifically mention a pronouncing dictionary.
Bangalore discloses a pronouncing dictionary (pronunciation dictionary, [0036]).
It would have been obvious to one of ordinary skill in the art to modify the invention of Prabhavalkar by utilizing a pronouncing dictionary such as taught by Bangalore, in addition to or in place of the pronunciation model disclosed by Prabhavalkar, for reasons similar to those for claim 1.


Claims 6, 7, 14, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Prabhavalkar et al. (2020/0043483) in view of Liu et al. (2015/0134332), in further view of Moore (2006/0287847).

Consider claim 6, Prabhavalkar discloses determining the language model by storing each sample word and the occurrence probability corresponding to each sample word into an initial language model (e.g. 100 different probability values, [0073]).
Prabhavalkar and Liu do not specifically mention acquiring sample text corpuses, wherein the sample text corpuses include text corpuses of a first language and text corpuses of a second language; determining a plurality of sample words by segmenting sample text corpuses based on a preset algorithm of word segmentation; and determining an occurrence probability that each sample word occurs in the sample text corpuses.

It would have been obvious to one of ordinary skill in the art to modify the invention of Prabhavalkar and Liu by acquiring sample text corpuses, wherein the sample text corpuses include text corpuses of a first language and text corpuses of a second language; determining a plurality of sample words by segmenting sample text corpuses based on a preset algorithm of word segmentation; and determining an occurrence probability that each sample word occurs in the sample text corpuses in order to reduce processing time for training models, as suggested by Moore ([0005]).
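
For illustration only, the claim 6 limitations (segmenting sample text corpuses into sample words, determining an occurrence probability for each word, and storing each word with its probability) can be sketched as a unigram count. This is a minimal sketch under stated assumptions, not the method of Moore or of the application: whitespace splitting stands in for the unspecified “preset algorithm of word segmentation”, and the corpus contents and function names are hypothetical.

```python
from collections import Counter

def segment(text):
    """Stand-in word segmentation: lowercase and split on whitespace."""
    return text.lower().split()

def build_unigram_model(corpora):
    """Map each sample word to its occurrence probability across all corpora."""
    counts = Counter()
    for text in corpora:
        counts.update(segment(text))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical sample corpuses mixing a first and a second language.
model = build_unigram_model(["i like music", "i like 音乐"])
print(model)
```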

Consider claim 7, Prabhavalkar and Liu do not, but Moore discloses determining the sample text corpuses comprises: acquiring first text corpuses of the first language, second text corpuses of the second language, and a correspondence between the first text corpuses and the second text corpuses (bilingual corpus 210, [0039]); selecting at least one first text sub-corpus from each of the first text corpuses (pairs of aligned text fragments, [0042]); determining a correspondence between the first text sub-corpus and the second text corpus (word alignment operates on aligned text fragments, [0044]); replacing each first text sub-corpus by a second text corpus based on the correspondence between the first text sub-corpuses and the second text corpuses to obtain mixed text corpuses (the word alignments are considered to “replace” the higher level text fragment alignments, since they are used in the final bilingual corpora for training, the text corpuses considered “mixed” since the word level alignments are not in series together, [0042]-[0044]); and determining the mixed text corpuses as the sample text corpuses (outputting the aligned text, [0047]).
It would have been obvious to one of ordinary skill in the art to modify the invention of Prabhavalkar and Liu by acquiring first text corpuses of the first language, second text corpuses of the second language, and a correspondence between the first text corpuses and the second text corpuses; selecting at least one first text sub-corpus from each of the first text corpuses; determining a correspondence between the first text sub-corpus and the second text corpus; replacing each first text sub-corpus by a second text corpus based on the correspondence between the first text sub-corpuses and the second text corpuses to obtain mixed text corpuses; and determining the mixed text corpuses as the sample text corpuses for reasons similar to those for claim 6. 
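
For illustration only, the replacement step recited in claim 7 (selecting first-language sub-corpuses and replacing each with its corresponding second-language text to obtain a mixed corpus) can be sketched as a token substitution. This is a minimal sketch, not the word-alignment procedure of Moore or the implementation of the application; the correspondence table, the tokens, and the function name are hypothetical.

```python
import random

def mix_sentence(tokens, correspondence, n_replace, rng):
    """Replace up to n_replace tokens that have a second-language counterpart."""
    candidates = [i for i, tok in enumerate(tokens) if tok in correspondence]
    chosen = set(rng.sample(candidates, min(n_replace, len(candidates))))
    return [correspondence[tok] if i in chosen else tok
            for i, tok in enumerate(tokens)]

# Hypothetical first-language sentence and first-to-second-language correspondence.
tokens = ["我", "喜欢", "音乐"]
correspondence = {"喜欢": "like", "音乐": "music"}

mixed_all = mix_sentence(tokens, correspondence, 2, random.Random(0))
mixed_one = mix_sentence(tokens, correspondence, 1, random.Random(0))
print(mixed_all, mixed_one)
```

Which token is replaced when n_replace is 1 depends on the random seed; only the number of replacements is fixed.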

Consider claim 14, Prabhavalkar discloses determining the language model by storing each sample word and the occurrence probability corresponding to each sample word into an initial language model (e.g. 100 different probability values, [0073]).
Prabhavalkar and Liu do not specifically mention acquiring sample text corpuses, wherein the sample text corpuses include text corpuses of a first language and text corpuses of a second language; determining a plurality of sample words by segmenting sample text corpuses based on a preset algorithm of word segmentation; and determining an occurrence probability that each sample word occurs in the sample text corpuses.
Moore discloses acquiring sample text corpuses, wherein the sample text corpuses include text corpuses of a first language and text corpuses of a second language (bilingual corpus 210 includes bilingual data in which text in the first language is found, along with a translation of that text into a second language, [0041]); determining a plurality of sample words by segmenting sample text corpuses based on a preset algorithm of word segmentation (word alignment, [0039]); and determining an 
It would have been obvious to one of ordinary skill in the art to modify the invention of Prabhavalkar and Liu by acquiring sample text corpuses, wherein the sample text corpuses include text corpuses of a first language and text corpuses of a second language; determining a plurality of sample words by segmenting sample text corpuses based on a preset algorithm of word segmentation; and determining an occurrence probability that each sample word occurs in the sample text corpuses for reasons similar to those for claim 6.

Consider claim 15, Prabhavalkar and Liu do not, but Moore discloses determining the sample text corpuses comprises: acquiring first text corpuses of the first language, second text corpuses of the second language, and a correspondence between the first text corpuses and the second text corpuses (bilingual corpus 210, [0039]); selecting at least one first text sub-corpus from each of the first text corpuses (pairs of aligned text fragments, [0042]); determining a correspondence between the first text sub-corpus and the second text corpus (word alignment operates on aligned text fragments, [0044]); replacing each first text sub-corpus by a second text corpus based on the correspondence between the first text sub-corpuses and the second text corpuses to obtain mixed text corpuses (the word alignments are considered to “replace” the higher level text fragment alignments, since they are used in the final bilingual corpora for training, the text corpuses considered “mixed” since the word level alignments are not in series together, [0042]-[0044]); and determining the mixed text corpuses as the sample text corpuses (outputting the aligned text, [0047]).
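It would have been obvious to one of ordinary skill in the art to modify the invention of Prabhavalkar and Liu by acquiring first text corpuses of the first language, second text corpuses of the second language, and a correspondence between the first text corpuses and the second text corpuses; selecting at least one first text sub-corpus from each of the first text corpuses; determining a correspondence between the first text sub-corpus and the second text corpus; replacing each first text sub-corpus by a second text corpus based on the correspondence between the first text sub-corpuses and the second text corpuses to obtain mixed text corpuses; and determining the mixed text corpuses as the sample text corpuses for reasons similar to those for claim 6. 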


Allowable Subject Matter
Claims 4, 5, and 12 are objected to as being dependent on a rejected base claim, but would be allowable if rewritten in independent form including all limitations of the base and any intervening claims. Note: this assumes Applicant intended dependent claim 5 to depend on dependent claim 4 instead of independent claim 1.

The following is the examiner’s statement of reasons for indicating subject matter allowable over the prior art: 

Consider claim 4, the prior art does not fairly teach or suggest: “…acquiring a first correspondence, a second correspondence and a third correspondence; wherein the first correspondence is between a first speech feature of a first language and first text data, the second correspondence is between a second speech feature of a second language and second text data, the third correspondence is between first text sub-data and the second text data, and the first text sub-data is a part of data in the first text data; determining second text data corresponding to each piece of first text sub-data based on the third correspondence, after selecting a plurality of pieces of first text sub-data randomly from the first text data; replacing each piece of first text sub-data by the second text data 
 Dependent claim 12 recites similar features, and dependent claim 5 includes the allowable subject matter of intervening claim 4 (assuming Applicant intended claim 5 to depend on claim 4 rather than claim 1) by virtue of its dependency. 

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
US 20200312309 A1 Lin discloses speech recognition using a WFST network including an acoustic model, a dictionary, and a language model ([0073])
US 20180047385 A1 Jiang discloses a hybrid phoneme, diphone, morpheme, and word-level deep neural network speech recognition system
US 20090326945 A1 Tian discloses providing a mixed language entry speech dictation system
US 10854193 B2 Fu et al. discloses real-time speech recognition by truncating a sequence of features off the speech signal
US 9263036 B1 Graves discloses speech recognition using deep RNNs
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jesse Pullias whose telephone number is 571/270-5135. The examiner can normally be reached on M-F 8:00 AM - 4:30 PM. The examiner’s fax number is 571/270-6135.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Andrew Flanders, can be reached at 571/272-7516. 

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).


/Jesse S Pullias/
Primary Examiner, Art Unit 2655                                  03/17/22