DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119 (a)-(d). The certified copy has been filed in parent Application No. KR10-2019-0130900, filed on 10/21/2019.
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


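The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.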
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 12, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Kuo et al. (US20160253989A1)(hereinafter "Kuo"), Daniel Doulton (US20080133219A1)(hereinafter "Doulton"), and Kitade et al. (US20140074475A1)(hereinafter "Kitade").

With respect to claims 1, 12, and 13, Kuo teaches an artificial intelligence apparatus for recognizing speech by correcting misrecognized word, comprising: a microphone; and a processor configured to: obtain, via the microphone, speech data including speech of a user; (Par. 0135:"A user may enter commands and information into the system 1200 through input devices ... input devices [not shown] may include a microphone, ... input devices are connected to the processing unit ...", and Par. 0037:"… a set of audio inputs representing speech utterances, and a set of transcribed results that represent the known or correct speech recognition results associated with the audio inputs.").
Speech recognition components typically involve a large number of variables and modeling parameters", and Par. 0027:"… a speech recognition component 120 further includes an acoustic model component 124", and Par. 0028:"... a speech recognition component 120 further includes a language model component 126.", and Par. 0087:"... include taking a text transcription of an audio segment ...").
determine whether an uncertain recognition exists in an acoustic recognition result according to the acoustic model; (Par. 0089:"In at least some implementations, the internal lexicon of a speech recognition process [or “build”] specifies which words in a language can be recognized or spoken, and defines how an acoustic model expects a word to be pronounced [typically using characters from a single phonetic alphabet]. The one or more lexicon analysis operations at 983 may assess whether a particular recognition error may be attributable to one or more deficiencies of the lexicon of the acoustic model, and if so, optionally provides one or more recommendations to correct or modify the lexicon accordingly at 984.").
Kuo does not teach determine whether the converted text is a normal sentence by using a natural language processing model if an uncertain recognition exists in the acoustic recognition result; determine a sentence most similar to the converted text among sentences pre-learned by using the language model if the converted text is not a normal sentence; replace the converted text with the determined most similar sentence; and generate a speech recognition result corresponding to the speech data by using the converted text.
Doulton teaches determine whether the converted text is a normal sentence by using a natural language processing model if an uncertain recognition exists in the acoustic recognition result; (“… outputs the most likely text in the sense that the match between the features of the input speech and the corresponding models is optimized. In addition, however, ASR must also take into account the likelihood of occurrence of the recognizer output text in the target language [normal sentence]. As a simple example, “see you at the cinema at eight” is a much more likely text than “see you at the cinema add eight”, although analysis of the speech waveform would more likely detect ‘add’ than ‘at’ in common English usage. The study of the statistics of occurrence of elements of language is referred to as language modelling. It is common in ASR to use both acoustic modelling, referring to analysis of the speech waveform, as well as language modelling to improve significantly the recognition performance.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Kuo in view of Doulton to determine whether the converted text is a normal sentence by using a natural language processing model if an uncertain recognition exists in the acoustic recognition result, in order to improve their capability for discrimination between the various sounds of speech, as evidenced by Doulton. (See Par. 0014).

Neither Kuo nor Doulton teaches determine a sentence most similar to the converted text among sentences pre-learned by using the language model if the converted text is not a normal sentence; replace the converted text with the determined most similar sentence; and generate a speech recognition result corresponding to the speech data by using the converted text.
Kitade teaches determine a sentence most similar to the converted text among sentences pre-learned by using the language model if the converted text is not a normal sentence; (“… language model, and outputs a predetermined number of hypotheses.”, and Par. 0005:” … for correcting a recognition error section in speech recognition that includes: a first step of detecting a recognition error section from a recognition result sentence recognized by a speech recognition apparatus; a second step of searching for an example sentence similar to the recognition result sentence, in which the recognition error section has been detected in the first step, from the example corpus prepared in advance and extracting the alternatives corresponding to the recognition error section from each of the searched example sentences; and a third step of selecting the best candidate from the alternatives extracted in the second step.”).
replace the converted text with the determined most similar sentence; (Par. 0019:” a recognition result output unit that generates preformatted character string data by removing a word string, which has been determined to be removed or replaced with other data items by the conversion word determination unit, from the character string data or replacing the word string with other data items on the basis of the recognition result data and outputs the preformatted character string data as a speech recognition result of the speech data.”).
and generate a speech recognition result corresponding to the speech data by using the converted text. (Par. 0098:” and a recognition result output unit that generates preformatted character string data by removing a word string, which has been determined to be removed or replaced with other data items by the conversion word determination unit, from the character string data or replacing the word string with other data items on the basis of the recognition result data and outputs the preformatted character string data as a speech recognition result of the speech data.”).

Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Kuo and Doulton in view of Kitade to determine a sentence most similar to the converted text among sentences pre-learned by using the language model if the converted text is not a normal sentence; replace the converted text with the determined most similar sentence; and generate a speech recognition result corresponding to the speech data by using the converted text, in order to automatically analyze information when the character string data recognition result data is acquired, as evidenced by Kitade. (See Par. 0032).
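For illustration only, the "most similar sentence" replacement step recited above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Kitade or from the application: the function names `most_similar_sentence` and `correct_text` are hypothetical, the list of known sentences stands in for the "pre-learned" sentences, and word-sequence matching with Python's standard `difflib.SequenceMatcher` is swapped in for whatever similarity measure a language model would actually provide.

```python
from difflib import SequenceMatcher
from typing import Sequence


def most_similar_sentence(text: str, known_sentences: Sequence[str]) -> str:
    """Return the known sentence whose word sequence best matches `text`.

    Assumption: similarity is the difflib ratio over word lists, used here
    only as a stand-in for a language-model-based similarity.
    """
    if not known_sentences:
        raise ValueError("known_sentences must not be empty")
    words = text.lower().split()

    def similarity(candidate: str) -> float:
        return SequenceMatcher(None, words, candidate.lower().split()).ratio()

    return max(known_sentences, key=similarity)


def correct_text(text: str, is_normal: bool, known_sentences: Sequence[str]) -> str:
    """Keep a normal sentence; otherwise replace it with the closest known one."""
    return text if is_normal else most_similar_sentence(text, known_sentences)
```

With the Doulton example sentence, `correct_text("see you at the cinema add eight", False, ["see you at the cinema at eight", "turn on the lights"])` selects the first known sentence, since six of its seven words align with the input.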

With respect to claim 13, Kuo further teaches a non-transitory recording medium storing a program for a processor to perform a method for recognizing speech by correcting misrecognized word, the method comprising: (Par. 0003:"... an apparatus for diagnosing speech recognition errors may include at least one processing component, and one or more computer-readable media operably coupled to the at least one processing component. The one or more computer-readable media may bear one or more instructions that, when executed by the at least one processing component, perform operations including at least: performing one or more speech recognition operations").

Claims 2 and 3 are rejected under 35 U.S.C. 103 as being unpatentable over Kuo, Doulton, and Kitade as applied to claims 1 and 2, respectively, and further in view of Trawick et al. (US20180068653A1)(hereinafter "Trawick").

With respect to claims 2 and 3, Kuo, Doulton, and Kitade teach an artificial intelligence apparatus.
Regarding claim 2, Kuo, Doulton, and Kitade do not teach the artificial intelligence apparatus according to claim 1, wherein the processor is configured to: calculate probabilities corresponding to each phoneme for each predetermined window unit with respect to the speech data by using the acoustic model; calculate a word recognition reliability for each word included in the speech data by using at least one of a largest probability value (p1) among the calculated probabilities, a difference (p1-p2) between the largest probability value (p1) and a second largest probability value (p2) among the calculated probabilities, or an entropy corresponding to the calculated probabilities; and determine whether an uncertain recognition exists in the acoustic recognition result based on the calculated word recognition reliability.
Trawick teaches calculate probabilities corresponding to each phoneme for each predetermined window unit with respect to the speech data by using the acoustic model; (Par. 0021:” The ASR system then uses an acoustic model [such as deep neural networks [DNNs]] to determine a probability or acoustic score for each phoneme or a phoneme in context [such as a tri-phone]. The acoustic scores are then used in a decoder that has language models to construct words from the phonemes, and then construct word sequences or transcriptions [also referred to as utterances herein] out of the words, where each word and word sequence has a probability score as well. Thereafter, each output transcription or utterance is provided with a confidence score. The confidence scores are used to assess the confidence that the output is correct and are often compared to a threshold to determine whether the output is accurate [relative to the actually spoken words] and should be used or presented to a user, or inaccurate [not similar to the actually spoken words] and is to be rejected and will not be presented to a user.”)
calculate a word recognition reliability for each word included in the speech data by using at least one of a largest probability value (p1) among the calculated probabilities, a difference (p1-p2) between the largest probability value (p1) and a second largest probability value (p2) among the calculated probabilities, or an entropy corresponding to the calculated probabilities; (Par. 0037:” … the decoder 312 also may place the outputs of the n-best utterances onto a word lattice during decoding that provides confidence measures and/or alternative results….The WFST decoder 312 uses known specific rules, construction, operation, and properties for single-best or n-best speech decoding,”).
and determine whether an uncertain recognition exists in the acoustic recognition result based on the calculated word recognition reliability. (Par. 0021:” ... each output transcription or utterance is provided with a confidence score. The confidence scores are used to assess the confidence that the output is correct and are often compared to a threshold to determine whether the output is accurate [relative to the actually spoken words] and should be used or presented to a user, or inaccurate [not similar to the actually spoken words] and is to be rejected and will not be presented to a user.”).
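For illustration only, the three quantities recited in claim 2 (the largest probability p1, the difference p1-p2, and the entropy of the phoneme probabilities for one window) can be sketched as follows. This is a minimal sketch, assuming the acoustic model yields one probability per phoneme for each window; the function name `window_measures` and the dictionary keys are hypothetical and do not come from the application or from Trawick.

```python
import math
from typing import Dict, Sequence


def window_measures(probs: Sequence[float]) -> Dict[str, float]:
    """Return p1, the margin p1 - p2, and the Shannon entropy for one window.

    Assumption: `probs` holds the per-phoneme probabilities for a single
    window and contains at least two entries.
    """
    if len(probs) < 2:
        raise ValueError("need at least two phoneme probabilities")
    ranked = sorted(probs, reverse=True)
    p1, p2 = ranked[0], ranked[1]
    # Entropy in nats; zero-probability phonemes contribute nothing.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return {"p1": p1, "margin": p1 - p2, "entropy": entropy}
```

A confident window has a large p1, a large margin, and a low entropy; a window split evenly between two phonemes has a margin of zero and a higher entropy.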


Regarding claim 3, Kuo, Doulton, and Kitade do not teach the artificial intelligence apparatus according to claim 2, wherein the processor is configured to: calculate an average of p1 or an average of p1-p2 corresponding to phonemes included in a word for each word; and determine the calculated average as the word recognition reliability of the corresponding word.
Trawick teaches calculate an average of p1 or an average of p1-p2 corresponding to phonemes included in a word for each word; (Par. 0027:” … the average frame-based probability scores for each phoneme in an utterance is determined, and then the individual average phoneme scores determined this way are then summed and divided by the number of phonemes in the utterance to yield a confidence score for that utterance.”).
and determine the calculated average as the word recognition reliability of the corresponding word. (Par. 0021:” The ASR system then uses an acoustic model [such as deep neural networks [DNNs]] to determine a probability or acoustic score for each phoneme or a phoneme in context [such as a tri-phone]. The acoustic scores are then used in a decoder that has language models to construct words from the phonemes, and then construct word sequences or transcriptions [also referred to as utterances herein] out of the words, where each word and word sequence has a probability score as well. Thereafter, each output transcription or utterance is provided with a confidence score. The confidence scores are used to assess the confidence that the output is correct and are often compared to a threshold to determine whether the output is accurate [relative to the actually spoken words] and should be used or presented to a user, or inaccurate [not similar to the actually spoken words] and is to be rejected and will not be presented to a user.”, and Par. 0027:” The sum of the logs of the probabilities are then normalized by the number of frames [with one frame term per frame as mentioned] of an utterance in the sum [see equation [7] below as one example]. This provides a per-frame confidence score. Alternatively, the confidence score may be provided as a per-phoneme confidence score. In this case, the average frame-based probability scores for each phoneme in an utterance is determined, and then the individual average phoneme scores determined this way are then summed and divided by the number of phonemes in the utterance to yield a confidence score for that utterance. By one example form, phonemes that are not of interest do not contribute to the calculation; in particular, silence and non-speech noise phonemes are omitted in score calculations.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Kuo, Doulton, and Kitade in view of Trawick to calculate an average of p1 or an average of p1-p2 corresponding to phonemes included in a word for each word; and determine the calculated average as the word recognition reliability of the corresponding word.
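For illustration only, the averaging recited in claim 3 can be sketched as follows. This is a minimal sketch, assuming each word is represented by a list of windows and each window by its per-phoneme probabilities; the function name `word_reliability` and the `use_margin` flag are hypothetical and are not drawn from the application or from Trawick.

```python
from typing import Sequence


def word_reliability(windows: Sequence[Sequence[float]],
                     use_margin: bool = False) -> float:
    """Average p1 (or p1 - p2 when `use_margin` is True) over a word's windows.

    Assumption: `windows` is non-empty and each window is a non-empty list
    of phoneme probabilities; a single-entry window is treated as p2 = 0.
    """
    if not windows:
        raise ValueError("a word needs at least one window")
    scores = []
    for probs in windows:
        ranked = sorted(probs, reverse=True)
        p1 = ranked[0]
        p2 = ranked[1] if len(ranked) > 1 else 0.0
        scores.append(p1 - p2 if use_margin else p1)
    return sum(scores) / len(scores)
```

For a word spanning two windows with probabilities `[0.9, 0.1]` and `[0.6, 0.4]`, the average of p1 is 0.75 and the average of p1-p2 is approximately 0.5.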

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Kuo, Doulton, Kitade, and Trawick as applied to claim 2, and further in view of Gschwendtner et al. (US20030110030A1)(hereinafter "Gschwendtner").

With respect to claim 4, Kuo, Doulton, Kitade, and Trawick teach an artificial intelligence apparatus.
Regarding claim 4, Kuo, Doulton, Kitade, and Trawick do not teach the artificial intelligence apparatus according to claim 2, wherein the processor is configured to distinguish words included in the speech data from each other based on a blank or a silence.
Gschwendtner teaches wherein the processor is configured to distinguish words included in the speech data from each other based on a blank or a silence. (Par. 0045:” The speech recognition means 7 are arranged to recognize pauses in speech [silence] between two words and the first marking stage 12 is arranged to automatically mark corresponding audio segments AS of the spoken text GT with the pause marking information PMI in the marking table MT.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Kuo, Doulton, Kitade, and Trawick in view of Gschwendtner to distinguish words included in the speech data from each other based on a blank or a silence.
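For illustration only, distinguishing words based on a blank or a silence can be sketched as follows. This is a minimal sketch, assuming the recognizer emits one label per frame and marks pauses with a dedicated silence label; the function name `split_on_silence` and the `"<sil>"` label are hypothetical and are not drawn from the application or from Gschwendtner.

```python
from itertools import groupby
from typing import List, Sequence


def split_on_silence(frames: Sequence[str], silence: str = "<sil>") -> List[List[str]]:
    """Group consecutive non-silence frame labels into words.

    Assumption: any run of one or more `silence` labels is a word boundary.
    """
    return [
        list(group)
        for is_silence, group in groupby(frames, key=lambda label: label == silence)
        if not is_silence
    ]
```

For example, the frame sequence `["<sil>", "a", "b", "<sil>", "<sil>", "c"]` yields two words, `["a", "b"]` and `["c"]`.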

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Kuo, Doulton, Kitade, and Trawick as applied to claim 2, and further in view of Yasuda et al. (US20150332675A1)(hereinafter "Yasuda").

With respect to claim 5, Kuo, Doulton, Kitade, and Trawick teach an artificial intelligence apparatus.
Regarding claim 5, Kuo, Doulton, Kitade, and Trawick do not teach the artificial intelligence apparatus according to claim 2, wherein the processor is configured to determine, as an uncertainly recognized word, a word whose calculated word recognition reliability is smaller than a first reference value among the words. 
Yasuda teaches wherein the processor is configured to determine, as an uncertainly recognized word, a word whose calculated word recognition reliability is smaller than a first reference value among the words. (Par. 0008:” … a speech recognition section for analyzing the speech data so as to [i] identify a word or sentence included in the speech data and [ii] calculate a certainty of the word or sentence that has been identified; a response determining section for determining, in accordance with the certainty, whether it is necessary to ask back to a user or not; and an asking-back section for asking back to the user, in a case where the certainty is less than a first threshold and not less than a second threshold, the response certainty is less than the second threshold, the response determining section determining that the electronic apparatus is not going to ask back to the user ", and Par. 0100:" In a case where a certainty of a word or sentence is calculated by the speech recognition section 11a and this certainty is less than a first threshold [NO in S5], the control section 10 transmits, to the external device", and Par. 0103:" In a case where the certainty of the word or sentence has been received from the external device 200 and this certainty is less than the first threshold").
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Kuo, Doulton, Kitade, and Trawick in view of Yasuda to determine, as an uncertainly recognized word, a word whose calculated word recognition reliability is smaller than a first reference value among the words, in order to improve accuracy in speech recognition, as compared to a case where speech recognition is performed only by the electronic apparatus, as evidenced by Yasuda. (See Par. 0105).
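For illustration only, the comparison against a first reference value recited in claim 5 can be sketched as follows. This is a minimal sketch, assuming each word already carries a calculated reliability; the function name `uncertain_words` and the example threshold are hypothetical and are not drawn from the application or from Yasuda.

```python
from typing import List, Sequence, Tuple


def uncertain_words(scored_words: Sequence[Tuple[str, float]],
                    first_reference: float) -> List[str]:
    """Return the words whose reliability is smaller than `first_reference`.

    Assumption: `scored_words` pairs each recognized word with its
    reliability, in utterance order.
    """
    return [word for word, reliability in scored_words if reliability < first_reference]
```

With a reference value of 0.6, the input `[("see", 0.9), ("add", 0.4), ("eight", 0.6)]` flags only `"add"`, because the comparison is strict.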

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Kuo, Doulton, and Kitade as applied to claim 1, and further in view of Rangarajan et al. (US20200027452A1)(hereinafter "Rangarajan").

With respect to claim 11, Kuo, Doulton, and Kitade teach an artificial intelligence apparatus.

Rangarajan teaches wherein at least one of the acoustic model, the language model, or the natural language processing model is configured to include an artificial neural network, and is learned using a machine learning algorithm or a deep learning algorithm. (Par. 0033:”The second speech recognition engine 204 [also referred to as a second ASR engine and a second engine] is configured to identify the voice command 118 within the audio signal 114. In the illustrated example, the second speech recognition engine 204 includes a deep neural network to identify the voice command 118 within the audio signal 114. For example, the deep neural network functions as an acoustic model and a language model to identify the voice command 118 within the audio signal 114. A deep neural network is a form of an artificial neural network that includes multiple hidden layers between an input layer [e.g., the audio signal 114] and an output layer [the identified language and the dialect]. An artificial neural network is a type of machine learning model inspired by a biological neural network. For example, an artificial neural network includes a collection of nodes that are organized in layers to perform a particular function [e.g., to categorize an input]. Each node is trained [e.g., in an unsupervised manner] to receive an input signal from a node of a previous layer and provide an output signal to a node of subsequent layer. For example, the deep neural network of the second speech recognition engine 204 is trained on previous speech of the user, previous outputs of the first speech recognition engine 202, and previous outputs of the habits engine 206.”).

Allowable Subject Matter
Claims 6-10 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

	Claim 6 recites “the artificial intelligence apparatus according to claim 2, wherein the processor is configured to: generate intention information from the converted text by using the natural language processing model if an uncertain recognition exists in the acoustic recognition result; generate a dropout intention information set from the converted text by applying a dropout technique to the natural language processing model; calculate a ratio of dropout intention information that is the same as the generated intention information among pieces of dropout intention information included in the dropout intention information set; and determine whether the converted text is a normal sentence, based on the calculated ratio,” which is allowable over the prior art. The closest teachings to the indicated allowable subject matter are the references cited in the current Office action. One such prior art of the
	Claims 7-10 depend from claim 6 and are also allowable for substantially similar reasons.
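For illustration only, the ratio recited in claim 6 can be sketched as follows. This is a minimal sketch, assuming the natural language processing model has already produced one intention label without dropout and a set of labels from repeated passes with dropout enabled; the function names `agreement_ratio` and `is_normal_sentence` and the `min_ratio` cutoff are hypothetical, since the claim states only that the determination is based on the calculated ratio.

```python
from typing import Sequence


def agreement_ratio(intent: str, dropout_intents: Sequence[str]) -> float:
    """Fraction of dropout-pass intentions equal to the no-dropout intention."""
    if not dropout_intents:
        raise ValueError("dropout_intents must not be empty")
    same = sum(1 for candidate in dropout_intents if candidate == intent)
    return same / len(dropout_intents)


def is_normal_sentence(intent: str, dropout_intents: Sequence[str],
                       min_ratio: float = 0.8) -> bool:
    """Treat the text as a normal sentence when agreement reaches `min_ratio`.

    Assumption: a fixed cutoff is used; the claim does not specify one.
    """
    return agreement_ratio(intent, dropout_intents) >= min_ratio
```

If three of four dropout passes reproduce the original intention, the ratio is 0.75, which falls below the assumed cutoff of 0.8, so the text would not be treated as a normal sentence under this sketch.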


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. Takaaki et al. (JP2016206487A) teach a speech recognition result shaping device that performs speech recognition result shaping on the top N most likely speech recognition results, including the most likely speech recognition result.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DARIOUSH AGAHI whose telephone number is (408)918-7689.  
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached on 571-272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/D.A./Examiner, Art Unit 2656
/BHAVESH M MEHTA/Supervisory Patent Examiner, Art Unit 2656