DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
In response to the Office Action mailed 1/27/2022, applicant has submitted an amendment filed 4/27/2022.
Claims 1-4, 6-10, 12-16, and 18 have been amended.
Allowable Subject Matter
Claims 1-18 are allowed.
The following is an examiner’s statement of reasons for allowance:

As per claim 1 (and similarly claims 7 and 13, and consequently claims 2-6, 8-12, and 14-18, which depend on claims 1, 7, and 13), the prior art of record does not teach or suggest the combination of all limitations in claim 1, including (i.e., in combination with the remaining limitations in claim 1): A method for recognizing speech, the method comprising: in response to detecting a speech frame, converting the speech frame into a current text in real time; in response to there being no previously saved historical text, inputting the current text into a semantic parsing model to obtain a parsing result; in response to the parsing result including a valid intention slot, ending a speech endpoint detection to complete the recognition; and outputting an instruction corresponding to the valid intention slot.
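For illustration only, the control flow recited in claim 1 may be sketched as follows. This is a minimal sketch and not the applicant's implementation: the names `Recognizer`, `toy_semantic_parser`, and `on_speech_frame_text` are hypothetical, the dictionary lookup merely stands in for a semantic parsing model, and the fallback of saving the text as historical text is an assumption that is not recited in the claim language quoted above.

```python
from typing import Callable, Dict, List, Optional

ParseResult = Dict[str, Optional[str]]


def toy_semantic_parser(text: str) -> ParseResult:
    """Hypothetical stand-in for a semantic parsing model (simple lookup)."""
    intents = {"play music": "PLAY_MUSIC", "stop": "STOP_PLAYBACK"}
    return {"intention_slot": intents.get(text.strip().lower())}


class Recognizer:
    def __init__(self, parse: Callable[[str], ParseResult] = toy_semantic_parser) -> None:
        self.parse = parse
        self.historical_text: List[str] = []    # previously saved text, if any
        self.endpoint_detection_active = True   # speech endpoint detection running

    def on_speech_frame_text(self, current_text: str) -> Optional[str]:
        """Handle the current text converted from a detected speech frame."""
        if not self.historical_text:
            # No previously saved historical text: parse the current text.
            slot = self.parse(current_text).get("intention_slot")
            if slot is not None:
                # Valid intention slot: end endpoint detection and output
                # the corresponding instruction.
                self.endpoint_detection_active = False
                return slot
        # Assumption (not in the quoted claim): otherwise save the text
        # and keep listening.
        self.historical_text.append(current_text)
        return None
```

In this sketch, the check for historical text is an explicit condition that gates the semantic parsing step, which is the feature the references discussed below do not appear to describe.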
JP 3962904 B2 (X reference in JP search report) teaches “The speech recognition system is provided with a means 5 for changing a recognition dictionary by decomposing the command language in a recognition dictionary 4 into prescribed units such as units of word and syllable and a holding means 6 for holding a recognition results recognized by a speech recognition means 3. The recognition means 3 recognizes the speech uttered by a user based on the recognition dictionary changed by the changing means 5 and judges whether or not plurality of combinations of recognition results becomes the command language registered in the recognition dictionary before change when a plurality of the recognition results are held in the holding means 6. Thus, the command language can be correctly recognized even in the case that the command language is uttered by being divided into a plurality of speeches such as the case that the user utters the command language at intervals” (see Abstract of Google Translation).  This reference appears to describe where multiple recognition results are stored in a holding means (suggested to be a buffer) and determining whether the combination of recognition results matches a registered recognition dictionary command (see paragraph 16 of Google Translation).  Once a combination of recognition results is determined to be a command word, the contents of the holding unit are emptied (see paragraph 17 of Google Translation).  This reference does not appear to specifically describe where a command/intent is detected in a single recognition result in the holding unit.  Paragraphs 21-24 of the Google Translation describe where one recognition result (i.e., the single word “file”) is checked for acceptance by an original recognition dictionary, and is not accepted in a network grammar, and then the combination of “file” and another recognition result “open” is determined.
This reference also does not appear to specifically describe where a first word stored in the holding means is checked in response to there being no historical text (i.e., the check is performed unconditionally once the most recent recognition result is stored).
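The holding-means behavior attributed to JP 3962904 B2 above may be sketched roughly as follows. This is an illustrative sketch only: the class name `HoldingRecognizer`, the command set, and the joining of held results with a space are assumptions made for the example and are not taken from the reference.

```python
from typing import Iterable, List, Optional


class HoldingRecognizer:
    """Holds recognition results and checks their combination against commands."""

    def __init__(self, commands: Iterable[str]) -> None:
        self.commands = set(commands)   # commands registered before decomposition
        self.held: List[str] = []       # the "holding means" (suggested buffer)

    def on_result(self, result: str) -> Optional[str]:
        # The newest result is always stored and the combination is always
        # checked; nothing here branches on whether the holder was empty.
        self.held.append(result)
        candidate = " ".join(self.held)  # assumption: results joined by a space
        if candidate in self.commands:
            self.held.clear()            # holder emptied once a command is found
            return candidate
        return None
```

With the hypothetical command “file open”, the result “file” alone is held without a match, and the later result “open” completes the command and empties the holder.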
2018/0143967 describes where a command can be a single word that is used to provide intent information (paragraph 37, Figure 2).
	The prior art teaches/suggests the following concepts:
I. Storing/buffering text until a complete command is identified.
6813603 teaches “Referring now to FIGS. 14A and 14B, the populate form procedure 72 displays the selected form or report 310 and places the insertion pointer to the left of the field terminator for the first field, even if the first field already stores text data 312. The user may speak a sequence of commands and entry field data or values at any time. The user may navigate to different entry fields of the form in any order. As described above, as the user speaks, the speech engine 68 generates text in accordance with the speech and stores the text in the current text buffer. The populate form procedure 72 scans the text in the current text buffer until a complete command is identified” (col. 11, lines 45-57).  This reference’s text buffer appears to be the text stored in a field (col. 4, line 66 - col. 5, line 2).  Figure 12 describes where a buffer is checked until a complete command is detected, and then the command is removed from the text buffer and is processed.  Figure 11 describes where text stored in the text buffer is generated from audio speech.  Col. 9, lines 35-41 describes where a current text buffer change is “with each addition of a character”.  Col. 10, lines 1-22 further describes where text buffer changes are character by character.  This reference does not read on the independent claims because the character-by-character updating of the buffer necessarily means that at the time that a command is detected (and shortly before), there is previously saved historical text relative to the command text (at a minimum, the command text minus one character).
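The reasoning in the last sentence above may be illustrated with a short sketch. It assumes a hypothetical command vocabulary and helper names (`COMMANDS`, `scan_for_command`, `feed_characters`), none of which come from 6813603; it only shows that, when a buffer grows one character at a time, the buffer is never empty immediately before a multi-character command is detected.

```python
from typing import Optional, Tuple

COMMANDS = ("next field", "save form")  # hypothetical command vocabulary


def scan_for_command(buffer: str) -> Optional[str]:
    """Return a command if the buffer currently ends with a complete one."""
    for command in COMMANDS:
        if buffer.endswith(command):
            return command
    return None


def feed_characters(text: str) -> Optional[Tuple[str, str]]:
    """Append text one character at a time and scan after each addition.

    Returns (command, buffer contents just before the detecting character),
    or None if no complete command is ever found.
    """
    buffer = ""
    for ch in text:
        previous = buffer        # text already saved before this character
        buffer += ch
        command = scan_for_command(buffer)
        if command is not None:
            return command, previous
    return None
```

Feeding the text “save form” detects the command on the final character, at which point the previously saved buffer contents are “save for”, i.e., the command text minus one character.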
2020/0020334 describes where an electronic device stores text (suggested to be speech recognition text from a server, see, e.g., Figures 6A-6D) and continues storing the text until a designated command is entered or a designated time elapses (paragraph 174).  It is not clear if the designated command is any command, because paragraphs 150, 153, and 200 appear to describe where a voice command terminates a continuous command mode.
2021/0082397 (LATE filing date) teaches “In some embodiments, the dialog system may refrain from performing one or more steps in the processing pipeline at a given point in the dialog. For example, a first utterance does not include sufficient information to determine an intent. In such a case, the dialog system may continue to process the following utterance(s), along with the first utterance, in a continuous stream until the overall intent is determined. As a specific example, the dialog system executes ASR and NLP on a first utterance, “I was wondering . . . [pause] . . .” and waits for a second utterance, “if you sell hats,” before executing ASR and NLP on the second utterance, and processing both utterances together to determine an overall intent for the two utterances, Check_Stock”.  This reference does not qualify as prior art.
2017/0221475 describes storing audio signals until a command is completed, or until a correct transcription of an entity name is determined (paragraph 28).
II. Semantic parsing to determine intent and using a semantic model to perform parsing.
2011/0320187 teaches “mapping the natural language question provided as an input into one or more deductive database queries that capture one or more intents behind the natural language question further includes performing a semantic parsing of the natural language question by mapping the natural language question into one or more semantic hypergraphs that capture one or more meanings behind the natural language question and performing an intent detection of one or more semantic hypergraphs by transforming the one or more semantic hypergraphs into one or more deductive database queries that capture the one or more intents behind the natural language question” (paragraph 15).  This reference describes performing intent detection of results of semantic parsing of a natural language question.
6138100 teaches “This language model is employed by the speech recognition module 214 to produce an output string that is comprised of a word or phrase sequence as governed by the language (syntax) model. The output string of the speech recognition module serves as input to the language understanding component of the Natural Language module 216--a parser in this module, governed by the language understanding (semantics) rules or models to produce a parse tree (parse trees for ambiguous input) for the input string. The parse tree is then translated into an expression (or set of expressions) which represents the meaning of the input string and which is interpretable by CPU 224” (col. 6, lines 25-44).  This reference suggests where a speech recognition output string (at least suggested to be text) is provided to semantics models to produce a parse tree (suggested to be a result of parsing the speech recognition output string).
2017/0178627 teaches “The dialog system 104 can receive the recognized speech from the ASR module 102. The dialog system 104 can interpret the recognized speech to identify what the speaker wants. For example, the dialog system 104 can include a parser for parsing the recognized speech and an intent classifier for identifying intent from the parsed recognized speech” (paragraph 25).  This reference suggests where recognized speech (suggested to be words/text) is parsed and then intent is identified from the parsed recognized speech (suggested to be a result of parsing the recognized speech).
	6865528 teaches “Although speech recognition systems have been used in the past to simply provide textual output corresponding to a spoken utterance, there is a desire to use spoken commands to perform various actions with a computer. Typically, the textual output from the speech recognition system is provided to a natural language parser, which attempts to ascertain the meaning or intent of the utterance in order to perform a particular action. This structure therefore requires creation and fine-tuning of the speech recognition system as well as creation and fine-tuning of the natural language parser, both of which can be tedious and time consuming” (col. 1, line 59 – col. 2, line 2).  This reference appears to teach away from determining intent of speech recognition text using a natural language parser.
III. Providing a final recognition result made of partial hypotheses in response to a whole sentence being determined, by a language model, to be grammatically complete, and determining, via semantic processing, whether an utterance is complete, and continuing to listen to a user until a parameter list is full.
Froelich (US 2017/0256261) teaches where an ASR system emits partial hypotheses until a language model determines that a whole sentence is grammatically complete and emits a final result, and if the speaker keeps talking a new partial response will begin (paragraph 62).  In this reference, there does not appear to be a possibility of a complete command/intent being detected in a new partial result in response to there being no previous partial results (no historical text).
	2016/0148610 teaches “In one use case, even if a threshold amount of silence is reached while a user is speaking, the recognized words of portions of the utterance that the user has spoken may be utilized during the semantic processing to determine whether the utterance is actually complete (e.g., whether the user has finished speaking what he/she intends to communicate). In a further use case, natural language processing instructions 122 may identify a command before the end of the utterance is detected, and continue listening until the command's parameter list is full. However, if the threshold amount of silence is reached, the system may prompt the user for additional input relating to the parameters (e.g., if a city parameter has not yet been communicated, the user may be prompted to identity a city)” (paragraph 57).
	IV. Other relevant references.
	US 2019/0378493 (European Search Report X reference; paragraphs 72, 80, 115) appears to be directed to where endpoint detection time is an amount of time elapsed from the start of a most recent voice input (see, e.g., paragraphs 72-75).  It also describes where a user may speak a particular word to either end voice input or continue voice input, including when the user does not think of an exact word, the user's intent is not clear, or the user does not know what function can be performed by a voice service (paragraph 78); where a voice command activates/inactivates a voice service (paragraph 80); and where text of a last word is used to determine that voice input is completed if the last word is not a particular word, and is used to determine that end point detection time should be extended if the last word is a particular word (paragraph 115).  This reference does not appear to input the text to a semantic parsing model in response to there being no previously saved historical text.  A speech end/continuation word in the text is presumably preceded by some text of previous spoken words, since it would be pointless to end speech when the user has not spoken anything yet, and it would not be a continuation of speech if nothing was spoken yet such that there is nothing to continue.
2011/0313768 teaches “If the spoken command is not recognized as a full phrase voice command, then the system determines at step 308 whether the spoken command is a partial phrase voice command. If so, then the system continues to listen at step 310 in an "active listening" mode for further voice commands. When an end phrase is spoken and recognized in step 314, then the system proceeds to execute the command in step 306. If an end phrase is not spoken, then the system checks to see at step 316 to see if the partial phrase is part of a valid command. If so, the system returns to the active listening mode in step 310. If the spoken command is not recognized as a partial phrase voice command, then after a brief timeout at step 318, the system returns to the passive listening mode at step 302” (paragraph 86).  Paragraph 86 similarly does not describe where end point detection occurs when a valid intent is identified in response to there being no historical text, since it would be pointless to end speech when the user has not spoken anything yet, and it would not be a continuation of speech if nothing was spoken yet such that there is nothing to continue. Paragraph 85 describes where a system returns to a passive listening state in response to detecting a full phrase voice command, which suggests (together with paragraph 86’s teaching of an “active listening” mode) that the system is in an active listening mode while listening to the full phrase voice command, which also suggests where, upon recognizing a valid command/intent a speech endpoint is detected and the command is executed and the system returns from active listening to passive listening.  If the system was in passive mode prior to receiving/recognizing the full phrase voice command, it is suggested that there was no speech (and thus no recognized text if the speech recognizer is a speech-to-text recognizer) prior to the full phrase voice command.  
This reference does not specifically describe where the text of the full phrase voice command is input to a semantic parsing model in response to there being no previously saved historical text (no checking of whether there is or is not saved historical text is performed).

	Upon further search and consideration (in response to the amendment filed 4/27/2022):
CN 107146602 A (cited in IDS and First Office Action for CN 202010143037) appears to describe “judging the voice recognition preserved is not present” and then “judge whether current speech identification information has complete It is semantic” and “If so, current speech identification information then is defined as into voice identification result” (see page 3 of Google Translation), which appears to suggest where historical voice recognition information is determined to not be present and then current speech recognition information is checked to see if it has a complete semantic meaning, and if so, the current speech recognition information is defined as a speech recognition result.  The last 5 sentences of page 3 of the Google Translation of CN 107146602 appear to describe where determining whether voice recognition information is a complete semantic entity is based on semantic parsing, and where the semantic parsing result is matched to an intention (which suggests where checking to see if current speech recognition information has a complete semantic meaning is performed by semantic parsing and determining whether the semantic parsing result matches a stored intent).  This reference does not appear to clearly/specifically describe ending speech endpoint detection to complete recognizing the speech in response to the semantic parsing result including a valid intention (determining that current speech recognition information that has a complete semantic meaning is a speech recognition result suggests detecting that no more speech recognition is needed, but this does not necessarily mean that speech endpoint detection has ended).
2021/0104236 teaches “acquire an audio speech segment associated with a user utterance; convert the audio speech segment into a text segment; determine an intent based on a text string associated with the text segment, wherein the text string represents a portion of the user utterance; and generate a response based on the intent prior to when the user utterance completes” (paragraph 108).  This reference appears to describe where an audio speech segment (a speech “frame”) is converted into a text segment and is determined to correspond to an intent (which suggests where an intent can be found in a speech frame).
2020/0135182 teaches “Background noise, poor annunciation, and the like may make it difficult for an electronic digital assistant to determine a meaning or intent of a speech segment (for example, a name, a number, or an address) associated with an audio query submitted by a user with absolute certainty. As a result, the electronic digital assistant may generate a plurality of possible meanings for the speech segment and assign a probability to each possible meaning of the plurality of possible meanings. For example, the electronic digital assistant determines the probability for each possible meaning based on speech recognition algorithms (in other words, voice analytics). As another example, the electronic digital assistant determines the probability for each possible meaning based on speech recognition algorithms and as a function of context information such as recognition of similar speech segments included in recent previous audio queries. For example, the electronic digital assistant may determine that a possible meaning of “darts” for a speech segment has a higher probability than a possible meaning of “tarts” when recent previous queries related to sports or games. On the other hand, the electronic digital assistant may determine that a possible meaning of “tarts” for a speech segment has a higher probability than a possible meaning of “darts” when recent previous queries related to baked goods. The probability represents the likelihood that the possible meaning is the correct meaning (or the speaker's intended meaning) of the speech segment. The electronic digital assistant may generate a response such as a clarifying question to determine whether the possible meaning associated with the highest probability (for example, as determined by speech recognition algorithms) is the correct meaning for the speech segment” (paragraph 10).  This reference appears to describe where a speech segment (i.e. a speech “frame”) can have a meaning or an intent.  
This reference appears to be directed to determining what word/text is spoken in a speech frame, and not to whether word(s) that are determined to be a current speech recognition text have a valid intent.
2019/0378493 teaches “The processor may be configured to analyze a user's intent to end a speech based on at least one of context information of the electronic device, characteristic information of a user, whether an additional voice input is detected, whether a predetermined word is detected, whether a predetermined gesture is detected, or whether a sentence is completed” (paragraph 102, see also paragraph 232).  This reference appears to refer to a user’s intent to end voice input, and not to where endpoint detection is ended based on detecting a valid intent.
	2019/0139566 (published less than 1 year before effective date, commonly owned by Baidu) teaches “At block 304, the semantic analysis is performed on the recognized result and it is determined whether the semantic integrity of the recognized result is satisfied, if yes, act in block 305 is performed; otherwise, act in block 306 is performed” (paragraph 61) and “In this embodiment, in order to ensure the accuracy of the recognized result, when the first time duration T1 reaches the second time period T12, the cloud may determine whether the speech to be recognized ends by determining whether the semantic integrity of the recognized result is satisfied, i.e., by determining whether the semanteme of the recognized result is integrated. Specifically, when the currently counted first time duration T1 reaches the second time period T12, the semantic analysis is performed on the recognized result. For example, the semantic analysis may be performed on the recognized result using the prior art, such that it may be determined whether the semanteme of the recognized result is integrated. When the semanteme of the recognized result is integrated, it may be determined that the speech to be recognized ends. However, when the semanteme of the recognized result is not integrated, act in block 306 is triggered” (paragraph 62).  This reference does not qualify as prior art.
	
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC YEN whose telephone number is (571)272-4249. The examiner can normally be reached M-F 12:00PM -8:30PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, RICHEMOND DORVIL can be reached on (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





EY 4/30/2022
/ERIC YEN/           Primary Examiner, Art Unit 2658