DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-18 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.

As per Claim 1:
“the recognition” in the 3rd to last line of claim 1 lacks antecedent basis (i.e. no “a recognition”/“recognition” has been recited prior to “the recognition”).  At a minimum, it is not clear if “the recognition” is supposed to refer to “recognizing speech” in the preamble or to “converting the speech frame into a current text in real time”.

	As per Claim 2:
	“delaying a time of the speech endpoint detection” appears to have been intended to describe an intended result of continuing to detect a new speech frame (i.e. the 2nd limitation of claim 3).  It is, therefore, not clear whether “delaying a time of the speech endpoint detection” is an intended result of continuing to detect a new speech frame, or is a separate step relative to “saving the current text as a historical text” and “continuing to detect a new speech frame” (in which case, as in the 2nd limitation of claim 3, “and” should probably precede “delaying”).

	As per Claim 3:
	“the time of the speech endpoint detection” in lines 10-11 of claim 3 lacks antecedent basis (“a time of the speech endpoint detection” is recited in claim 2 but claim 3 depends on claim 1).
	Due to the issue discussed in the previous paragraph, it is also not clear if claim 3 is supposed to depend on claim 2.
“the recognition” in the 3rd to last line of claim 3 lacks antecedent basis (same issue pertaining to “the recognition” in claim 1).
“the valid intention slot” at the end of claim 3 is ambiguous.  Lines 7-8 of claim 3, lines 12-13 of claim 3, and lines 7-8 of claim 1 each recite “a valid intention slot”, and the 3 recitations of “a valid intention slot” do not need to refer to the same intention slot, such that it is not clear which recitation of “a valid intention slot” is the one that “the valid intention slot” at the end of claim 3 is supposed to refer to when any two or more of the 

As per Claim 4:
“the spliced current text and the historical text” is fairly clearly intended to refer to the result of splicing the current text and the historical text, but the plain meaning of this phrase refers to the current text which has been spliced, and “the historical text” (i.e. the original historical text).  At a minimum, it is not clear if Applicant meant to claim the plain meaning of this phrase or if Applicant meant to claim what Applicant has fairly clearly intended to claim.

As per Claim 6:
“the cache instruction corresponding to the text to be parsed” in the last 2 lines of claim 6 lacks antecedent basis.  At a minimum, “the text to be parsed” is matched to the correspondence relationship table, and not to the cache instruction.

Claims 7-10 and 12 include the issues of claims 1-4 and 6 respectively.
Claims 13-16 and 18 include the issues of claims 1-4 and 6 respectively.

The dependent claims include the issues of their respective parent claims.

Allowable Subject Matter
Claims 1, 7, and 13 would be allowable if rewritten or amended to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), 2nd paragraph, set forth in this Office action.
Claims 2-6, 8-12, and 14-18 would be allowable if rewritten to overcome the rejection(s) under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), 2nd paragraph, set forth in this Office action and to include all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:  
	As per Claim(s) 1 (and similarly claims 7 and 13, and consequently claims 2-6, 8-12, and 14-18 which depend on claims 1, 7, and 13), the prior art of record does not teach or suggest the combination of all limitations in claim(s) 1, including (i.e. in combination with the remaining limitations in claim[s] 1) A method for recognizing speech, the method comprising: in response to detecting a speech frame, converting the speech frame into a current text in real time; in response to there being no previously saved historical text, inputting the current text into a semantic parsing model to obtain a parsing result; in response to the parsing result including a valid intention slot, ending a speech endpoint detection to complete the recognition; and outputting an instruction corresponding to the valid intention slot.
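For illustration only, the sequence of conditions recited in claim 1 can be sketched as below. This is a minimal sketch under stated assumptions, not the Applicant's implementation: the function names, the `intent_slot` key, the toy parser, and the `instruction:` string format are all hypothetical, and the branch taken when historical text already exists is deliberately left unmodeled because claim 1 does not recite it.

```python
from typing import Callable, Optional


def recognize_step(
    current_text: str,
    historical_text: Optional[str],
    parse: Callable[[str], dict],
) -> Optional[str]:
    """One pass over text already converted from a detected speech frame.

    Returns the instruction to output, or None when recognition is not
    completed on this pass.
    """
    # Claim 1 conditions the parse on there being no previously saved
    # historical text; the other case is outside claim 1 and not modeled.
    if historical_text is not None:
        return None

    # Input the current text into the semantic parsing model.
    result = parse(current_text)

    # "intent_slot" is a hypothetical key for the valid intention slot.
    slot = result.get("intent_slot")
    if slot is None:
        return None  # no valid intention slot; endpoint detection continues

    # A valid slot ends speech endpoint detection, which completes the
    # recognition; the corresponding instruction is then output.
    return f"instruction:{slot}"


def toy_parser(text: str) -> dict:
    """Hypothetical stand-in for a semantic parsing model."""
    known = {"open window": "open_window"}
    slot = known.get(text.strip().lower())
    return {"intent_slot": slot} if slot else {}
```

In this sketch, `recognize_step("open window", None, toy_parser)` yields an instruction, while an incomplete text such as "open" does not, which corresponds to the ordering of conditions discussed with respect to the references below.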
JP 3962904 B2 (X reference in JP search report) teaches “The speech recognition system is provided with a means 5 for changing a recognition dictionary by decomposing the command language in a recognition dictionary 4 into prescribed units such as units of word and syllable and a holding means 6 for holding a recognition specifically describe where a command/intent is detected in a single recognition result in the holding unit.  Paragraphs 21-24 of the Google translation describe where one recognition result (i.e. the single word “file”) is checked for acceptance by an original recognition dictionary, and is not accepted in a network grammar, and then the combination of “file” and another recognition result “open” is determined.  This reference also does not appear to specifically describe where the checking of a first word stored in a holding means is checked in response to there being no historical text (i.e. the check is performed no matter what once the most recent recognition result is stored)
2018/0143967 describes where a command can be a single word that is used to provide intent information (paragraph 37, Figure 2).
	The prior art teaches/suggests the following concepts:
I. Storing/buffering text until a complete command is identified.
6813603 teaches “Referring now to FIGS. 14A and 14B, the populate form procedure 72 displays the selected form or report 310 and places the insertion pointer to the left of the field terminator for the first field, even if the first field already stores text data 312. The user may speak a sequence of commands and entry field data or values at any time. The user may navigate to different entry fields of the form in any order. As described above, as the user speaks, the speech engine 68 generates text in accordance with the speech and stores the text in the current text buffer. The populate form procedure 72 scans the text in the current text buffer until a complete command is identified” (col. 11, lines 45-57).  This reference’s text buffer appears to be the text stored in a field (col. 4, line 66 - col. 5, line 2).  Figure 12 describes where a buffer is checked until a complete command is detected, and then the command is removed from the text buffer and is processed.  Figure 11 describes where text stored in the text buffer is generated from audio speech.  Col. 9, lines 35-41 describes where a current text buffer change is “with each addition of a character”.  Col. 10, lines 1-22 further describes where text buffer changes are character by character.  This reference does not read on the independent claims because the character-by-character updating of the buffer necessarily means that at the time that a command is detected (and shortly before), there is previously saved historical text relative to the command text (at a minimum, the command text minus one character).
2020/0020334 describes where an electronic device stores text (suggested to be speech recognition text from a server, see e.g. Figures 6A-6D) until a designated command is entered or a designated time elapses (paragraph 174).  It is not clear if the designated command is any command, because paragraphs 150, 153, and 200 appear to describe where a voice command terminates a continuous command mode.
2021/0082397 (LATE filing date) teaches “In some embodiments, the dialog system may refrain from performing one or more steps in the processing pipeline at a given point in the dialog. For example, a first utterance does not include sufficient information to determine an intent. In such a case, the dialog system may continue to process the following utterance(s), along with the first utterance, in a continuous stream until the overall intent is determined. As a specific example, the dialog system executes ASR and NLP on a first utterance, “I was wondering . . . [pause] . . .” and waits for a second utterance, “if you sell hats,” before executing ASR and NLP on the second utterance, and processing both utterances together to determine an overall intent for the two utterances, Check_Stock”.  This reference does not qualify as prior art.
2017/0221475 describes storing audio signals until a command is completed, or until a correct transcription of an entity name is determined (paragraph 28).
II. Semantic parsing to determine intent and using a semantic model to perform parsing.
2011/0320187 teaches “mapping the natural language question provided as an input into one or more deductive database queries that capture one or more intents behind the natural language question further includes performing a semantic parsing of 
6138100 teaches “This language model is employed by the speech recognition module 214 to produce an output string that is comprised of a word or phrase sequence as governed by the language (syntax) model. The output string of the speech recognition module serves as input to the language understanding component of the Natural Language module 216--a parser in this module, governed by the language understanding (semantics) rules or models to produce a parse tree (parse trees for ambiguous input) for the input string. The parse tree is then translated into an expression (or set of expressions) which represents the meaning of the input string and which is interpretable by CPU 224” (col. 6, lines 25-44).  This reference suggests where a speech recognition output string (at least suggested to be text) is provided to semantics models to produce a parse tree (suggested to be a result of parsing the speech recognition output string).
2017/0178627 teaches “The dialog system 104 can receive the recognized speech from the ASR module 102. The dialog system 104 can interpret the recognized speech to identify what the speaker wants. For example, the dialog system 104 can include a parser for parsing the recognized speech and an intent classifier for identifying 
	6865528 teaches “Although speech recognition systems have been used in the past to simply provide textual output corresponding to a spoken utterance, there is a desire to use spoken commands to perform various actions with a computer. Typically, the textual output from the speech recognition system is provided to a natural language parser, which attempts to ascertain the meaning or intent of the utterance in order to perform a particular action. This structure therefore requires creation and fine-tuning of the speech recognition system as well as creation and fine-tuning of the natural language parser, both of which can be tedious and time consuming” (col. 1, line 59 – col. 2, line 2).  This reference appears to teach away from determining intent of speech recognition text using a natural language parser.
III. Providing a final recognition result made of partial hypotheses in response to a whole sentence being determined, by a language model, to be grammatically complete, and determining, via semantic processing, whether an utterance is complete, and continuing to listen to a user until a parameter list is full.
Froelich (US 2017/0256261) teaches where an ASR system emits partial hypotheses until a language model determines that a whole sentence is grammatically complete and emits a final result, and if the speaker keeps talking a new partial response will begin (paragraph 62).  In this reference, there does not appear to be a in response to there being no previous partial results (no historical text).
	2016/0148610 teaches “In one use case, even if a threshold amount of silence is reached while a user is speaking, the recognized words of portions of the utterance that the user has spoken may be utilized during the semantic processing to determine whether the utterance is actually complete (e.g., whether the user has finished speaking what he/she intends to communicate). In a further use case, natural language processing instructions 122 may identify a command before the end of the utterance is detected, and continue listening until the command's parameter list is full. However, if the threshold amount of silence is reached, the system may prompt the user for additional input relating to the parameters (e.g., if a city parameter has not yet been communicated, the user may be prompted to identity a city)” (paragraph 57).
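For illustration only, the behavior quoted from paragraph 57 of 2016/0148610 (continue listening until the command's parameter list is full, and prompt the user if the silence threshold is reached first) can be sketched as a single decision function. This is a sketch under assumptions: the function name, the parameter names, and the three action labels are hypothetical and do not come from the reference.

```python
from typing import Dict, List, Tuple


def next_action(
    required: List[str],
    filled: Dict[str, str],
    silence_reached: bool,
) -> Tuple[str, List[str]]:
    """Decide what to do after a command has been identified mid-utterance.

    Returns an action label and the parameters that are still missing.
    """
    missing = [name for name in required if name not in filled]
    if not missing:
        # Parameter list is full: the command can be carried out.
        return ("execute", [])
    if silence_reached:
        # Threshold amount of silence reached with parameters still
        # missing: prompt the user for them (e.g. a city).
        return ("prompt", missing)
    # Otherwise keep listening for the remaining parameters.
    return ("listen", missing)
```

For example, a hypothetical weather command requiring `city` and `date` with only `date` filled would keep listening while the user is speaking and would prompt for `city` once the silence threshold is reached.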
	IV. Other relevant references.
	(US 2019/0378493, European Search Report X reference) (paragraphs 72, 80, 115) This reference appears to be directed to where endpoint detection time is an amount of time elapsed from the start of a most recent voice input (see e.g., paragraphs 72-75) and also where a user may speak a particular word to either end voice input or continue voice input, including when the user does not think of an exact word, user's intent is not clear, or the user does not know what function can be performed by a voice service (paragraph 78) and where a voice command activates/inactivates a voice service (paragraph 80) and where text of a last word is used to determine that voice input is completed if the last word is not a particular word, and is used to determine that end point detection time should be extended if the last word is a particular word semantic parsing model in response to there being no previously saved historical text.  A speech end/continuation word in the text is presumably preceded by some text of previous spoken words since it would be pointless to end speech when the user has not spoken anything yet, and it would not be a continuation of speech if nothing was spoken yet such that there is nothing to continue. 
2011/0313768 teaches “If the spoken command is not recognized as a full phrase voice command, then the system determines at step 308 whether the spoken command is a partial phrase voice command. If so, then the system continues to listen at step 310 in an "active listening" mode for further voice commands. When an end phrase is spoken and recognized in step 314, then the system proceeds to execute the command in step 306. If an end phrase is not spoken, then the system checks to see at step 316 to see if the partial phrase is part of a valid command. If so, the system returns to the active listening mode in step 310. If the spoken command is not recognized as a partial phrase voice command, then after a brief timeout at step 318, the system returns to the passive listening mode at step 302” (paragraph 86).  Paragraph 86 similarly does not describe where end point detection occurs when a valid intent is identified in response to there being no historical text, since it would be pointless to end speech when the user has not spoken anything yet, and it would not be a continuation of speech if nothing was spoken yet such that there is nothing to continue. Paragraph 85 describes where a system returns to a passive listening state in response to detecting a full phrase voice command, which suggests (together with paragraph 86’s teaching of an “active listening” mode) that the system is in an active listening mode while listening specifically describe where the text of the full phrase voice command is input to a semantic parsing model in response to there being no previously saved historical text (no checking of whether there is or is not saved historical text is performed).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC YEN whose telephone number is (571)272-4249. The examiner can normally be reached M-F 12:00 PM - 8:30 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, RICHEMOND DORVIL, can be reached on (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is 





EY 1/24/2022
/ERIC YEN/Primary Examiner, Art Unit 2658