DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
Claims 1, 2 and 4 are objected to because of the following informalities:  
Independent claim 1 includes extraneous punctuation of a dash “–” in the phrase “determining – through semantic understanding”, which should be deleted.
Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 to 2, 4, 6 to 9, and 11 to 12 are rejected under 35 U.S.C. 103 as being unpatentable over Froelich (U.S. Patent Publication 2017/0256261) in view of Klisch et al. (U.S. Patent Publication 2002/0042709).
Concerning independent claims 1 and 6 to 7, Froelich discloses a method, system, and computer-readable medium for speech recognition, comprising:
“a content server obtaining a user’s speech information from a client device, and . . . completing the speech interaction in a first manner” – an aim is to enable a user to 
“the first manner comprises: sending the speech information to an automatic speech recognition server and obtaining a partial speech recognition result returned by the automatic speech recognition server each time” – ASR system 32 is present on a server device of remote system 8 (“an automatic speech recognition server”) (¶[0052] and ¶[0060]: Figure 4); ASR system 32 provides continuous recognition, which means that as the user 4 starts speaking the ASR system starts to emit partial hypotheses (“a partial speech recognition result”) on what is being recognized; if the speaker keeps talking a new partial response will begin (“each time”); the partial results are in the form 
“after determining that voice activity detection starts” – speech detector 44 uses the output of speech recognition service 30 to detect speech activity (“determining that voice activity starts”), i.e., in switching between a currently speaking and a currently non-speaking state; following an interval of speech inactivity, an interval of speech activity commences in response to identifying at least one individual word in voice input during an interval of speech inactivity (¶[0089] - ¶[0090]: Figure 4); “voice activity detection starts”, then, after identifying at least one individual word by speech recognition service 30;
“and in response to determining through semantic understanding that the obtained partial speech recognition result already includes entire content that the user hopes to express, taking the partial speech recognition result as a final speech recognition result without waiting for an end of the voice activity detection” – the partial hypotheses (“the obtained partial speech recognition result”) continue to be emitted until language model 34 determines that a whole sentence is grammatically complete (“entire content”) and emits a final result (“a final speech recognition result”) (¶[0062]: Figure 4); language model 34 applies a set of grammatical rules to the provisional set of words 52 to determine additional information about the semantic content (“semantic understanding”) (¶[0071]: Figure 5A); an additional function of language model 34 is one of detecting a grammatically complete sentence (“entire content that the user hopes to express”) (¶[0074]: Figure 5A); in response to detecting a grammatically complete sentence, language model 34 makes a final decision on the sequence of words spoken i.e., switching between a currently speaking and a currently non-speaking state (¶[0089]: Figure 6B); following a period of speech activity, an interval of speech inactivity commences in response to a final result 52F being outputted by language model 34, triggered by detecting a condition indicative of speech inactivity, i.e., the language model 34 detecting a grammatically complete sentence (“without waiting for an end of the voice activity detection”) (¶[0091] - ¶[0092]: Figure 6B); language model 34 provides “semantic understanding” because it determines if semantic content of voice input is a grammatically complete sentence, and acts “without waiting for an end of the voice activity” if there is a grammatically complete sentence; 
“obtaining a response speech corresponding to the final speech recognition result, and returning the response speech to the client device” – partial response 54 is provisional, in that it is not necessarily in a form ready for outputting to the user, and it is only when the final result 52F is outputted by language model 34, i.e., in response to a detection of a grammatically complete sentence, that the partial response 54 is finalized by response generator 40, thereby generating final response 54F (“a response speech corresponding to the final speech recognition result”); response 54F is ‘final’ in the sense that it is a complete response to the grammatically complete sentence detected by language model 34, that is substantially ready for output to user 4, in the sense that the information is settled, though in some cases some formatting with text-to-speech conversion may still be needed (¶[0080]: Figure 5B); response 
Concerning independent claims 1 and 6 to 7, Froelich discloses that response generator 40 is able to generate a final response 54F more quickly when final result 52F is finally outputted by language model 34 than it would be able to if it relied on final result 52F alone.  (¶[0086]: Figure 6B)  Froelich, then, discloses substantially improving the speech interaction response speed by using a language model to control when a final speech recognition result is obtained as a grammatically complete sentence instead of using a standard voice activity detector to determine an end of speech as silence for speech inactivity.  However, Froelich does not disclose the limitations directed to “obtaining user’s expression attribute information, which is pre-determined by analyzing the user’s past speaking expression habits and indicates whether a user is one who expresses content completely at one time, and in response to determining according to the expression attribute information that the user is a user who expresses content completely at one time”, then completing the speech interaction in a first manner.  That is, Froelich discloses determining if voice input from a user represents a complete grammatical sentence.  If a complete grammatical sentence is detected, then a final decision as to a sequence of words is outputted, so that a speech interaction is completed “in a first manner”.  If a complete grammatical sentence is not detected, then Froelich, then, broadly discloses determining a response according to whether “the user is a user who expresses content completely, at one time” and “expression attribute information”.  Detection of a complete grammatical sentence implies that ‘the user has expressed content completely, at one time’, and this is “expression attribute information”.  
Still, Froelich does not disclose “obtaining user’s attribute information which is pre-determined by analyzing the user’s past speaking expression habits”, i.e., expression of complete content is not determined by ‘analyzing the user’s past speaking habits’.
Concerning independent claims 1 and 6 to 7, Klisch et al. teaches analyzing a spoken sequence of numbers recognized by automatic speech recognition using a determination of a speaking pause length between two consecutive numbers and deciding if the two consecutive numbers belong to a single numerical value on the basis of the determined pause length.  (Abstract)  Specifically, Klisch et al. teaches a pause length threshold that depends upon an individual speaker.  Here, a pause length threshold is automatically adapted to the current user’s speaking habit.  This can be done by analyzing previously entered numerical values which the user has already acknowledged to be correct (“obtaining user’s expression attribute information which is pre-determined by analyzing the user’s past speaking expression habits”).  (¶[0015])  Processing unit 160 decides whether or not two consecutive numbers belong to a single e.g., ‘5’, ‘100’, and ‘30’.  If neither of the two pause lengths P1 and P2 exceeds pause length threshold Θ, processing unit 160 decides that the spoken sequence of numbers contains a single numerical value, i.e., ‘530’.  If processing unit 160 determines that only the first pause length P1 exceeds pause length threshold Θ, it decides that the spoken sequence of numbers contains the two numerical values ‘5’ and ‘130’, but if only the second pause length P2 exceeds pause length threshold Θ, processing unit 160 decides that the spoken sequence of numbers contains the two numerical values ‘500’ and ‘30’.  (¶[0033]: Figure 3)  Klisch et al., then, teaches “obtaining user’s expression attribute information which is pre-determined by analyzing the user’s past speaking expression habits” for a pause length threshold that depends upon an individual user and is adapted to a pre-determined threshold by analyzing that user’s past speaking habits.  
This pause length threshold characterizing a given user is “the expression attribute information that the user is a user who expresses content completely at one time” because a pause length threshold for a given user characterizes a user as someone who is more or less likely to speak numbers with or without pauses.  Additionally, Klisch et al.’s recognition of numerical sequences Klisch et al. in speech recognition of partial and final results of Froelich for a purpose of providing a robust distinction between different semantic interpretations for ambiguities of numerical values.    
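By way of illustration only, the pause-length decision described above for the sequence ‘5’, ‘100’, ‘30’ can be sketched as follows.  This is a minimal sketch and not the implementation of Klisch et al.: the function name `group_numbers`, the treatment of ‘100’ as a multiplier, and the threshold and pause values in the usage lines are assumptions introduced here for the example.

```python
from typing import List

# Assumption for illustration: "hundred"-style tokens multiply the value
# spoken just before them; all other tokens are added.
MULTIPLIERS = {100, 1000}


def group_numbers(tokens: List[int], pauses: List[float], threshold: float) -> List[int]:
    """Group consecutive spoken numbers into numerical values.

    pauses[i] is the speaking pause between tokens[i] and tokens[i + 1].
    Two consecutive numbers are merged into one value unless the pause
    between them exceeds the (per-user) pause length threshold.
    """
    if not tokens:
        return []
    if len(pauses) != len(tokens) - 1:
        raise ValueError("expected exactly one pause between each pair of tokens")

    values = [tokens[0]]
    for token, pause in zip(tokens[1:], pauses):
        if pause > threshold:
            # Long pause: the next number starts a new numerical value.
            values.append(token)
        elif token in MULTIPLIERS:
            values[-1] *= token
        else:
            values[-1] += token
    return values


theta = 0.3  # hypothetical pause length threshold in seconds
print(group_numbers([5, 100, 30], [0.1, 0.1], theta))  # neither pause exceeds theta
print(group_numbers([5, 100, 30], [0.5, 0.1], theta))  # only P1 exceeds theta
print(group_numbers([5, 100, 30], [0.1, 0.5], theta))  # only P2 exceeds theta
```

With these assumed values, the three calls reproduce the three cases recited above: a single value 530, the two values 5 and 130, and the two values 500 and 30.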

Concerning claims 2, 8, and 11, Froelich discloses:
“for the partial speech recognition result obtained each time before and after the start of the voice activity detection, respectively, obtaining a search result corresponding to the partial speech recognition result” – in generating partial response 54, response generation module 40 can communicate one or more identified words in the set of partial results 52 (“for the partial speech recognition result obtained each time”) to keyword lookup service 38 in order to retrieve information associated with the one or more words; keyword lookup service 38 may be a search engine, e.g., Microsoft® Bing® or Google (“obtaining a search result corresponding to the partial speech recognition result”); any retrieved information that proves relevant can be incorporated from partial response 54 into final response 54F; this pre-lookup can be performed whilst the user is still speaking, i.e., during an interval of speech activity (when speech detector 42 is still 
“sending the search result to a Text to Speech server for speech synthesis” – final response 54F may be generated in a text format and converted to audio data using a text-to-speech conversion algorithm (¶[0102]: Figure 4); server 8 includes audio encoder 50, which provides this text-to-speech conversion algorithm; here, text-to-speech conversion is equivalent to “speech synthesis”; implicitly, a search result obtained from pre-lookup is incorporated from a partial response 52; 
“upon obtaining the final speech recognition result, taking a speech synthesis result obtained according to the final speech recognition result as the response speech” – response 54F is ‘final’ in the sense that it is a complete response to the grammatically complete sentence detected by language model 34, that is substantially ready for output to user 4, in the sense that the information is settled, though in some cases some formatting with text-to-speech conversion may still be needed (“taking a speech synthesis result obtained according to the final speech recognition result as the response speech”) (¶[0080]: Figure 5B); final response 54F may be generated in a text format and converted to audio data using a text-to-speech conversion algorithm (¶[0102]: Figure 4).
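By way of illustration only, the pre-lookup behavior mapped above (retrieving information for words in the partial results while the user is still speaking, then incorporating the relevant information into the final response) can be sketched as follows.  This is a minimal sketch under stated assumptions and not Froelich’s implementation: the function `respond_with_prelookup`, the `lookup` callable standing in for a keyword lookup service, and the sample words are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def respond_with_prelookup(
    partials: Iterable[List[str]],
    final_words: List[str],
    lookup: Callable[[str], str],
) -> str:
    """Build a text response from information fetched during speech.

    Each partial result triggers a lookup for any word not yet seen, so
    that most of the retrieval is already done when the final result
    arrives. Only information for words that survive into the final
    result is incorporated into the response text.
    """
    cache: Dict[str, str] = {}
    for words in partials:  # runs while the user is still speaking
        for word in words:
            if word not in cache:
                cache[word] = lookup(word)

    for word in final_words:  # anything missed is fetched at the end
        if word not in cache:
            cache[word] = lookup(word)

    # The resulting text would then be handed to text-to-speech conversion.
    return " ".join(cache[word] for word in final_words)


calls: List[str] = []


def fake_lookup(word: str) -> str:
    """Hypothetical stand-in for a search service; records each request."""
    calls.append(word)
    return f"<{word}>"


text = respond_with_prelookup(
    [["swallows"], ["swallows", "flew"], ["swallows", "flew", "south"]],
    ["swallows", "flew", "south"],
    fake_lookup,
)
print(text, calls)
```

In this sketch each word is looked up once, during the partial results, so no further retrieval is needed once the final result is available.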



Concerning claims 4, 9, and 12, Froelich discloses the steps of:
“if it determined according to the expression attribute information that the user is a user who does not express content completely at one time, completing the speech interaction in a second manner” – language model 34 has a functionality to detect a grammatically completed sentence in a provisional set 54 (¶[0071]: Figure 4); speech recognition service 30 operates cyclically on two levels of granularity; speech recognition system 32 operates continuously to repeatedly identify individual words as they are spoken by user 2, i.e., to generate and update partial results 52 on a per-word basis; as these words are identified, language model 34 operates continuously to repeatedly identify whole sentences spoken by the user, i.e., final result 52F, on a per-sentence basis; both mechanisms are used to control conversational agent 36, whereby bot 36 exhibits both per-word and per-sentence behavior (¶[0079]: Figure 4); here, per-word behavior by bot 36 for partial results on individual words is “completing the speech interaction in a second manner”, where these partial results on a per-word basis are “expression attribute information that the user is a user who does not express content completely at one time”; 
“the second manner comprises: sending the speech information to the automatic speech recognition server, and obtaining a partial speech recognition result returned by the automatic speech recognition server each time” – ASR system 32 is present on a server device of remote system 8 (“the automatic speech recognition server”) (¶[0052] and ¶[0060]: Figure 4); ASR system 32 provides continuous recognition, which means that as the user 4 starts speaking the ASR system starts to emit partial hypotheses (“a 
“for the partial speech recognition result obtained each time, respectively obtaining a search result corresponding to the partial speech recognition result, and sending the search result to the Text-to-Speech server for speech synthesis” – in generating partial response 54, response generation module 40 can communicate one or more identified words in the set of partial results 52 (“for the partial speech recognition result obtained each time”) to keyword lookup service 38 in order to retrieve information associated with the one or more words; keyword lookup service 38 may be a search engine, e.g., Microsoft® Bing® or Google (“obtaining a search result corresponding to the partial speech recognition result”) (¶[0087]: Figure 4); final response 54F may be generated in a text format and converted to audio data using a text-to-speech conversion algorithm (“and sending the search result to the Text-to-Speech server for speech synthesis”) (¶[0102]: Figure 4);
“upon determining that the voice activity detection ends, taking the finally-obtained speech syntheses result as the response speech, and returning the response speech to the client device” – any retrieved information that proves relevant can be incorporated from partial response 54 into final response 54F; this pre-lookup can be performed whilst the user is still speaking, i.e., during an interval of speech activity (when speech detector 42 is still indicating a speaking state) and subsequently incorporated into final response 54F for outputting when a speech activity interval ends 

Response to Arguments
Applicants’ arguments filed 03 January 2022 have been fully considered but they are not persuasive.
Applicants provide an amendment to independent claims 1, 6, and 7 directed to “without waiting for an end of the voice activity detection”, and present arguments traversing the prior rejection of these independent claims as being obvious under 35 U.S.C. §103 over Froelich (U.S. Patent Publication 2017/0256261) in view of Klisch et al. (U.S. Patent Publication 2002/0042709).  Applicants give a brief argument directed to Klisch et al., but their main allegations are against Froelich.  Applicants state that i.e., prior to the end of the voice activity detection.  Applicants cite ¶[0008] of the Specification that the inventor discovered that in a practical application, before the voice activity detection ends, a case might occur in which partial speech recognition results obtained at a certain time are already the final speech recognition results, so that waiting would prolong the speech response time and reduce the speech interaction response speed.  Applicants note ¶[0056] to ¶[0057] of the Specification, which describes that time may be saved up to 500 to 600 milliseconds.  Applicants allege that this technical solution is not disclosed by Froelich.  Applicants maintain that Froelich discloses a speech detector that is not a voice activity detector, so that it is not reasonable to interpret the speech detector as a voice activity detector in Froelich.  Applicants contend that their determination based on a voice activity detector is “totally different” from a determination by a speech detector in Froelich.  Applicants argue that Froelich would lead to a “totally different” direction, as this leads to obtaining grammatically complete sentences, and replacing a voice activity detector with a speech detector, and sacrifices response time.  That is, Applicants say that Froelich intends to obtain expression as complete as possible, and may sacrifice some response time, but the invention improves response speed which may potentially sacrifice some accuracy.  
Then Applicants’ argument directed against Klisch et al. is that this reference fails to teach “obtaining user’s expression information which is pre-determined by analyzing the 
Generally, Applicants’ arguments are not persuasive to one having ordinary skill in the art considering what the prior art teaches as a whole and as construed in light of what is described in Applicants’ Specification.  Applicants’ invention is maintained to be equivalent to what is disclosed for speech detection by Froelich.  That is, if one carefully considers what is being performed by speech detector 44 of Froelich and what is performed by Applicants’ “determining through semantic understanding that the obtained partial speech recognition result already includes entire content that the user hopes to express”, then these are the same.  Equivalently, Froelich’s speech detector 44 uses a language model 34 of speech recognition service 30 to determine “a grammatically complete sentence”, and this is identical to determining “speech recognition result already includes entire content” in the claim language of Applicants.  Similarly, Froelich expressly discloses the whole rationale of what Applicants believe provides unexpected results for their invention: improving speech interaction response speed.  Applicants observe that their Specification, ¶[0056] - ¶[0057], describes an increase in response speed of 500 to 600 milliseconds (about half of a second).  However, Froelich equivalently discloses an increase in response speed.  At ¶[0086], Froelich states:
Note, however, that by generating and updating the partial response 54 based on the partial results 52 on a per-word basis (and not just the final result 52F′), the response generator 40 is able to generate the final response 54F more quickly when the final result 52F is finally outputted by the language model 34 than it would be able to if it relied on the final result 52F alone.  (emphasis added) 

Applicants’ argument, then, presupposes an advantage that already resides in Froelich.  The fact that Applicants have recognized another advantage which would flow naturally from following the suggestion of the prior art cannot be the basis for patentability when the differences would otherwise be obvious.  See Ex parte Obiaya, 227 USPQ 58, 60 (Bd. Pat. App. & Inter. 1985).
	The examiner agrees that speech detector 44 of Froelich is not a standard voice activity detector because it relies upon a language model 34 of speech recognition service 30 to detect an end of speech.  Froelich, at ¶[0089] - ¶[0094]: Figures 5A to 5B and 6A to 6B, describes this speech detector 44 according to given rules as operating according to a voice activity detector to determine a start of speech, but instead uses a language model to determine an end of speech when it is a grammatically complete sentence instead of simply detecting silence to determine an end of speech as is done by a conventional voice activity detector.  This is completely equivalent to what is being performed by Applicants, and is not as they contend “totally different”.  Specifically, Froelich, at ¶[0090], states that an interval of speech activity commences in response to identifying at least one individual word in the voice input.  Then, Froelich, at ¶[0091] - ¶[0093], states that an interval of speech inactivity commences in response to detecting a grammatically complete sentence as detected by language model 34.  Accordingly, Froelich discloses (1) – (3) as presented by Applicants’ Remarks: (1) ‘the voice activity starts’ as described by Froelich at ¶[0090], where speech activity begins in response to identifying at least one individual word in voice input 19; (2) ‘the obtained partial speech recognition result already includes entire content that the user hopes to express’ as described by Froelich at ¶[0091] - ¶[0092], where a language model 34 detects a condition of a grammatically complete sentence; and (3) ‘without waiting for an end of the voice activity detection’ as described by Froelich at ¶[0093] - ¶[0094], where speech inactivity is detected only after an interval of one to three seconds and no partials being detected.  
That is, Froelich generates a final response more quickly if a grammatically complete sentence is detected, and does not wait one to three seconds more for an end of voice activity if there is a grammatically complete sentence.  
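By way of illustration only, the end-of-speech behavior characterized above can be sketched as follows: speech activity starts once at least one word is identified, and it ends either immediately upon a grammatically complete sentence or, failing that, after a silence interval with no further partial results.  This is a minimal sketch and not Froelich’s implementation: the function `find_endpoint`, the `is_complete` predicate standing in for the language model, the timestamps, and the two-second timeout (chosen from the one to three second range noted above) are assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# (time in seconds, words recognized so far) for each partial result
Partial = Tuple[float, List[str]]


def find_endpoint(
    partials: Sequence[Partial],
    is_complete: Callable[[List[str]], bool],
    timeout: float = 2.0,
) -> Optional[Tuple[float, str]]:
    """Return (time, reason) at which speech activity is deemed to end.

    reason is "complete_sentence" when the stand-in language model
    reports a grammatically complete sentence, or "timeout" when no new
    partial result arrives within `timeout` seconds.
    Returns None if no word was ever identified (activity never started).
    """
    last_time: Optional[float] = None  # time of latest partial during activity
    for time, words in partials:
        if last_time is not None and time - last_time > timeout:
            return (last_time + timeout, "timeout")
        if not words:
            continue  # no individual word identified yet
        if is_complete(words):
            return (time, "complete_sentence")  # do not wait for silence
        last_time = time
    if last_time is None:
        return None
    return (last_time + timeout, "timeout")


utterance = ["maybe", "the", "swallows", "flew", "south"]
stream = [(0.5 * i, utterance[: i + 1]) for i in range(len(utterance))]


def ends_with_south(words: List[str]) -> bool:
    """Hypothetical completeness check used only for this example."""
    return words[-1] == "south"


print(find_endpoint(stream, ends_with_south))       # ends at the complete sentence
print(find_endpoint(stream[:-1], ends_with_south))  # falls back to the timeout
```

Under these assumptions the first call ends speech activity at 2.0 seconds, when “south” completes the sentence, while the second call, lacking a complete sentence, ends only at 3.5 seconds after the timeout has elapsed.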
	Applicants’ new limitation is somewhat problematic as directed to “without waiting for an end of the voice activity detection”.  Granted, there is literal support for this limitation in the Specification, Page 5, Lines 29 to 30, but this is provided only as a statement to distinguish over the prior art, and is not fairly explained by any reiteration in the Specification.  The problem is that “voice activity detection” is ambiguous in this context.  Applicants argue that Froelich does not use a standard voice activity detector, but then neither do Applicants.  There is nothing that describes what constitutes a standard voice activity detector in the Specification.  That is, Applicants’ voice activity detector is not a standard voice activity detector because it relies upon semantic understanding that a partial speech recognition result already includes entire content that the user hopes to express.  A conventional voice activity detector would not do this, but would simply wait for silence to determine that speech has ended.  Applicants are trying to draw a distinction that does not exist between their invention and Froelich.  Even if a start of speech activity depends upon identifying an individual word instead of speech, per se, in Froelich, there is no basis for making this distinction using Applicants’ Specification.  
Froelich.  Figures 5A to 5B illustrate that a user states: “maybe the swallows flew”, which is recognized by ASR 32 but is not a grammatically complete sentence.  However, the user continues to speak: “maybe the swallows flew south”, which is a grammatically complete sentence, and recognized as grammatically complete by language model 34.  Given that there is a grammatically complete sentence, speech detector 44 does not wait an additional one to three seconds, but immediately generates a response that is delivered as: “but it is still June”.  This response is delivered to the user in Figure 6A.  If a user had continued to speak: “though it is still June”, then language model 34 recognizes this as a grammatically complete sentence, and response generation is delivered as: “I agree, it’s unlikely they have yet” instead of “but it is still June”.  
	Any unexpected results argued by Applicants ensue naturally in Froelich.  See Obiaya, supra.  Applicants may provide some specific improvement of response speed in a range of 500 to 600 milliseconds, but an increase in response speed is equivalently disclosed by Froelich, at ¶[0086], where it is stated that “response generator 40 is able to generate the final response 54F more quickly when the final result 52F is finally outputted by the language model 34 than it would be able to if it relied on the final result 52F alone.”  Froelich, at ¶[0039] - ¶[0040], is directed towards distinguishing over the use of a conventional voice activity detector in the same way as the Specification, Page 5, Lines 28 to 30, where it is stated to “return a response to the user for broadcasting, and end the speech interaction, without waiting for the end of the voice activity detection as in the prior art”.
Klisch et al.  Here, Applicants’ arguments are only conclusory against Klisch et al., and do not fairly provide an analysis of what is taught by that reference.  Applicants merely characterize Klisch et al. as recognizing numerical sequences according to pause length, and their analysis is plainly deficient.  Instead, Klisch et al. clearly and repeatedly teaches that a pause length threshold depends on the individual speaker and a “user’s speaking habit”.  This is significantly identical to Applicants’ claim language of “the user’s past speaking expression habits”.  Specifically, Klisch et al. states at ¶[0015] and ¶[0017]: 
[0015] It has been found that robust setting of a pause length threshold is strongly interrelated with speech tempo which in turn depends on the individual speaker. In reality, the speech tempo of different speakers can vary within a wide range. According to a preferred embodiment of the invention, the pause length threshold is therefore automatically adapted to the current user's speaking habit. This can e.g. be done by analyzing previously determined speaking pause lengths within one or more previously uttered numerical values which the user has already acknowledged to be correct. . . .
[0017] Like the pause length threshold, the respective thresholds of further prosodic parameters can be user-adjustable or be automatically adjusted dependent on the user's speaking habit or be adjusted in accordance with appropriate training data. 
(emphasis added)
Granted, Klisch et al. is directed to a specific problem in the speech recognition of a series of numerical digits instead of to a more general problem in the speech recognition of words.  Still, numerical digits are spoken as words.  Klisch et al., then, clearly teaches Applicants’ limitation of “analyzing the user’s past speaking expression habits”.  These speaking habits are taught to be dependent on the individual speaker, so they are ‘attribute information of the user’, i.e., “obtaining user’s expression attribute information which is pre-determined by analyzing”.  If a user is a variety of user who does not pause between the speaking of numerical digits, then the user is “one who Klisch et al.
	Applicants’ arguments are not persuasive.  There are no new grounds of rejection.  Accordingly, this rejection is properly FINAL.

Conclusion
THIS ACTION IS MADE FINAL.  Applicants are reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER whose telephone number is (571) 272-7608.  The examiner can normally be reached on Monday-Thursday 8:30 AM-6:00 PM.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MARTIN LERNER/Primary Examiner
Art Unit 2657
January 24, 2022