DETAILED ACTION
Applicant’s arguments filed in the reply on 4/30/2021 were received and fully considered. Claims 1-8, 10-13, and 15-20 were amended. Claim 21 is new. The current Office action is FINAL. Please see the corresponding rejection headings and the Response to Arguments section below for more detail.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s remarks with respect to the 35 USC 112(f) claim interpretation raised in the previous office action are acknowledged.
Applicant’s arguments filed with respect to the 35 USC 112(b) rejections raised in the previous office action were persuasive in view of the amendment. These rejections are withdrawn.
Applicant's arguments filed with respect to the 35 USC 101 rejections raised in the previous office action have been fully considered, but they are not persuasive. The amendment with respect to a microphone and a camera does not integrate the judicial exception, when considered as an ordered combination, into a practical application. Rather, the amendment pertains to insignificant extra-solution activity (data gathering). See MPEP 2106.05(g). Please see the 35 USC 101 rejection heading below for more detail.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter without significantly more. The claims as a whole, considering all claim elements both individually and in combination, do not amount to significantly more than an abstract idea.
Independent claims 1, 20, and 21 recite: “An information processing device comprising: a detection unit configured to detect a breakpoint of a speech of a user on a basis of a result of recognition that is to be obtained during the speech of the user; and an estimation unit configured to estimate an intention of the speech of the user on a basis of a result of semantic analysis of a divided speech sentence obtained by dividing a speech sentence at the detected breakpoint of the speech wherein the result of the recognition includes a result of recognition of voice data of the speech of the user obtained by a microphone capturing a voice emitted by the user and a result of recognition of image data obtained by a camera capturing
The limitations of “detecting a breakpoint of a speech”, “estimating intention”, and “dividing a speech sentence at breakpoint”, as drafted, cover human organizing activities. More specifically, an individual can detect a breakpoint of a speech (e.g., on a piece of paper), in this case from the output of an ASR engine, by inspecting the data (i.e., data examination) in order to identify the breakpoint (“pause”) of sentences in the output of a given ASR. Next, based on specific criteria, the same individual can split the sentences at the pause or breakpoint via visual inspection and subsequently estimate an intention from the phrases/sentences that were generated and analyzed.
This judicial exception is not integrated into a practical application. In particular, claims 1, 20, and 21 recite, as an additional element, a processing device in the form of a computing device (which can include a “processor” and “memory”). For example, Par. 0271 (also FIG. 13) of the as-filed specification states: “a computer 1000, a central processing unit (CPU) 1001, a read-only memory (ROM) 1002, and a random access memory (RAM) 1003 are connected to each other by a bus 1004. An input and output interface 1005 is further connected to the bus 1004. An input unit 1006, an output unit 1007, a recording unit 1008, a communication unit 1009, and a drive 1010 are connected to the input and output interface 1005.” Furthermore, capturing a speech of a user via a microphone, or augmenting such activity with capturing an image via a camera, amounts to adding insignificant extra-solution activity to the judicial exception (see MPEP 2106.05(g)). As such, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits
Furthermore, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is, due to lack of specificity, considered a generic computer (or processor); see par. 0271 of the Applicant’s Specification. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Moreover, the limitations in the claims noted above, taken individually or as an ordered set, do not amount to significantly more than the judicial exception. As such, they are directed to an abstract idea as discussed, which performs mental activity. Thus, neither the additional elements nor the limitations, taken individually or as an ordered set, amount to significantly more than the judicial exception. The claims are not patent eligible.
Claim 2 is directed toward human activity. The claim recites that the result of the recognition further includes a result of recognition of sensor data obtained by sensing the user or a surrounding of the user, which can be performed by a human. The surroundings of a user can be documented with a pen and paper. All such actions can be carried out with the aid of a pen and paper, or in combination with a generic monitor. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is directed toward an abstract idea.

Claim 4 is directed toward an abstract idea. The claim recites estimating an intention of the speech of the user on a basis of an intention (Intent) and entity information (Entity) that are to be sequentially obtained for each of the divided speech sentences. A human can dissect a sentence into various pieces on paper, extract meaning from each individual part, and suggest the intention or action that is required from the context of the dissected sentence. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is directed toward an abstract idea.
Claim 5 is directed toward an abstract idea. The claim recites extracting an intention (Intent) that follows the speech sentence, from among intentions (Intents) of the respective divided speech sentences. A human can dissect a sentence into various pieces on paper and extract a single intention from among the other intentions. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is directed toward an abstract idea.

Claim 7 is directed toward an abstract idea. Claim recites “The information processing device according to claim 4, wherein the entity information (Entity) includes, as a type thereof, a Body type representing that a free speech is included, and in a case in which an intention (Intent) of a last divided speech sentence includes entity information (Entity) of a Body type, in a case in which a target divided speech sentence being a divided speech sentence provided ahead of the last divided speech sentence, and being targeted satisfies a specific condition, the estimation unit is further configured to make an intention (Intent) of the target divided speech sentence unexecuted, and adds content thereof to entity information (Entity) of a Body type that is included in the intention (Intent) of the last divided speech sentence.” This can be done by a human. Using body type (i.e., different portions of the human body such as head, neck, eye, etc.) as a means to communicate with the system can easily be done by a human. Body movement can be understood by a human operator just as effectively as by a generic photo sensor or similar means. A human can observe a user, monitor the user’s movement, and interpret it in the same way that a computing machine would. Based on the body movement observed, a human can use a pen and paper to write down the sentences, divide the sentences down to their component level, and identify the intent related to each. The body type or part that was
Claim 8 is directed toward an abstract idea. Claim recites “wherein, in a case in which the target divided speech sentence does not satisfy the specific condition, the estimation unit is further configured to discard the intention (Intent) of the target divided speech sentence”. The conditional statement in the claim can easily be decided by a human. A human can write down on paper the sentence or sentences of a given speech and check whether a given condition is met within the sentences at hand. If such a condition is not met or cannot be satisfied, the sentences can be eliminated. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is directed toward an abstract idea.
Claim 9 is directed toward an abstract idea. Claim recites “wherein the specific condition includes a condition for determining whether or not a rate of the speech of the user exceeds a predetermined threshold value, or a condition for determining whether or not the user looks at a predetermined target”. The conditional statement in the claim can easily be decided by a human. A human can observe the user and notice whether the user is looking at a given point or elsewhere. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is directed toward an abstract idea.
Claim 10 is directed toward an abstract idea. Claim recites “wherein the entity information (Entity) includes, as a type thereof, a Body type representing that a free speech is included, and when the divided speech sentence including entity information (Entity) of a Body type does not exist, the estimation unit estimates an intention of the speech of the user in 
Claim 11 is directed toward an abstract idea. Claim recites “wherein, when the speech of the user includes an intention (Intent) of retraction, the estimation unit is further configured to delete a divided speech sentence to be retracted, from a target of intention estimation of the speech of the user.” Retraction of a speech, or of part of the speech, can be carried out by a human with a pen and paper. The human can write down the sentence, mark the intent in question, and delete it. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is directed toward an abstract idea.
Claim 12 is directed toward an abstract idea. Claim recites “wherein, when an nth divided speech sentence includes an intention (Intent) of retraction, the estimation unit is further configured to delete an (n-1)th divided speech sentence from a target of intention estimation of the speech of the user.” This can be done by a human. The human can write down the sentences, count to the nth one, and delete the one in question. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is directed toward an abstract idea.
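For illustration only, the rule recited in claim 12 can be sketched as follows. This is a minimal sketch under stated assumptions, not the Applicant's implementation: the `DividedSentence` type, the `RETRACT` label, and the function name are hypothetical, and the choice to also exclude the retraction sentence itself is an assumption not stated in the claim.

```python
from dataclasses import dataclass
from typing import List

RETRACT = "RETRACT"  # hypothetical label for an intention (Intent) of retraction


@dataclass
class DividedSentence:
    text: str    # one divided speech sentence
    intent: str  # intention (Intent) estimated for that sentence


def apply_retractions(sentences: List[DividedSentence]) -> List[DividedSentence]:
    """Return the sentences that remain targets of intention estimation."""
    removed = set()
    for n, sentence in enumerate(sentences):
        if sentence.intent == RETRACT:
            # Assumption: the retraction sentence itself is not a target either.
            removed.add(n)
            # Claim 12: the nth sentence retracts the (n-1)th sentence.
            if n > 0:
                removed.add(n - 1)
    return [s for i, s in enumerate(sentences) if i not in removed]
```

With the hypothetical input "turn on the light", "no, cancel that" (retraction), "play music", only "play music" remains a target of intention estimation.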

Claim 14 is directed toward an abstract idea. Claim recites “wherein the feedback information includes a voice, a sound effect, or an image.” This can be done by a human. Providing feedback once a breakpoint of a sentence is detected can easily be done by shouting or otherwise providing audio feedback to the user. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is directed toward an abstract idea.
Claim 15 is directed toward an abstract idea. Claim recites “wherein the detection unit detects the breakpoint of the speech on a basis of the result of the recognition of the voice data, when a time of a pause of the speech of the user exceeds a fixed time, when a boundary of an intonation phrase included in the speech of the user is detected, or when falter or filler included in the speech of the user is detected.” This can be done by a human. Detecting
Claim 16 is directed toward an abstract idea. Claim recites “wherein the detection unit is further configured to detect the breakpoint of the speech on a basis of the result of the recognition of the image data, when a time in which a mouth of the user does not move exceeds a fixed time, or when a movement of a visual line of the user that exceeds a predetermined threshold is detected”. This can be done by a human. A human can observe images of the user and, based on them, identify the breakpoints of the voice data. By watching the user’s mouth movement, or lack of movement, a human can reach the same determination; a human can also observe whether a movement of the lips exceeds a preset amount and note it down. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is directed toward an abstract idea.
Claim 17 is directed toward an abstract idea. Claim recites “wherein the detection unit is further configured to detect the breakpoint of the speech on a basis of the result of the recognition of the sensor data, when intake of breath of the user is detected, or a movement of an entire or a part of a body of the user is detected” which can be done by a human. Detecting breakpoint of a voice data when sensor data of the user is available and based on that human 
Claim 18 is directed toward an abstract idea. Claim recites “further comprising a task execution unit configured to execute a task on a basis of a result of intention estimation of the speech of the user, wherein the task execution unit is implemented via at least one processor”, which can be carried out by a human. Executing a given task based on the intention of a voice can also be done by a human simply by observing the user and, once the intention is observed, performing a given action. As an example, if the observed intention is to make a call, a human can easily place such a call. Even though the claim includes “at least one processor” implementing the task execution unit, the processor is not considered an additional element due to lack of specificity, and it is considered a generic computer (or processor); see par. 0271 of the Applicant’s Specification. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept, as mentioned earlier. Beyond the added generic processor, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is directed toward an abstract idea.
Claim 19 is directed toward an abstract idea. Claim recites “a speech recognition unit configured to perform speech recognition (ASR) for obtaining the speech sentence from the speech of the user; and a semantic analysis unit configured to perform semantic analysis (NLU) of the divided speech sentence to be sequentially obtained at the breakpoint of the speech, 
Therefore, claims 1-21 are not patent eligible under 35 USC 101.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention 

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1, 4-6, 20, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Charles Melvin Johnson (US 10134425 B1) (hereinafter "Johnson") in view of Kim et al. (US 20180268812 A1) (hereinafter "Kim").

Johnson and Kim were applied in the previous office action.
pauses in spoken words and may interpret those pauses as potential breaks in a conversation. Those breaks in a conversation may be considered as breaks between utterances and thus considered the beginning (beginpoint) or end (endpoint), of an utterance. The beginning/end of an utterance may also be detected using speech/voice characteristics).
and an estimation unit configured to estimate an intention of the speech of the user on a basis of a result of semantic analysis of a divided speech sentence obtained by dividing a speech sentence at the detected breakpoint of the speech (Col. 6, lines 20-28:” The NLU process takes textual input [such as processed from ASR 250 based on the utterance 11], and attempts to make a semantic interpretation of the text. That is, the NLU process determines the meaning behind the text based on the individual words and then implements that meaning. NLU processing 260 interprets a text string to derive an intent or a desired action from the user as well as the pertinent pieces of information in the text that allow a device [e.g., device 110] to complete that action.”, and Col. 6, lines 33-36:” The NLU may process several textual inputs related to the same utterance. For example, if the ASR 250 outputs N text segments [as part of an N-best list], the NLU may process all N outputs to obtain NLU results.”, and Col 5, lines 64-67:” ASR results in the form of a single textual representation of the speech, an N-best list including multiple hypotheses and respective scores, lattice, etc. may be sent to a server, such text into commands for execution, either by the device 110, by the server 120, or by another device [such as a server running a search engine, etc.]”, and Col 15, lines 32-34:” An endpoint detector may determine an endpoint based on different hypotheses determined by the speech recognition engine 258”).
and wherein the detection unit and the estimation unit are each implemented via at least one processor (Col. 28, lines 30-34: “Each of these devices [110/120] may include one or more controllers/processors [1204/1304], that may each include a central processing unit [CPU] for processing data and computer-readable instructions, and a memory [1206/1306] for storing data and instructions of the respective device.”).
However, Johnson does not teach wherein the result of the recognition includes a result of recognition of voice data of the speech of the user obtained by a microphone capturing a voice emitted by the user and a result of recognition of image data obtained by a camera capturing an image of the user.
Kim teaches wherein the result of the recognition includes a result of recognition of voice data of the speech of the user obtained by a microphone capturing a voice emitted by the user and a result of recognition of image data obtained by a camera capturing an image of the user (Par. 0004: “In some implementations, a system is capable improving endpoint detection of a voice query submitted by a user. For instance, the system may initially obtain audio data encoding the submitted voice query, and video data synchronized with the obtained audio data that includes images of the user's face when submitting the voice query. The system then uses techniques to distinguish between portions of the audio data corresponding to speech input and other portions of the voice query corresponding to non-speech input, e.g., background noise. As an example, the system initially determines a sequence of video frames that includes images of a face of the user. The system then identifies a sequence of video frames that includes images of detected lip movement. In some implementations, the system determines the first and last frames of the sequence, and their corresponding time points. The system then identifies an audio segment of the audio data that has a starting and ending time point corresponding to the time points of the first and last frames of the sequence of video frames. The system endpoints the audio data to extract the audio segment and provides the audio segment for output to an ASR for transcription.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson in view of Kim to include a result of recognition of voice data of the speech of the user obtained by a microphone capturing a voice emitted by the user and a result of recognition of image data obtained by a camera capturing an image of the user, in order to verify audio data against detected lip movement data indicating terms and/or phrases spoken by the user to identify and/or correct misrecognized terms, as evidenced by Kim (see Par. 0005).
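For illustration only, the endpointing steps described in the quoted passage of Kim (Par. 0004) can be sketched as follows. This is a minimal sketch under stated assumptions, not Kim's implementation: the function names, the per-frame lip-movement flags, and the sample-based audio representation are all hypothetical, and frame times are assumed to be in ascending order.

```python
from typing import List, Optional, Sequence, Tuple


def lip_movement_span(
    frame_times: Sequence[float], lip_moving: Sequence[bool]
) -> Optional[Tuple[float, float]]:
    """Time points of the first and last frames showing lip movement."""
    moving = [t for t, m in zip(frame_times, lip_moving) if m]
    if not moving:
        return None  # no lip movement detected, so no speech segment
    return moving[0], moving[-1]


def endpoint_audio(
    samples: Sequence[float], sample_rate: int, span: Tuple[float, float]
) -> List[float]:
    """Extract the audio segment whose start and end match the span."""
    start, end = span
    first = int(start * sample_rate)  # index of the first retained sample
    last = int(end * sample_rate)     # index one past the last retained sample
    return list(samples[first:last])
```

The extracted segment would then be passed to an ASR for transcription; that step is outside the sketch.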

Regarding claim 4, Johnson teaches the information processing device according to claim 1, wherein the estimation unit is further configured to estimate an intention of the speech of the user on a basis of an intention (Intent) and entity information (Entity) that are to be sequentially obtained for each of the divided speech sentences. (Col. 6, lines 20-32: “The NLU process takes textual input [such as processed from ASR 250 based on the utterance 11], and attempts to make a semantic interpretation of the text. That is, the NLU process determines the meaning behind the text based on the individual words and then implements that meaning. NLU processing 260 interprets a text string to derive an intent or a desired action from the user as well as the pertinent pieces of information in the text that allow a device [e.g., device 110] to complete that action. For example, if a spoken utterance is processed using ASR 250 and outputs the text “call mom” the NLU process may determine that the user intended to activate a telephone in his/her device and to initiate a call with a contact matching the entity “mom.””).
Regarding claim 5, Johnson teaches the information processing device according to claim 4, wherein the estimation unit is further configured to extract an intention (Intent) that follows the speech sentence, from among intentions (Intents) of the respective divided speech sentences. (Col. 7, lines 31-35: “An intent classification [IC] module 264 parses the query to determine an intent or intents for each identified domain, where the intent corresponds to the action to be performed that is responsive to the query”, and Col. 7, lines 38-40: “The IC module 264 identifies potential intents for each identified domain by comparing words in the query to the words and phrases in the intents database 278.”).

Regarding claim 6, Johnson teaches the information processing device according to claim 4, wherein the estimation unit is further configured to extract entity information (Entity) that follows the speech sentence, from among pieces of entity information (Entity) of the respective divided speech sentences. (Col. 7, lines 31-36: “An intent classification (IC) module 264 parses the query to determine an intent or intents for each identified domain, where the intent corresponds to the action to be performed that is responsive to the query. Each domain is associated with a database [278a-278n] of words linked to intents.”, and col 6, lines 24-28:” NLU processing 260 interprets a text string to derive an intent or a desired action from the user as well as the pertinent pieces of information in the text that allow a device [e.g., device 110] to complete that action.”, and Col 7 lines 44-46:” In order to generate a particular interpreted response, the NER 262 applies the grammar models and lexical information associated with the respective domain. Each grammar model 276 includes the names of entities (i.e., nouns) commonly found in speech about the particular domain (i.e., generic terms), whereas the lexical information 286 from the gazetteer 284 is personalized to the user(s) and/or the device.”). 

Regarding claim 20, Johnson teaches an information processing method of an information processing device, the information processing method comprising (Col. 32, lines 10-14: “Aspects of the disclosed system may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium”) to perform operations with steps virtually identical to the functions performed in claim 1.

Regarding claim 21, Johnson teaches a non-transitory computer-readable medium having embodied thereon a program, which when executed by a computer causes the computer to execute an information processing method, the method comprising: (Col. 28, lines 48 - 54:"Computer instructions for operating each device [110/120] and its various executed by the respective device's controller[s]/processor[s] [1204/1304], using the memory [1206/1306] as temporary “working” storage at runtime. A device's computer instructions may be stored in a non-transitory manner in non-volatile memory [1206/1306], storage [1208/1308], or an external device[s].”) to perform operations with steps virtually identical to the functions performed in claim 1. 


Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Johnson and Kim as applied to claim 1, in further view of Eagleman et al. (US 20180233163 A1) (hereinafter "Eagleman").

With regard to claim 2, Johnson teaches an information processing device as established above. 
Neither Johnson nor Kim teaches wherein the result of the recognition further includes a result of recognition of sensor data obtained by sensing the user or a surrounding of the user.
Eagleman teaches wherein the result of the recognition further includes a result of recognition of image data obtained by capturing an image of the user (Par. 0049: “In one variant, the input information includes information related to a user's surroundings and/or the surroundings of a system component [e.g., sensor], such as information associated with nearby objects and/or people [e.g., wherein Block S110 includes extracting information, such as text, semantic meaning, conceptual information, and/or any other suitable information, from the input information]. In a first example of this variant, the system includes an image sensor [e.g., camera], and the text input text recognized in images captured by the image sensor [e.g., automatically detected, such as by performing image segmentation, optical character recognition, etc.], and/or other extracted information [e.g., conceptual information) includes information discerned from the images [e.g., as described above]. For example, Block S110 can include transforming an image of a sign containing a message or string of characters in the user's environment into character data, for instance, if a user is in transit and the sign contains information related to safety or travel information [e.g., in a foreign country train depot). In a second example, the system includes an audio sensor [e.g., microphone], and the text input includes text associated with sounds sampled by the audio sensor [e.g., transcriptions of speech], and/or other extracted information [e.g., conceptual information] includes information discerned from the sounds [e.g., as described above].”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson and Kim in view of Eagleman to include a result of recognition of image data obtained by capturing an image of the user, or a result of recognition of sensor data obtained by sensing the user or a surrounding of the user, in order to benefit from receiving information derived from language inputs, visual sources, or audio sources when redundancy is beneficial, or when it is inconvenient or infeasible to receive information using a more conventional modality, as evidenced by Eagleman (see Par. 0003).

Claim 3 is rejected under 35 U.S.C. 103 as being unpatentable over Johnson, Kim, and Eagleman as applied to claim 2, in further view of Solomon et al. (US 20180232662 A1) (hereinafter "Solomon").

Solomon was applied in the previous office action.
With regard to claim 3, Johnson teaches an information processing device as established above. 
However, Johnson, Kim, and Eagleman do not teach wherein the detection unit is further configured to detect the breakpoint of the speech on a basis of a state or a gesture of the user that is to be obtained from the result of the recognition.
Solomon teaches wherein the detection unit is further configured to detect the breakpoint of the speech on a basis of a state or a gesture of the user that is to be obtained from the result of the recognition. (Par. 0087:” People often communicate in unanticipated and interesting ways, whether verbally, non-verbally (visually), or otherwise. Accurately parsing this variety of language and corresponding surface forms can prove challenging. For example, where a deterministic or other rules-based parser is used to parse spoken utterances or sign language gestures/expressions, the rules-based parser may be seeded with a library of pre-selected surface forms, such as a collection of idioms or regular expressions. In situations where a rules-based parser does not recognize the surface form of a particular spoken utterance, it may be unable to provide the user intent underlying the utterance.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson, Kim, and Eagleman in .

Claims 7 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Johnson and Kim as applied to claim 4, in further view of Finkelstein et al. (US 20180260680 A1) (hereinafter "Finkelstein"), in further view of Roy et al. (US 8219407 A1) (hereinafter "Roy").

Finkelstein and Roy were applied in the previous office action.
With regard to claims 7 and 8, Johnson teaches an information processing device as established above.
With respect to claim 7, neither Johnson nor Kim teaches the information processing device according to claim 4, wherein the entity information (Entity) includes, a Body type representing that a free speech is included, and in a case in which an intention (Intent) of a last divided speech sentence includes entity information (Entity) of a Body type, in a case in which a target divided speech sentence being a divided speech sentence provided ahead of the last divided speech sentence, and being targeted satisfies a specific condition, the estimation unit is further configured to make an intention (Intent) of the target divided speech sentence unexecuted, and add content thereof to entity information (Entity) of a Body type that is included in the intention (Intent) of the last divided speech sentence.
Finkelstein teaches wherein the entity information (Entity) includes, a Body type representing that a free speech is included, (Par. 0007:”FIG. 5 schematically shows a parser and intent handler processing a portion of a conversation”, and Par. 0381:” When included, input subsystem 770 may comprise or interface with one or more user-input devices. In some embodiments, the input subsystem may comprise or interface with selected natural user input (NUI) componentry. Such componentry may be integrated or peripheral, and the transduction and/or processing of input actions may be handled on-or off-board. Example NUI componentry may include a microphone for speech and/or voice recognition; an infrared, color, stereoscopic, and/or depth camera for machine vision and/or gesture recognition; a head tracker, eye tracker, accelerometer, and/or gyroscope for motion detection, gaze detection, and/or intent recognition; electric-field sensing componentry for assessing brain activity; any of the sensors described with respect to the example use cases and environments discussed above; and/or any other suitable sensor.”) Note the sensors are used in conjunction with the user’s speech to parse intent out.
and in a case in which an intention (Intent) of a last divided speech sentence includes entity information (Entity) of a Body type, (Par. 0256:” After initial identification of the person, the entity tracker 100 may use less resource-intensive techniques in order to continue tracking the person while conserving computing resources. For example, the entity tracker 100 may use lower-resolution cameras to track the person based on the general shape of their body, their gait [e.g., by evaluating angles formed between different joints as the person walks], their clothing [e.g., tracking patches of color known to correspond to the person's clothing], etc. In some examples, and to periodically confirm its initial identification of the person is still accurate, the entity tracker 100 may perform facial recognition intermittently after the initial identification. In general and depending on the particular context, the entity tracker 100 may identification and tracking of entities.”) Note: Information from the body and information from the speech are combined in order to determine intent.
Finkelstein further teaches and add content thereof to entity information (Entity) of a Body type that is included in the intention (Intent) of the last divided speech sentence (Par. 0381:” When included, input subsystem 770 may comprise or interface with one or more user-input devices. In some embodiments, the input subsystem may comprise or interface with selected natural user input (NUI) componentry. Such componentry may be integrated or peripheral, and the transduction and/or processing of input actions may be handled on-or off-board. Example NUI componentry may include a microphone for speech and/or voice recognition; an infrared, color, stereoscopic, and/or depth camera for machine vision and/or gesture recognition; a head tracker, eye tracker, accelerometer, and/or gyroscope for motion detection, gaze detection, and/or intent recognition; electric-field sensing componentry for assessing brain activity; any of the sensors described with respect to the example use cases and environments discussed above; and/or any other suitable sensor.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson and Kim in view of Finkelstein to use body type and entity information in order to ensure that a user's requests and intentions are fully captured, as evidenced by Finkelstein (See Par. 0002).
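Johnson, Kim, and Finkelstein do not teach in a case in which a target divided speech sentence being a divided speech sentence provided ahead of the last divided speech sentence, and being targeted satisfies a specific condition, the estimation unit is further configured to make an intention (Intent) of the target divided speech sentence unexecuted.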
Roy teaches in a case in which a target divided speech sentence being a divided speech sentence provided ahead of the last divided speech sentence, and being targeted satisfies a specific condition, the estimation unit is further configured to make an intention (Intent) of the target divided speech sentence unexecuted. (col. 19 lines 1-25:”Based on the determined command (intent) elements, the logical command processor determines which command elements are present in the representation of the speech input. If all the command elements required for completeness are present, the command is executed. Otherwise the system builds and registers grammars with the speech recognizer for the missing command elements and the user is prompted for the missing command elements. This prompting can take place one element at-a-time, or all the missing elements can be requested in a single prompt. The system waits for and receives the subsequent user speech input, which is processed by the speech recognizer. If an abort, cancel or fail condition are not present in the input or otherwise triggered, for example by time or exceeding a predetermined number of loops, a representation of the applicable speech input is parsed into the command structure in the memory location maintaining the representation of the prior speech input. Speech input which is not applicable because if falls outside the scope of the relevant grammars is preferably ignored, although in some embodiments this determination of relevance may be made by the logical command processor. Additionally, the system may receive input from other input devices such as a keyboard, mouse or handwriting device, and this input is likewise added to the command structure in the representation of the speech input.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson, Kim, and Finkelstein in view of Roy so that the intention of the prior divided speech sentence is left unexecuted when a specific condition is satisfied, in order to use a logical command processor to provide additional processing of a portion of the output of a speech recognizer which cannot be processed by the speech recognizer itself, as evidenced by Roy (See Col. 1, lines 30-33).

With respect to claim 8, Johnson further teaches wherein, in a case in which the target divided speech sentence does not satisfy the specific condition, the estimation unit is further configured to discard the intention (Intent) of the target divided speech sentence. (Col. 16, lines 56-60:”As part of the language modeling [or in other phases of the ASR processing] the speech recognition engine 258 may, to save computational resources, prune and discard low recognition score states or paths that have little likelihood of corresponding to the spoken utterance, either due to low recognition score pursuant to the language model, or for other reasons. Such pruned paths are considered inactive.”)

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Johnson, Kim, Finkelstein, and Roy as applied to claim 8, in further view of William S. Carter (US 9311932B1) (hereinafter "Carter").

Carter was applied in the previous office action.
With regard to claim 9, Johnson and Kim teach an information processing device as established above.
Neither Johnson nor Kim teaches wherein the specific condition includes a condition for determining whether or not a rate of the speech of the user exceeds a predetermined threshold value, or a condition for determining whether or not the user looks at a predetermined target. 
Carter teaches wherein the specific condition includes a condition for determining whether or not a rate of the speech of the user exceeds a predetermined threshold value, or a condition for determining whether or not the user looks at a predetermined target. (Col 9, lines 17-20:” Application 302 sends speech segment 318 to speech recognition application 304 for transcription. Application 302 receives from speech recognition application 304 transcript 320, which corresponds to speech segment 318.”, and Col 9, lines 39-45:” Component 322 compares the computed speech rate with speech rate threshold 312 to determine whether the speech rate in speech segment 318 is faster or slower than speech rate threshold 312. If the speech rate of speech segment 318 is faster than threshold 312, component 322 reduces the pause duration threshold for pause detection in a subsequent speech segment, and vice versa.”, and Col 9, lines 46-53:” Only as an example and without implying any limitation thereto, in one embodiment, the reduction [or increase] in the pause duration threshold is proportional to the ratio by which the computed speech rate is higher [or lower] than speech rate threshold 312. For example, if the speech rate of speech segment is ten percent higher than speech rate 
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson and Kim in view of Carter to include a condition for determining whether or not the speech rate of the user exceeds a predetermined threshold, in order to segment a continuous speech into appropriately sized discrete phrase or sentence fragments which can trigger the automatic generation of a corresponding textual transcript in real time or near real time, as evidenced by Carter (See Col 3, lines 12-16).

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Johnson and Kim as applied to claim 1, in further view of Finkelstein.

With regard to claim 10, Johnson teaches an information processing device as established above.
Johnson further teaches, wherein the entity information (Entity) includes, as a type thereof, and when the divided speech sentence including entity information (Entity) of a Body type does not exist, the estimation unit is further configured to estimate an intention of the speech of the user in accordance with intentions (Intents) of the respective divided speech sentences. (Col. 7, lines 61-66:” For example, the NER module 260 may parse the query to identify words as subject, object, verb, preposition, etc., based on grammar rules and models, prior to recognizing named entities. The identified verb may be used by the IC module 264 to identify intent, which is then used by the NER module 262 to identify frameworks.” Note: Johnson’s teaching determines intent information independently of the body information.)
Neither Johnson nor Kim teaches a Body type representing that a free speech is included.
Finkelstein teaches a Body type representing that a free speech is included (Par. 0381:” When included, input subsystem 770 may comprise or interface with one or more user-input devices. In some embodiments, the input subsystem may comprise or interface with selected natural user input (NUI) componentry. Such componentry may be integrated or peripheral, and the transduction and/or processing of input actions may be handled on-or off-board. Example NUI componentry may include a microphone for speech and/or voice recognition; an infrared, color, stereoscopic, and/or depth camera for machine vision and/or gesture recognition; a head tracker, eye tracker, accelerometer, and/or gyroscope for motion detection, gaze detection, and/or intent recognition; electric-field sensing componentry for assessing brain activity; any of the sensors described with respect to the example use cases and environments discussed above; and/or any other suitable sensor.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson and Kim in view of Finkelstein to use body type and entity information in order to ensure that a user's requests and intentions are fully captured, as evidenced by Finkelstein (See Par. 0002).

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Johnson and Kim as applied to claim 1, in further view of Jeschke et al. (US 20050267759 A1) (hereinafter "Jeschke").

Jeschke was applied in the previous office action.
With regard to claim 11, Johnson teaches an information processing device as established above.
Neither Johnson nor Kim teaches wherein, when the speech of the user includes an intention (Intent) of retraction, the estimation unit is further configured to delete a divided speech sentence to be retracted, from a target of intention estimation of the speech of the user.
Jeschke teaches wherein, when the speech of the user includes an intention (Intent) of retraction, the estimation unit is further configured to delete a divided speech sentence to be retracted, from a target of intention estimation of the speech of the user (Par. 0009:” The system further provides a method that may include the steps of interrupting the speech dialogue upon receipt of a predetermined pause command by the SDS and continuing the speech dialogue upon receipt of a predetermined continuation command. The method may provide a step for canceling the speech dialogue upon receipt of a predetermined cancellation command by the SDS. The cancellation command may be a speech command or the command may be transmitted by the user activating a control key or switch.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson and Kim in view of Jeschke to include an intention of retraction in order to receive the commands for interrupting and continuing the speech dialogue by receiving the pause and continue commands, as evidenced by Jeschke (See Par. 0007).

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Johnson, Kim, and Jeschke as applied to claim 11, in further view of Roy.

With regard to claim 12, Johnson teaches an information processing device as established above.
Neither Johnson nor Kim teaches wherein, when an nth divided speech sentence includes an intention (Intent) of retraction, the estimation unit is further configured to delete an (n-1)th divided speech sentence from a target of intention estimation of the speech of the user.
Roy teaches wherein, when an nth divided speech sentence includes an intention (Intent) of retraction, the estimation unit is further configured to delete an (n-1)th divided speech sentence from a target of intention estimation of the speech of the user. (col. 19 lines 1-25:”Based on the determined command (intent) elements, the logical command processor determines which command elements are present in the representation of the speech input. If all the command elements required for completeness are present, the command is executed. Otherwise the system builds and registers grammars with the speech recognizer for the missing command elements and the user is prompted for the missing command elements. This prompting can take place one element at-a-time, or all the missing elements can be requested in a single prompt. The system waits for and receives the subsequent user speech input, which is processed by the speech recognizer. If an abort, cancel or fail condition are not present in the input or otherwise triggered, for example by time or exceeding a predetermined number of loops, a representation of the applicable speech input is parsed into the command structure in the memory location maintaining the representation of the prior speech input. Speech input which is not applicable because if falls outside the scope of the relevant grammars is preferably ignored, although in some embodiments this determination of relevance may be made by the logical command processor. Additionally, the system may receive input from other input devices such as a keyboard, mouse or handwriting device, and this input is likewise added to the command structure in the representation of the speech input.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson and Kim in view of Roy to include an intention of retraction in order to use a logical command processor to provide additional processing of a portion of the output of a speech recognizer which cannot be processed by the speech recognizer itself, as evidenced by Roy (See Col. 1, lines 30-33).

Claims 13 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Johnson and Kim as applied to claim 1, in further view of Ciurpita et al. (US20030023439A1) (hereinafter "Ciurpita").

Ciurpita was applied in the previous office action.
With regard to claims 13 and 14, Johnson teaches an information processing device as established above.
With regard to claim 13, neither Johnson nor Kim teaches further comprising a generation unit configured to generate feedback information to be output at the detected breakpoint of the speech, wherein the generation unit is implemented via at least one processor.
recognize sequences of speech units between these natural pauses of a human and provide useful feedback. In other words, the system takes advantage of these natural pauses between utterances to provide feedback to the user.”, and Par. 0026:” The system of the present invention may be embodied as a single digital signal processor (DSP) capable of performing voice recognition and feedback, and may include a VR engine, system controller, and text-to-speech (TTS) generator.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson and Kim in view of Ciurpita to generate feedback to be output in order to provide feedback after each subgroup by repeating the recognition results, as evidenced by Ciurpita (See Par. 0010).

With regard to claim 14, neither Johnson nor Kim teaches wherein the feedback information includes a voice, a sound effect, or an image.
Ciurpita teaches wherein the feedback information includes a voice, a sound effect, or an image (Par. 0068:” … and sends feedback data to a Text-to-Speech Generator (TTS) 175 for suitable processing before the audio feedback is sent to a user of the system 100.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson, and Kim in view of .

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Johnson, Kim, Eaglemen, and Solomon as applied to claim 3, in further view of Aravamudan et al. (US 20140337370 A1) (hereinafter "Aravamudan").

Aravamudan was applied in the previous office action.
With regard to claim 15, Johnson teaches an information processing device as established above. 
Johnson, Kim, Eaglemen, and Solomon do not teach wherein the detection unit is further configured to detect the breakpoint of the speech on a basis of the result of the recognition of the voice data, when a time of a pause of the speech of the user exceeds a fixed time, when a boundary of an intonation phrase included in the speech of the user is detected, or when falter or filler included in the speech of the user is detected.
Aravamudan teaches wherein the detection unit is further configured to detect the breakpoint of the speech on a basis of the result of the recognition of the voice data (Par 0054:” In the case of demarcating the phrase boundary, the user may confidently speak the portion following the pause. Accordingly, the present system can determine the portion following the pause as a certain phrase or title based on the loudness or speed of the speaker's voice. Another method to distinguish whether the portion following the pause is a confident phrase or an uncertain phrase can be based on a further utterance following the initial pause.”)
pauses--the length of a pause being used as a metric for deciding what results to present.”, and Par. 0055:” the presence of a pause within the speech input can be used as a confidence measure of portions of the input itself. The interpretation of the duration of pauses and their frequency of occurrence is also factored in by embodiments of the present invention to distinguish the cases of user just speaking slowly [so that speech recognition may work better] versus pausing to perform cognitive recall.”)
or when falter or filler included in the speech of the user is detected. (Par. 0052:” In addition to the use of pauses, other forms of disfluencies, including auditory time fillers, are used in speech processing. In the event user speaks additive filler words or sounds to accompany a pause, those filler words and sounds may be recognized as pause additives by the downstream modules that process the output of the speech-to-text engine. For instance, use of filler words such as "like" followed by pause, or sounds such as "umm," "hmm," "well," "uh," and "eh" followed by a pause are also considered collectively as a pause with the overall pause duration including the duration of utterance of the filler words. In other embodiments, auditory filler words are not followed by a pause. Typically, auditory time fillers are continuous and lack variations in tone and volume. These characteristics may aid the detection of auditory time fillers.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson, Kim, Eaglemen, and Solomon in view of Aravamudan to detect the breakpoint of the speech from the voice data, or .

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Johnson, Kim, Eagleman, and Solomon as applied to claim 3.

With regard to claim 16, Johnson teaches an information processing device as established above. 
Johnson, Eagleman, and Solomon do not teach wherein the detection unit is further configured to detect the breakpoint of the speech on a basis of the result of the recognition of the image data when a time in which a mouth of the user does not move exceeds a fixed time, or when a movement of a visual line of the user that exceeds a predetermined threshold is detected.
Kim teaches wherein the detection unit detects the breakpoint of the speech on a basis of the result of the recognition of the image data when a time in which a mouth of the user does not move exceeds a fixed time, or when a movement of a visual line of the user that exceeds a predetermined threshold is detected. (Par. 0004:”In some implementations, a system is capable improving endpoint detection of a voice query submitted by a user. For instance, the system may initially obtain audio data encoding the submitted voice query, and video data synchronized with the obtained audio data that includes images of the user's face when voice query. … As an example, the system initially determines a sequence of video frames that includes images of a face of the user. The system then identifies a sequence of video frames that includes images of detected lip movement.”, and Par. 0006: ”…receiving synchronized video data and audio data; determining that a sequence of frames of the video data includes images corresponding to lip movement on a face; endpointing the audio data based on first audio data that corresponds to a first frame of the sequence of frames and second audio data that corresponds to a last frame of the sequence of frames; generating, by an automated speech recognizer, a transcription of the endpointed audio data; and providing the generated transcription for output.”, and Par. 0052:” For instance, the lip movement module 124 may be capable of identifying lip movement patterns within the detected lip movement 109, and then determining terms and/or phrases that are predetermined to be associated with the identified lip movement patterns.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson, Eagleman, and Solomon in view of Kim to detect the breakpoint based on the user’s image in order to verify audio data against detected lip movement data indicating terms and/or phrases spoken by the user to identify and/or correct misrecognized terms, as evidenced by Kim (See Par. 0005).

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over  Johnson, Kim, Eaglemen, and Solomon as applied to claim 3.


Johnson, Kim, Eaglemen, and Solomon do not teach wherein the detection unit is further configured to detect the breakpoint of the speech on a basis of the result of the recognition of the sensor data, when intake of breath of the user is detected, or a movement of an entire or a part of a body of the user is detected.
Kim teaches wherein the detection unit is further configured to detect the breakpoint of the speech on a basis of the result of the recognition of the sensor data, when intake of breath of the user is detected, or a movement of an entire or a part of a body of the user is detected. (Par. 0063:” the face detection module 224 transmits the video data 206b and the audio data 204a to the lip movement module 226, which then synchronizes the video data and the audio data and identifies detected lip movement data, e.g., the lip movement data 109, as described above. The query endpoint module 228 then segments the synchronized audio data based on the detected lip movement data, and generates a transcription 208a for an audio segment”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson, Kim, Eaglemen, and Solomon in view of Kim to use movement to detect the breakpoint in order to verify audio data against detected lip movement data indicating terms and/or phrases spoken by the user to identify and/or correct misrecognized terms, as evidenced by Kim (See Par. 0005).

Claims 18 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Johnson and Kim as applied to claim 1, in further view of Homma et al. (US 20160188574 A1) (hereinafter "Homma").

Homma was applied in the previous office action.
With regard to claims 18 and 19, Johnson teaches an information processing device as established above.
With regard to claim 18, neither Johnson nor Kim teaches further comprising a task execution unit configured to execute a task on a basis of a result of intention estimation of the speech of the user, wherein the task execution unit is implemented via at least one processor.
Homma teaches further comprising a task execution unit configured to execute a task on a basis of a result of intention estimation of the speech of the user, wherein the task execution unit is implemented via at least one processor. (Par. 0031:” a transmission unit that transmits the input that is input to the input unit by the user to the intention estimation equipment; and an execution unit that receives an intention estimation result on the input by the user performed by the intention estimation equipment and acts according to the intention estimation result.”, and Par. 0056:” The control unit 1070 includes a CPU, a ROM, and a RAM.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Johnson and Kim in view of Homma to execute a task based on the result of intention estimation in order to increase the accuracy of intention estimation by using log information of natural language that is recorded when the electronic equipment is actually utilized by the user, as evidenced by Homma (See Par. 0009).

With regard to claim 19, Johnson further teaches further comprising: a speech recognition unit configured to perform speech recognition (ASR) for obtaining the speech sentence from the speech of the user. (Col. 3, lines 36-39:” The system may perform [160] ASR processing on the audio data and may determine [162] an endpoint of the speech using the audio data, direction, and duration”, and Col. 4, lines 13-16:”The ASR component 250 converts the audio into text. The ASR component 250 thus transcribes audio data into text data representing the words of the speech contained in the audio data.”)
and a semantic analysis unit configured to perform semantic analysis (NLU) of the divided speech sentence to be sequentially obtained at the breakpoint of the speech. (Col. 6, lines 20-28:”The NLU process takes textual input [such as processed from ASR 250 based on the utterance 11] and attempts to make a semantic interpretation of the text. That is, the NLU process determines the meaning behind the text based on the individual words and then implements that meaning. NLU processing 260 interprets a text string to derive an intent or a desired action from the user as well as the pertinent pieces of information in the text that allow a device (e.g., device 110) to complete that action.”)
wherein the speech recognition unit and the semantic analysis unit are each implemented via at least one processor. (Col. 6, lines 5-7:” The device performing NLU processing 260 [e.g., server 120] may include various components, including potentially dedicated processor[s], memory, storage, etc.”).

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 


Any inquiry concerning this communication or earlier communications from the examiner should be directed to DARIOUSH AGAHI whose telephone number is (408)918-7689.  The examiner can normally be reached on Monday - Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/D.A./Examiner, Art Unit 2656

/Paras D Shah/Primary Examiner, Art Unit 2659
05/16/2021