DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification
The title of the invention is not descriptive.  A new title is required that is clearly indicative of the invention to which the claims are directed. 
The following title is suggested: DEVICE, METHOD AND COMPUTER-READABLE STORAGE MEDIUM FOR VOICE RECOGNITION CONTROL.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1 and 5-11 are rejected under 35 U.S.C. 103 as being unpatentable over Yoon (US 2018/0358013 A1, herein “Yoon”) in view of Kane et al. (US 2019/0355352 A1, herein “Kane”).
Regarding claim 1, Yoon teaches an information processing device comprising (Yoon paras. 52 and 54, apparatus for selecting a task by processing an electrical signal containing sound wave information): 
processing circuitry (Yoon figs. 1 and 2, processor 20) to acquire a voice signal representing voices corresponding to a plurality of utterances made by one or more users (Yoon paras. 55 and 59, processor receives an electrical voice signal from the sound receiver, where the signal has multiple voices V1 – Vn, uttered by n speakers S1 - Sn); 
to recognize the voices from the voice signal (Yoon fig. 2, voice separation 30, paras. 75 and 81, processor separates voice signals corresponding to the voices received from the speakers from each other, and further, classifies the respective speakers by identifying them), convert the recognized voices into character strings to identify the plurality of utterances (Yoon para. 75, the respective voice signals are converted to text), and identify times corresponding to the respective utterances (Yoon paras. 172-174, fig. 13, voice signals are analyzed according to a time point at which the voice signal is received, such as t11 for the utterance from speaker 203, and t12 for the utterance from speaker 201); 
to identify users who have made the respective utterances, as speakers from among the one or more users (Yoon fig. 2, paras. 75, 81, and 153, classification of the respective speakers by identifying them, and further, identifying them by where they are seated as occupants in a car); 
to store utterance history information including a plurality of records (Yoon figs. 3, and 7, paras. 123-125, tables featuring various information (records) about the utterances (after they are spoken and processed, thus history information) are stored in storage portion 80), the plurality of records indicating the respective utterances, and the speakers corresponding to the respective utterances (Yoon fig. 7, paras. 125-128, a decided task table stored in storage portion 80 including who the speaker is (driver, previous driver, non-driver) and a keyword uttered relating to a particular task); 
to estimate meanings of the respective utterances (Yoon paras. 113-116, by extracting keywords from the text of utterances from identified speakers, and comparing the keywords to a task database, the processor determines (estimates meaning) an utterance to be directed towards a voice command); 
to perform a determination process of referring to the utterance history information (Yoon paras. 127-132, a task table created from the sequence of utterances (utterance history) is referenced to determine priorities among the uttered task requests) and when a last utterance of the plurality of utterances and one or more utterances of the plurality of utterances immediately preceding the last utterance are not a conversation (Yoon fig. 13, paras. 169-173, in an exemplary embodiment, speakers 201-203 are speaking utterances along a timeline as shown, and where at least the last two utterances are commands, and not part of a conversation), determining that the last utterance is a voice command for controlling a target (Yoon fig. 13, paras. 172-173, the utterance from speaker 201 (which is the last one on the timeline) is identified and ranked first for a task command to control the car system to begin a gas station search task); and 
to, when it is determined that the last utterance is the voice command, control the target in accordance with the meaning estimated from the last utterance (Yoon fig. 13, para. 173, according to the decided processing order, where in the example given, “Hey, find the nearest gas station” is the last utterance, the task “find gas station” is determined from the utterance as a voice command, and the processor processes the gas station search).
While Yoon teaches its system is aware of the times that utterances are made (see fig. 13, and times listed for respective utterances), Yoon does not explicitly teach that the times of the utterances are stored also in the tables stored in storage portion 80. Therefore, Yoon does not explicitly teach “the times corresponding to the respective utterances.”
Kane teaches the times corresponding to the respective utterances (Kane para. 3, a plurality of utterances are grouped into conversation threads according to an utterance time, where paras. 60-61 teach that a system memory stores the data associated with the users for the grouping based on content-agnostic factors, such as the utterance time).
Therefore, taking the teachings of Yoon and Kane together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the stored task table of Yoon to include an utterance time as disclosed in Kane at least because doing so would allow speech recognition to evolve over time, resulting in increased speed and accuracy (see Kane para. 65).
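For illustration only, the combination discussed above (a stored utterance record that also carries an utterance time, consulted to decide whether the last utterance is a voice command) can be sketched as follows. This is a minimal sketch and not the implementation of Yoon, Kane, or the application: the names UtteranceRecord and is_voice_command, the is_conversation predicate, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UtteranceRecord:
    """One record of the utterance history (hypothetical layout)."""
    time: float    # time at which the utterance was received, in seconds
    speaker: str   # identified speaker
    text: str      # character string converted from the recognized voice


def is_voice_command(
    history: List[UtteranceRecord],
    is_conversation: Callable[[List[UtteranceRecord], UtteranceRecord], bool],
) -> bool:
    """Return True when the last utterance and the utterances preceding it
    are judged not to be a conversation.

    The is_conversation predicate is an assumption supplied by the caller;
    how that judgment is made is outside this sketch.
    """
    if not history:
        return False
    ordered = sorted(history, key=lambda record: record.time)
    last, preceding = ordered[-1], ordered[:-1]
    return not is_conversation(preceding, last)


# Illustrative values only; the first utterance text is made up.
history = [
    UtteranceRecord(time=11.0, speaker="speaker 203", text="turn on the radio"),
    UtteranceRecord(time=12.0, speaker="speaker 201",
                    text="Hey, find the nearest gas station"),
]
print(is_voice_command(history, lambda preceding, last: False))
```

Storing the time in each record lets the history be ordered so that the last utterance and those immediately preceding it can be identified.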
Regarding claim 5, Yoon teaches wherein the processing circuitry identifies a pattern of an utterance group including the last utterance, from among a plurality of predetermined patterns (Yoon para. 111, fig. 13, paras. 171-174, utterances are determined to be within a group that are uttered by a present driver (output of predetermined patterns including present driver, past driver, and non-driver), where the driver has made the last utterance in the example given), and wherein how to determine whether the last utterance is the voice command depends on the identified pattern (Yoon para. 172, the last utterance has a matching keyword for a task, and is determined to be a voice command, whereupon a gas station search task is determined with priority 1 since it is the driver saying the utterance).
Regarding claim 6, Yoon does not teach the limitations of claim 6. Kane teaches wherein the processing circuitry acquires an image signal representing an image of a space in which the one or more users exist, determines, from the image, a number of the one or more users (Kane paras. 24 and 26, speech recognition unit analyzes image signals from an optic sensor, to identify users in an environment 202 using faceprints), and performs the determination process when the determined number is not less than 2 (Kane para. 20, the system monitors the environment 202 which includes a plurality of users (thus not less than two)).
Therefore, taking the teachings of Yoon and Kane together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the utterance processing of Yoon to include facial identification of speakers as disclosed in Kane at least because doing so would allow speech recognition to evolve over time, resulting in increased speed and accuracy (see Kane para. 65).
Regarding claim 7, Yoon teaches controls the target in accordance with the meaning estimated from the last utterance (Yoon fig. 13, paras. 171-174, the system is controlled to conduct a gas station search based on identified keywords (meaning) from the last utterance).
Yoon does not explicitly teach wherein when the determined number is 1, the processing circuitry controls. 
Kane teaches wherein when the determined number is 1, the processing circuitry controls (Kane para. 27, speech recognition unit uses faceprints to find a match to allow one or more (thus including when it is just one) users in the environment to be uniquely identified, and controls the conversation recognition system using the profile from the identified user from the faceprint).
Therefore, taking the teachings of Yoon and Kane together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the utterance processing of Yoon to include facial identification of speakers as disclosed in Kane at least because doing so would allow speech recognition to evolve over time, resulting in increased speed and accuracy (see Kane para. 65).
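For illustration only, the claimed gating on the determined number of users (claims 6 and 7) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name should_control and the determination_process callback are hypothetical, and the count of users is assumed to have been determined from the image already.

```python
from typing import Callable


def should_control(num_users: int,
                   determination_process: Callable[[], bool]) -> bool:
    """Decide whether the target is controlled by the last utterance.

    num_users is the number of users determined from the image.
    determination_process is a hypothetical callback that refers to the
    utterance history and reports whether the last utterance is a command.
    """
    if num_users >= 2:
        # Not less than 2 users: perform the determination process.
        return determination_process()
    if num_users == 1:
        # Exactly 1 user: control the target without the determination process.
        return True
    # No user determined from the image: nothing to control (an assumption).
    return False


print(should_control(1, lambda: False))
print(should_control(2, lambda: False))
```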
Regarding claim 8, Yoon teaches wherein the processing circuitry determines a topic of the last utterance and determines whether the determined topic is a predetermined specific topic (Yoon paras. 110-113, and 118-119, keywords representing tasks (thus directed towards various topics) are extracted from each separated voice signal (utterance) including a last utterance, as shown in fig. 13, where the keywords detected from respective text are different, and have different predefined priorities, where words related to safety have a high priority, and words related to non-safety (predetermined specific topic) have a low priority), and performs the determination process when the determined topic is not the predetermined specific topic (Yoon paras. 118-119, 131, 139, 174, when the detected keywords are not non-safety (that is, they are safety related) a priority scheme is employed to determine if the utterance is from a driver, and the priority of the task per the task DB).
Regarding claim 9, Yoon teaches wherein when the determined topic is the predetermined specific topic, the processing circuitry controls the target in accordance with the meaning estimated from the last utterance (Yoon paras. 118-119, 137 and 140, when the detected keywords are non-safety related, the identified tasks thereof are given low priority, and the resulting commands that control operation of various systems like the air conditioner or radio are executed after higher priority tasks (thus in accordance with the meaning, when, as in fig. 13, the last utterance is safety related and given higher priority)).
Regarding claim 10, Yoon teaches an information processing method comprising (Yoon para. 10, method for selecting a task by extracting keywords from received voices of a plurality of speakers):
acquiring a voice signal representing voices corresponding to a plurality of utterances made by one or more users (Yoon paras. 55 and 59, processor receives an electrical voice signal from the sound receiver, where the signal has multiple voices V1 – Vn, uttered by n speakers S1 - Sn); 
recognizing the voices from the voice signal (Yoon fig. 2, voice separation 30, paras. 75 and 81, processor separates voice signals corresponding to the voices received from the speakers from each other, and further, classifies the respective speakers by identifying them);
converting the recognized voices into character strings to identify the plurality of utterances (Yoon para. 75, the respective voice signals are converted to text);
identifying times corresponding to the respective utterances (Yoon paras. 172-174, fig. 13, voice signals are analyzed according to a time point at which the voice signal is received, such as t11 for the utterance from speaker 203, and t12 for the utterance from speaker 201); 
identifying users who have made the respective utterances, as speakers from among the one or more users (Yoon fig. 2, paras. 75, 81, and 153, classification of the respective speakers by identifying them, and further, identifying them by where they are seated as occupants in a car); 
estimating meanings of the respective utterances (Yoon paras. 113-116, by extracting keywords from the text of utterances from identified speakers, and comparing the keywords to a task database, the processor determines (estimates meaning) an utterance to be directed towards a voice command);
referring to utterance history information (Yoon paras. 127-132, a task table created from the sequence of utterances (utterance history) is referenced to determine priorities among the uttered task requests) including a plurality of records (Yoon figs. 3, and 7, paras. 123-125, tables featuring various information (records) about the utterances (after they are spoken and processed, thus history information) stored in storage portion 80), the plurality of records indicating the respective utterances, and the speakers corresponding to the respective utterances (Yoon fig. 7, paras. 125-128, a decided task table stored in storage portion 80 including who the speaker is (driver, previous driver, non-driver) and a keyword uttered relating to a particular task), and when a last utterance of the plurality of utterances and one or more utterances of the plurality of utterances immediately preceding the last utterance are not a conversation (Yoon fig. 13, paras. 169-173, in an exemplary embodiment, speakers 201-203 are speaking utterances along a timeline as shown, and where at least the last two utterances are commands, and not part of a conversation), determining that the last utterance is a voice command for controlling a target (Yoon fig. 13, paras. 172-173, the utterance from speaker 201 (which is the last one on the timeline) is identified and ranked first for a task command to control the car system to begin a gas station search task); and 
when it is determined that the last utterance is the voice command, control the target in accordance with the meaning estimated from the last utterance (Yoon fig. 13, para. 173, according to the decided processing order, where in the example given, “Hey, find the nearest gas station” is the last utterance, the task “find gas station” is determined from the utterance as a voice command, and the processor processes the gas station search).
While Yoon teaches its system is aware of the times that utterances are made (see fig. 13, and times listed for respective utterances), Yoon does not explicitly teach that the times of the utterances are stored also in the tables stored in storage portion 80. Therefore, Yoon does not explicitly teach “the times corresponding to the respective utterances.”
Kane teaches the times corresponding to the respective utterances (Kane para. 3, a plurality of utterances are grouped into conversation threads according to an utterance time, where paras. 60-61 teach that a system memory stores the data associated with the users for the grouping based on content-agnostic factors, such as the utterance time).
Therefore, taking the teachings of Yoon and Kane together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the stored task table of Yoon to include an utterance time as disclosed in Kane at least because doing so would allow speech recognition to evolve over time, resulting in increased speed and accuracy (see Kane para. 65).
Regarding claim 11, Yoon teaches a non-transitory computer-readable storage medium storing a program for causing a computer (Yoon paras. 65-67, processor executing a program stored in storage portion 80) 
to acquire a voice signal representing voices corresponding to a plurality of utterances made by one or more users (Yoon paras. 55 and 59, processor receives an electrical voice signal from the sound receiver, where the signal has multiple voices V1 – Vn, uttered by n speakers S1 - Sn); 
to recognize the voices from the voice signal (Yoon fig. 2, voice separation 30, paras. 75 and 81, processor separates voice signals corresponding to the voices received from the speakers from each other, and further, classifies the respective speakers by identifying them), convert the recognized voices into character strings to identify the plurality of utterances (Yoon para. 75, the respective voice signals are converted to text), and identify times corresponding to the respective utterances (Yoon paras. 172-174, fig. 13, voice signals are analyzed according to a time point at which the voice signal is received, such as t11 for the utterance from speaker 203, and t12 for the utterance from speaker 201); 
to identify users who have made the respective utterances, as speakers from among the one or more users (Yoon fig. 2, paras. 75, 81, and 153, classification of the respective speakers by identifying them, and further, identifying them by where they are seated as occupants in a car); 
to store utterance history information including a plurality of records (Yoon figs. 3, and 7, paras. 123-125, tables featuring various information (records) about the utterances (after they are spoken and processed, thus history information) are stored in storage portion 80), the plurality of records indicating the respective utterances, and the speakers corresponding to the respective utterances (Yoon fig. 7, paras. 125-128, a decided task table stored in storage portion 80 including who the speaker is (driver, previous driver, non-driver) and a keyword uttered relating to a particular task); 
to estimate meanings of the respective utterances (Yoon paras. 113-116, by extracting keywords from the text of utterances from identified speakers, and comparing the keywords to a task database, the processor determines (estimates meaning) an utterance to be directed towards a voice command); 
to perform a determination process of referring to the utterance history information (Yoon paras. 127-132, a task table created from the sequence of utterances (utterance history) is referenced to determine priorities among the uttered task requests) and when a last utterance of the plurality of utterances and one or more utterances of the plurality of utterances immediately preceding the last utterance are not a conversation (Yoon fig. 13, paras. 169-173, in an exemplary embodiment, speakers 201-203 are speaking utterances along a timeline as shown, and where at least the last two utterances are commands, and not part of a conversation), determining that the last utterance is a voice command for controlling a target (Yoon fig. 13, paras. 172-173, the utterance from speaker 201 (which is the last one on the timeline) is identified and ranked first for a task command to control the car system to begin a gas station search task); and 
to, when it is determined that the last utterance is the voice command, control the target in accordance with the meaning estimated from the last utterance (Yoon fig. 13, para. 173, according to the decided processing order, where in the example given, “Hey, find the nearest gas station” is the last utterance, the task “find gas station” is determined from the utterance as a voice command, and the processor processes the gas station search).
While Yoon teaches its system is aware of the times that utterances are made (see fig. 13, and times listed for respective utterances), Yoon does not explicitly teach that the times of the utterances are stored also in the tables stored in storage portion 80. Therefore, Yoon does not explicitly teach “the times corresponding to the respective utterances.”
Kane teaches the times corresponding to the respective utterances (Kane para. 3, a plurality of utterances are grouped into conversation threads according to an utterance time, where paras. 60-61 teach that a system memory stores the data associated with the users for the grouping based on content-agnostic factors, such as the utterance time).
Therefore, taking the teachings of Yoon and Kane together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the stored task table of Yoon to include an utterance time as disclosed in Kane at least because doing so would allow speech recognition to evolve over time, resulting in increased speed and accuracy (see Kane para. 65).
Claims 2-4 are rejected under 35 U.S.C. 103 as being unpatentable over Yoon in view of Kane, as set forth above regarding claim 1 from which claims 2-4 depend, further in view of Song et al., “Dialogue Session Segmentation by Embedding-Enhanced TextTiling,” Proc. Interspeech 2016, pp. 2706-2710, arXiv:1610.03955v1 [cs.CL] (herein “Song NPL”).
Regarding claim 2, Yoon does not explicitly teach the limitations of claim 2. Kane teaches wherein the processing circuitry indicating a degree of matching between the last utterance and the one or more utterances in terms of context (Kane paras. 33-39, language model clusters (indicates a degree of matching) between the utterances to find utterances belonging to a particular conversation, by comparing linguistic features of the utterances (including context) against predetermined thresholds, and where fig. 7 illustrates a timeline of utterances with a last utterance at the bottom), and when the context matching rate is not greater than a predetermined threshold, determines that the last utterance and the one or more utterances are not a conversation (Kane para. 34, utterances with linguistic features that do not meet the predetermined threshold will not be clustered into a particular conversation).
Song NPL teaches calculates a context matching rate (Song NPL section 3.3, similarity between two utterances in a dialogue (calculated) according to equation 7 under a heuristic-max method).
Therefore, taking the teachings of Yoon and Kane together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the language processing of Yoon to include considerations of grouping utterances contextually as disclosed in Kane at least because doing so would allow speech recognition to evolve over time, resulting in increased speed and accuracy (see Kane para. 65).
Further, taking the teachings of Yoon and Song NPL together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the language processing of Yoon to include calculating a similarity score as disclosed in Song NPL at least because doing so would allow for greater understanding of the meaning expressed in open-domain conversations (see Song NPL section 1).
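For illustration only, the threshold comparison recited in claim 2 can be sketched as follows. This is a minimal sketch under stated assumptions: the context matching rate is computed here as a simple word-overlap (Jaccard) score, which is a stand-in chosen for brevity and is not equation 7 of Song NPL or the language model of Kane; the function names, the sample utterances, and the threshold value are hypothetical.

```python
from typing import List


def context_matching_rate(last: str, preceding: List[str]) -> float:
    """Stand-in similarity between the last utterance and the preceding
    utterances: shared words divided by all distinct words (Jaccard)."""
    last_words = set(last.lower().split())
    prev_words = {word for utt in preceding for word in utt.lower().split()}
    union = last_words | prev_words
    if not union:
        return 0.0
    return len(last_words & prev_words) / len(union)


def is_not_conversation(rate: float, threshold: float) -> bool:
    """The utterances are not a conversation when the context matching
    rate is not greater than the predetermined threshold."""
    return rate <= threshold


rate = context_matching_rate("find the nearest gas station",
                             ["is the gas tank empty"])
print(rate, is_not_conversation(rate, threshold=0.5))
```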
Regarding claim 3, Yoon does not explicitly teach the limitations of claim 3. Kane teaches wherein the processing circuitry indicating a degree of matching between the last utterance and the one or more utterances in terms of context (Kane paras. 33-39, language model clusters (indicates a degree of matching) between the utterances to find utterances belonging to a particular conversation, by comparing linguistic features of the utterances (including context) against predetermined thresholds, and where fig. 7 illustrates a timeline of utterances with a last utterance at the bottom), determines a weight that decreases the context matching rate as a time interval between the last utterance and the utterance immediately preceding the last utterance increases (Kane paras. 40-41, utterances are grouped into conversations according to the utterance time, where differences in utterance time are compared to a threshold to determine likelihood (weighting) of the utterances being in the same conversation (having a higher context matching rate), where utterances closer in time have a higher likelihood of being grouped together – thus the distance in time being a weight decreasing likelihood of context match/same conversation, and where fig. 7 illustrates a timeline of utterances with a last utterance at the bottom, and the utterance immediately preceding the last one shown), and when a value obtained by correcting the context matching rate with the weight is not greater than a predetermined threshold, determines that the last utterance and the one or more utterances are not a conversation (Kane paras. 40-42, utterances with a time difference that when compared to a threshold results in a lesser likelihood that the utterances are of the same conversation are separated from each other (determined not to be a conversation) based on the determined likelihoods).
Song NPL teaches calculates a context matching rate (Song NPL section 3.3, similarity between two utterances in a dialogue (calculated) according to equation 7 under a heuristic-max method).
Therefore, taking the teachings of Yoon and Kane together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the language processing of Yoon to include considerations of grouping utterances as disclosed in Kane at least because doing so would allow speech recognition to evolve over time, resulting in increased speed and accuracy (see Kane para. 65).
Further, taking the teachings of Yoon and Song NPL together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the language processing of Yoon to include calculating a similarity score as disclosed in Song NPL at least because doing so would allow for greater understanding of the meaning expressed in open-domain conversations (see Song NPL section 1).
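For illustration only, the time-dependent weighting recited in claim 3 can be sketched as follows. This is a minimal sketch under stated assumptions: the exponential decay, the decay constant, and the function names are hypothetical choices and are not taken from Kane, Song NPL, or the application; any weight that decreases as the time interval increases would fit the claim language equally well.

```python
import math


def time_weight(interval_s: float, decay_s: float = 10.0) -> float:
    """Weight in (0, 1] that decreases as the interval between the last
    utterance and the immediately preceding utterance increases."""
    if interval_s < 0:
        raise ValueError("interval must be non-negative")
    return math.exp(-interval_s / decay_s)


def corrected_rate(rate: float, interval_s: float,
                   decay_s: float = 10.0) -> float:
    """Context matching rate corrected with the time weight."""
    return rate * time_weight(interval_s, decay_s)


def is_not_conversation(rate: float, interval_s: float, threshold: float,
                        decay_s: float = 10.0) -> bool:
    """Not a conversation when the corrected value is not greater than
    the predetermined threshold."""
    return corrected_rate(rate, interval_s, decay_s) <= threshold


# Same context matching rate, different gaps between utterances.
print(is_not_conversation(0.8, interval_s=1.0, threshold=0.5))
print(is_not_conversation(0.8, interval_s=30.0, threshold=0.5))
```

With the same uncorrected rate, a longer gap between utterances pushes the corrected value below the threshold, so the utterances are treated as not being a conversation.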
Regarding claim 4, Yoon does not explicitly teach the limitations of claim 4. Kane teaches wherein the processing circuitry a probability (Kane para. 40, a likelihood of utterances being in a conversation is determined) that the one or more utterances lead to the last utterance, by referring to a conversation model (Kane paras. 33-34, 56 and 65, a language model (conversational model) is used to compare linguistic features to group the utterances together (thus determining that earlier utterances are connected to, and “lead” to, the last utterance), where fig. 7 illustrates a last utterance in a timeline of utterances including earlier spoken utterances).
Song NPL teaches calculates, as the context matching rate (Song NPL section 3.3, similarity measure calculated from equation 7). 
Song NPL further teaches trained from conversations conducted by a plurality of users (Song NPL section 4.1, word embeddings used to determine the similarity measures are trained from a dataset comprising 3 million utterances from a public forum (which would have a plurality of users)).
Therefore, taking the teachings of Yoon and Kane together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the language processing of Yoon to include considerations of grouping utterances as disclosed in Kane at least because doing so would allow speech recognition to evolve over time, resulting in increased speed and accuracy (see Kane para. 65).
Further, taking the teachings of Yoon and Song NPL together as a whole, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the language processing of Yoon to include calculating a similarity score as disclosed in Song NPL at least because doing so would allow for greater understanding of the meaning expressed in open-domain conversations (see Song NPL section 1).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Kim et al., US 2019/0378515 A1, directed towards controlling a vehicle based on acquiring utterances from occupants in the vehicle and analyzing the speech and dialog of the speech.
Doshi et al., US 2019/0318759 A1, directed towards detecting an end point of a user’s voice command in an automatic speech recognition based human machine interface.
Furumoto et al., US 2017/0243580 A1, directed towards a speech recognition system for controlling a navigation system that determines whether speech from a user is intended to be directed towards operating the navigation system.


Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M KOETH whose telephone number is (571)272-5908. The examiner can normally be reached Monday-Friday, 09:30-18:30 EDT/EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

MICHELLE M. KOETH
Primary Examiner
Art Unit 2656



/MICHELLE M KOETH/Primary Examiner, Art Unit 2656