DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-7 and 9-15 are rejected under 35 U.S.C. 103 as being unpatentable over Lee 2015/0279356 A1 in view of Leppanen U.S. PAP 2015/0228274.


Regarding claim 1, Lee teaches a method for operating an electronic device managing voice-based interaction in an Internet of things (IoT) network system (speech recognition server, see abstract), the method comprising: 
identifying a first voice utterance of a user from a first IoT device among a plurality of IoT devices in the IoT network system (a plurality of terminals 21, 21-1, 21-2, . . . , and 21-N to receive speech, and a server 23 to recognize speech, see par. [0081]); 
identifying at least one second voice utterance of the user from at least one second IoT device among the plurality of IoT devices in the IoT network system (While being spread as physical waves in a space, the speech sound is received by the plurality of terminals 21, 21-1, 21-2 . . . , and 21-N that are positioned at separate locations. Each of the terminals converts the received speech sound into electronic speech signals, and the converted speech signals may be transmitted to the server 23 through a wired or wireless communication network that transmits and receives electronic signals, see par. [0082]); 
determining a time interval at which voice utterances received with same user ID; determining whether the time interval is less than a threshold interval (may recognize a command by applying a weighted value according to distances between the speaker (U) and each of the terminals that have been calculated based on times of arrival when speech reaches each of the terminals from the speaker, see par. [0083]); 
based on identifying that the time interval is less than the threshold interval, generating a voice command by combining the first voice utterance and the at least one second voice utterance (The electronic signals generated in the terminals may include not only electronic signals converted from physical speech, but also signals recognized as a speech signal that is likely to include a meaningful command, see par. [0082]); 
and triggering at least one IoT device among the plurality of IoT devices in the IoT network system to perform at least one action corresponding to the voice command (the target may perform a specific action corresponding to the received signal in operation 1311, see par. [0157]).
However, Lee does not teach identifying a user identification (ID) by comparing pre-stored voice information with extracted voice information from at least one voice utterance among the first voice utterance and the at least one second voice utterance.
Leppanen teaches that one or more devices in physical proximity of a user of a principal device are identified. Multiple audio samples captured by the identified devices are received. An audio sample comprising a voice of the user of the principal device is selected from among the multiple audio samples captured by the identified devices based on suitability of the audio sample for speech recognition, see abstract. One approach to improving the quality of an audio sample is to utilize an array of microphones, see par. [0002]. One or more of the audio samples received from secondary devices 106, 108, and 110 may include a voice other than the voice of user 102 and multi-device speech recognition apparatus 200 may be configured to compare the received audio samples to a reference audio sample of the voice of user 102 to identify such an audio sample (see par. [0023]). 
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the Lee invention with the teachings of Leppanen for the benefit of improving the quality of the voice samples, see Leppanen, par. [0002].
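
For illustration only, the time-interval limitation of claim 1 discussed above can be sketched as follows. This sketch is not taken from Lee, Leppanen, or the application; the `Utterance` record, its field names, the `combine_if_close` helper, and the 5-second threshold are all hypothetical and are used only to show one way utterances carrying the same user ID could be combined when they arrive within a threshold interval.

```python
from dataclasses import dataclass

# Hypothetical record; the field names are illustrative and do not come
# from the claims or the cited references.
@dataclass
class Utterance:
    device_id: str
    user_id: str
    text: str
    timestamp: float  # seconds

# Assumed value; the rejection above does not state a specific threshold.
THRESHOLD_SECONDS = 5.0

def combine_if_close(first, second, threshold=THRESHOLD_SECONDS):
    """Return the combined command text when both utterances share a user ID
    and the time interval between them is less than the threshold.
    Return None otherwise."""
    if first.user_id != second.user_id:
        return None
    interval = abs(second.timestamp - first.timestamp)
    if interval < threshold:
        # Join the utterances in chronological order.
        earlier, later = sorted((first, second), key=lambda u: u.timestamp)
        return f"{earlier.text} {later.text}"
    return None
```

Under these assumptions, two utterances from the same user captured 2.5 seconds apart by different devices would be joined into one command, while utterances from different users, or utterances separated by more than the threshold, would not be combined.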

Regarding claim 2, Lee teaches the method of claim 1, wherein the first IoT device is located at a first location in the IoT network system, and the at least one second IoT device is located at a second location in the IoT network system, and wherein the first location is different than the second location in the IoT network system (terminals 21, 21-1, 21-2 . . . , and 21-N that are positioned at separate locations, see par. [0082]).
Regarding claim 3, Lee teaches the method of claim 1, wherein the first voice utterance of the user is identified in a first time period and the at least one second voice utterance of the user is identified in a second time period (a processor configured to calculate times of arrival of a speech sound at each of the terminals using speech signals received from each of the terminals, calculate distances between a user and the terminals based on the times of arrival of the speech sound to each of the terminals, see par. [0008]).
Regarding claim 4, Lee teaches the method of claim 1, wherein the at least one action corresponding to the voice command is determined by: dynamically detecting an intent from the voice command (recognize a user’s speech as commands, see par. [0005]); 
and determining the at least one action corresponding to the voice command based on the intent (speech signals that indicate speech recognition results or speech commands are transmitted from the server to a target in operation 1309, and the target may perform a specific action corresponding to the received signal in operation 1311, see par. [0157]).
Regarding claim 5, Leppanen teaches the method of claim 1, wherein the generating of the voice command by combining the first voice utterance and the at least one second voice utterance comprises: recognizing the first voice utterance and the at least one second voice utterance (selecting preferred frames based on their suitability for speech recognition, see par. [0030]); 
determining a confidence level to combine the first voice utterance with the at least one second voice utterance (identify a preferred frame for each portion of time based on their suitability for speech recognition, e.g., based on one or more of the frames' signal-to-noise ratios, amplitude levels, gain levels, or phoneme recognition levels, see par. [0031]); 
and combining the first voice utterance with the at least one second voice utterance based on the confidence level (combining preferred frames to generate hybrid sample, see par. [0030]).
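
For illustration only, the preferred-frame selection cited above from Leppanen, pars. [0030]-[0031], can be sketched as follows. This is a simplified sketch and not Leppanen's implementation: the `Frame` layout, the `build_hybrid_sample` helper, and the use of signal-to-noise ratio as the sole suitability measure are assumptions (Leppanen also lists amplitude levels, gain levels, and phoneme recognition levels).

```python
from typing import Dict, List, Tuple

# Hypothetical frame layout: (snr_db, payload). The payload is a string
# here only to keep the sketch self-contained; real frames would be audio.
Frame = Tuple[float, str]

def build_hybrid_sample(samples: Dict[str, List[Frame]]) -> List[str]:
    """For each portion of time, keep the frame with the highest SNR across
    all devices, then concatenate the preferred frames into one hybrid sample."""
    if not samples:
        return []
    # Only compare time slots that every device captured.
    n_slots = min(len(frames) for frames in samples.values())
    hybrid = []
    for slot in range(n_slots):
        best = max((frames[slot] for frames in samples.values()),
                   key=lambda frame: frame[0])
        hybrid.append(best[1])
    return hybrid
```

In this sketch, the hybrid sample takes each time slot from whichever device recorded that slot with the best signal-to-noise ratio.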

Regarding claim 6, Leppanen teaches the method of claim 5, wherein the determining of the confidence level to combine the first voice utterance with the at least one second voice utterance comprises: determining confidence parameters associated with the first voice utterance and the at least one second voice utterance (A recognition confidence value corresponding to a confidence level that the corresponding text strings accurately reflect the content of the audio samples from which they were generated may then be determined for each of text string outputs 306, 308, and 310. Audio samples 300, 302, and 304, or their respective text string outputs 306, 308, and 310 may be ordered based on their respective recognition confidence values, and the audio sample or text string output corresponding to the greatest confidence level may be selected, see par. [0026]); 
and determining the confidence level of the first voice utterance to combine with the at least one second voice utterance based on the confidence parameters (selecting preferred frames based on their suitability for speech recognition, and combining the preferred frames to form a hybrid sample, see par. [0030]).
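
For illustration only, the ordering by recognition confidence cited above from Leppanen, par. [0026], can be sketched as follows. The `select_most_confident` helper and the (text, confidence) pair layout are hypothetical; the sketch assumes a confidence value has already been computed for each text string output.

```python
def select_most_confident(outputs):
    """outputs: list of (text, confidence) pairs, one per audio sample.
    Order the pairs by confidence and return the text with the greatest
    confidence value, or None when there are no outputs."""
    ranked = sorted(outputs, key=lambda pair: pair[1], reverse=True)
    return ranked[0][0] if ranked else None
```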
Regarding claim 7, Lee teaches the method of claim 1, further comprising: controlling the at least one second IoT device to present a message comprising the first voice utterance based on a distance between the user and the at least one second IoT device (server 23 may recognize a command by applying a weighted value according to distances between the speaker (U) and each of the terminals that have been calculated based on times of arrival when speech reaches each of the terminals from the speaker, see par. [0083]).
Regarding claim 9, Lee teaches an electronic device for managing voice-based interaction in an Internet of things (IoT) network system, the electronic device comprising: at least one processor configured to: 
identify a first voice utterance of a user from a first IoT device among a plurality of IoT devices in the IoT network system (a plurality of terminals 21, 21-1, 21-2, . . . , and 21-N to receive speech, and a server 23 to recognize speech, see par. [0081]); 
identify at least one second voice utterance of the user from at least one second IoT device among the plurality of IoT devices in the IoT network system (While being spread as physical waves in a space, the speech sound is received by the plurality of terminals 21, 21-1, 21-2 . . . , and 21-N that are positioned at separate locations. Each of the terminals converts the received speech sound into electronic speech signals, and the converted speech signals may be transmitted to the server 23 through a wired or wireless communication network that transmits and receives electronic signals, see par. [0082]); 
determine a time interval at which voice utterances received with same user ID; determine whether the time interval is less than a threshold interval (may recognize a command by applying a weighted value according to distances between the speaker (U) and each of the terminals that have been calculated based on times of arrival when speech reaches each of the terminals from the speaker, see par. [0083]); 
based on identifying that the time interval is less than the threshold interval, generate a voice command by combining the first voice utterance and the at least one second voice utterance (The electronic signals generated in the terminals may include not only electronic signals converted from physical speech, but also signals recognized as a speech signal that is likely to include a meaningful command, see par. [0082]); 
and trigger at least one IoT device among the plurality of IoT devices in the IoT network system to perform at least one action corresponding to the voice command (the target may perform a specific action corresponding to the received signal in operation 1311, see par. [0157]).
However, Lee does not teach identify a user identification (ID) by comparing pre-stored voice information with extracted voice information from at least one voice utterance among the first voice utterance and the at least one second voice utterance.
Leppanen teaches that one or more devices in physical proximity of a user of a principal device are identified. Multiple audio samples captured by the identified devices are received. An audio sample comprising a voice of the user of the principal device is selected from among the multiple audio samples captured by the identified devices based on suitability of the audio sample for speech recognition, see abstract. One approach to improving the quality of an audio sample is to utilize an array of microphones, see par. [0002]. One or more of the audio samples received from secondary devices 106, 108, and 110 may include a voice other than the voice of user 102 and multi-device speech recognition apparatus 200 may be configured to compare the received audio samples to a reference audio sample of the voice of user 102 to identify such an audio sample (see par. [0023]). 
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the Lee invention with the teachings of Leppanen for the benefit of improving the quality of the voice samples, see Leppanen, par. [0002].

Regarding claim 10, Lee teaches the electronic device of claim 9, wherein the first IoT device is located at a first location in the IoT network system, and the at least one second IoT device is located at a second location in the IoT network system, and wherein the first location is different than the second location in the IoT network system (terminals 21, 21-1, 21-2 . . . , and 21-N that are positioned at separate locations, see par. [0082]).
Regarding claim 11, Lee teaches the electronic device of claim 9, wherein the first voice utterance of the user is identified in a first time period and the at least one second voice utterance of the user is identified in a second time period (a processor configured to calculate times of arrival of a speech sound at each of the terminals using speech signals received from each of the terminals, calculate distances between a user and the terminals based on the times of arrival of the speech sound to each of the terminals, see par. [0008]).
Regarding claim 12, Lee teaches the electronic device of claim 9, wherein the at least one action corresponding to the voice command is determined by: dynamically detecting an intent from the voice command (recognize a user’s speech as commands, see par. [0005]); 
and determining the at least one action corresponding to the voice command based on the intent (speech signals that indicate speech recognition results or speech commands are transmitted from the server to a target in operation 1309, and the target may perform a specific action corresponding to the received signal in operation 1311, see par. [0157]).
Regarding claim 13, Leppanen teaches the electronic device of claim 9, wherein the at least one processor, in order to generate the voice command by combining the first voice utterance and the at least one second voice utterance, is further configured to: recognize the first voice utterance and the at least one second voice utterance (selecting preferred frames based on their suitability for speech recognition, see par. [0030]); 
determine a confidence level to combine the first voice utterance with the at least one second voice utterance (identify a preferred frame for each portion of time based on their suitability for speech recognition, e.g., based on one or more of the frames' signal-to-noise ratios, amplitude levels, gain levels, or phoneme recognition levels, see par. [0031]); 
and combine the first voice utterance with the at least one second voice utterance based on the confidence level (combining preferred frames to generate hybrid sample, see par. [0030]).
Regarding claim 14, Leppanen teaches the electronic device of claim 13, wherein the at least one processor, in order to determine the confidence level to combine the first voice utterance with the at least one second voice utterance, is further configured to: determine confidence parameters associated with the first voice utterance and the at least one second voice utterance (A recognition confidence value corresponding to a confidence level that the corresponding text strings accurately reflect the content of the audio samples from which they were generated may then be determined for each of text string outputs 306, 308, and 310. Audio samples 300, 302, and 304, or their respective text string outputs 306, 308, and 310 may be ordered based on their respective recognition confidence values, and the audio sample or text string output corresponding to the greatest confidence level may be selected, see par. [0026]);
and determine the confidence level of the first voice utterance to combine with the at least one second voice utterance based on the confidence parameters (selecting preferred frames based on their suitability for speech recognition, and combining the preferred frames to form a hybrid sample, see par. [0030]).

Regarding claim 15, Lee teaches the electronic device of claim 9, wherein the at least one processor is further configured to: control the at least one second IoT device to present a message comprising the first voice utterance based on a distance between the user and the at least one second IoT device (server 23 may recognize a command by applying a weighted value according to distances between the speaker (U) and each of the terminals that have been calculated based on times of arrival when speech reaches each of the terminals from the speaker, see par. [0083]).

Claims 8 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Lee 2015/0279356 A1 in view of Leppanen U.S. PAP 2015/0228274, further in view of Aravamudan U.S. PAP 2014/0337370 A1.

Regarding claim 8, Lee in view of Leppanen does not teach the method of claim 7, wherein the message further comprises an inquiry message for inquiring whether to merge the first voice utterance and the at least one second voice utterance, further comprising: receiving a response for the inquiry message; and merging the first voice utterance and the at least one second voice utterance according to the response.
In the same field of endeavor, Aravamudan teaches systems and methods for selecting and presenting content items based on user input. The method includes receiving first input intended to identify a desired content item among content items associated with metadata, determining that an input portion has an importance measure exceeding a threshold, and providing feedback identifying the input portion. The method further includes receiving second input, and inferring user intent to alter or supplement the first input with the second input, see abstract. Some embodiments include a speech-based incremental input interface, in which the present systems and methods provide real-time feedback on user input as the user speaks. The present methods and systems enable a user experience similar to human interactions in which a listener responds to a query immediately or even before a user finishes a question, see par. [0020]. The user asks a follow up question, "which was the movie she acted that was directed by Terrence," and pauses to recall the director's full name (exchange 505). The present system determines that a portion of the input ("Terrence") has an importance measure that exceeds a threshold, based on the detected disfluency (e.g., based on the pause from the user). The present system further determines that the user intended to supplement the existing query, and determines an alternative query input that combines the term "Terrence" with the existing query (e.g., Jessica Chastain movie). The present system selects a subset of content items based on comparing the alternative query input and corresponding metadata. Upon discovering a strong unambiguous match (for example, both for the recognized input portion ("Terrence") and also for the selected subset of content items (e.g., "Tree of Life" movie)) that rates a response success rate above a threshold, the present system interrupts the user to present the subset of content items. 
For example, the present system interjects with "you mean Terrence Malick's 'Tree of Life'" (exchange 506), see par. [0047].
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the Lee in view of Leppanen invention with the teachings of Aravamudan for the benefit of inferring user intent to alter or supplement the first input with the second input, see Aravamudan, abstract.


Regarding claim 16, Lee in view of Leppanen does not teach the electronic device of claim 15, wherein the message further comprises an inquiry message for inquiring whether to merge the first voice utterance and the at least one second voice utterance, and wherein the at least one processor is further configured to: receive a response for the inquiry message; and merge the first voice utterance and the at least one second voice utterance according to the response.
In the same field of endeavor, Aravamudan teaches systems and methods for selecting and presenting content items based on user input. The method includes receiving first input intended to identify a desired content item among content items associated with metadata, determining that an input portion has an importance measure exceeding a threshold, and providing feedback identifying the input portion. The method further includes receiving second input, and inferring user intent to alter or supplement the first input with the second input, see abstract. Some embodiments include a speech-based incremental input interface, in which the present systems and methods provide real-time feedback on user input as the user speaks. The present methods and systems enable a user experience similar to human interactions in which a listener responds to a query immediately or even before a user finishes a question, see par. [0020]. The user asks a follow up question, "which was the movie she acted that was directed by Terrence," and pauses to recall the director's full name (exchange 505). The present system determines that a portion of the input ("Terrence") has an importance measure that exceeds a threshold, based on the detected disfluency (e.g., based on the pause from the user). The present system further determines that the user intended to supplement the existing query, and determines an alternative query input that combines the term "Terrence" with the existing query (e.g., Jessica Chastain movie). The present system selects a subset of content items based on comparing the alternative query input and corresponding metadata. Upon discovering a strong unambiguous match (for example, both for the recognized input portion ("Terrence") and also for the selected subset of content items (e.g., "Tree of Life" movie)) that rates a response success rate above a threshold, the present system interrupts the user to present the subset of content items. 
For example, the present system interjects with "you mean Terrence Malick's 'Tree of Life'" (exchange 506), see par. [0047].
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the Lee in view of Leppanen invention with the teachings of Aravamudan for the benefit of inferring user intent to alter or supplement the first input with the second input, see Aravamudan, abstract.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The pertinent prior art is listed on form PTO-892.
Barnett '420 teaches tools and techniques for implementing an IoT interface for human interaction with multiple devices, see abstract.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Ortiz-Sanchez, whose telephone number is (571)270-3711. The examiner can normally be reached Monday-Friday, 9AM-6PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MICHAEL ORTIZ-SANCHEZ/
Primary Examiner, Art Unit 2656