Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1, 4-6, 8-11, 14-16, and 18-22 are pending.  Claims 1, 11, and 22 are independent and have been amended.  Claim 21 depends from 1.  The dependent Claims are amended to replace “external apparatus” with “remote controller.”
Claim 22 has been allowed.  
This Application was published as U.S. 2019/0172460.
Earliest apparent priority 6 December 2017.
Claims 11-20 are method-claim equivalents of apparatus Claims 1-10.  
Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 9/30/2021 has been entered.  This is a second RCE.
Response to Amendments
Some of the sources of Objections to Claims 1 and 11 have been remedied by the amendments.  However, not all of the informalities have been addressed.
Response to Arguments
Summary: 
Claims 1 and 11 are amended to include “remote controllers.”  Claim 22 was found allowable because it uses the broadcast information from the TV.  Claims 1 and 11 recite no TV and no broadcast and thus do not use a “voice recognition parameter {that} includes source information identifying a source of the broadcast signal being received by the electronic apparatus and a state of the electronic apparatus.”
Discussion:
Allowed Claim 22 includes features that show that the “electronic apparatus” is not some generic server or relay device and is rather a TV (display, broadcast, broadcast source) and that the moving user (from remote to remote) is giving a command to the TV about the TV program being broadcast.  Claims 1 and 11 are generic in that respect.  (See Alexa, Ok Google, Cortana, Siri, etc.)
Figure 7 best depicts the main idea of the Claims:

[Image: Figure 7 of the instant Application]
 
Note also the following drawings of the instant Application where Figure 10, 
   
[Images: two drawings of the instant Application]
 

11.	A method of controlling an electronic apparatus, the method comprising:
receiving a first audio data from a first remote controller configured to control operation of the electronic device; 
establishing a session using account information and a voice recognition command list with a voice recognition server based on the first audio data;
identifying whether a user of the first remote controller and a user of a second remote controller, configured to control operation of the electronic apparatus, are a same user by comparing the first audio data received from the first remote controller with a second audio data received from a second remote controller, based on receiving the second audio data received from the second remote controller in a state where the session is established;
based on the user of the first remote controller and the user of the second remote controller being the same user:
maintaining the established session,
combining the second audio data with the first audio data,
transmitting the combined audio data to the voice recognition server re-using the account information and the voice recognition command list voice recognition parameter of the maintained session; and
receiving a first result data corresponding to the transmitted the combined audio data from the voice recognition server re-using the maintained session, and
based on the user of the first remote controller and the user of the second remote controller being the same user:
	blocking the established session and establish a new session,
transmitting the second audio data to the voice recognition server using the established new session with the voice recognition server, and
	receiving a second result data corresponding to the second audio data from the voice recognition server,
wherein the voice recognition command list includes a plurality of command corresponding to a plurality of functions provided by the electric apparatus.
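
(For illustration only, the branching recited in Claim 11 can be pictured with the following minimal sketch. It is not the Applicant's implementation. The names Session, same_user, and handle_second_audio are hypothetical, audio data is modeled as a (speaker label, text) pair, the speaker comparison is reduced to a placeholder equality check, and the second branch is read as the users not being the same, consistent with Applicant's own characterization at Response 16.)

```python
from dataclasses import dataclass, field
from itertools import count

_session_ids = count(1)  # hypothetical session-id generator


@dataclass
class Session:
    account_info: str
    command_list: tuple
    session_id: int = field(default_factory=lambda: next(_session_ids))
    is_open: bool = True


def same_user(first_audio, second_audio):
    # Placeholder comparison: audio is modeled as (speaker_label, text).
    return first_audio[0] == second_audio[0]


def handle_second_audio(session, first_audio, second_audio):
    """Return (session_to_use, audio_sent_to_server)."""
    if same_user(first_audio, second_audio):
        # Same user: maintain the session and send the combined audio,
        # re-using the session's account info and command list.
        combined = first_audio[1] + " " + second_audio[1]
        return session, combined
    # Different user: block (close) the session and establish a new one.
    # Assumption: the new session is created with the same account info
    # and command list; the claim does not say how it is populated.
    session.is_open = False
    new_session = Session(session.account_info, session.command_list)
    return new_session, second_audio[1]
```

In this sketch the related-art behavior described in paragraph [0004] corresponds to always taking the second branch, regardless of who is speaking.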

Applicant argues that Moniz and Mozer fail to teach identifying whether the user of the first external device and the second external device are the same user and maintaining the session if the user is the same or blocking the session and starting a new session if the users are not the same, and also fail to teach re-using of the account information and the command list of the maintained session.  Response 16.

Applicant appears to set forth the following distinct arguments:
1- References do not teach the “remote controllers” of the Claim.  Response 15-16.
2- Moniz and Kracun do not teach identifying if the user of the first and second device are “a same user” based on audio data.  Response 17-19.
3- Moniz does not teach blocking the established session and establishing a new session.  Response 17-19.

1. Remote Controllers:
First and second “remote controllers” were added by amendment.
Mozer already taught:  “[0024] … The electronic device may be small and light enough to be worn like jewelry or to be embedded in clothing, shoes, a cap or helmet, or some other form of headgear or bodily apparel. It can also contain functions of a vehicle, a navigation device, a clock, a radio, a remote control such as used for controlling a television set, etc….”

2. User Identification from the Voice of the User:
In Reply, in Moniz, the identity of the user can be determined by various methods including from his voice:  “One or more techniques may be used by the system to obtain the speaker ID associated with an utterance. In one technique, audio speaker identification may be performed, where audio data corresponding to the utterance may be compared to stored data corresponding to individual speakers. The system can then match the utterance audio data to the stored data (or some other data indicating how an individual speaker sounds in pitch, volume, speech rate, vocabulary, semantic structure, etc.) to determine who spoke the utterance and thus obtain the ID corresponding to that speaker. ….”  Moniz, Col. 22, line 45 to Col. 23, line 3.
The identity of the user is used as a parameter to maintain the continuity of sessions as the user moves from one room/device to another room/device, that is, from one Alexa device to another Alexa device.  See Figures 5A and 5B of Moniz.  The identification of the speaker through his voice is performed to assist in determining the intent of the command and for anaphora resolution.  The user asks in the first room, from the first Alexa device, "How old is the President?"  When the user then walks from Room 1 to Room 2 and asks from a second Alexa device "when was he sworn in?", the central server knows, based on the identity of the user, that the “he” in this second question refers to the same “President” in the first question.  See Moniz, Col. 21, lines 20-50.  Continuity is maintained between the session of the first Alexa device with the server and the session of the second Alexa device with the server.
In Kracun, which is directed to diarization of speech, the system determines whether the person who spoke sentence A is the same as the person who spoke sentence B or sentence C.  A diarization system is not concerned with the identity of the speakers but rather with which parts of a conversation were spoken by the same speaker.  Additionally, the sameness of the speakers is determined by comparing the audio.  Note Kracun:  “[0034] The diarization module 218 analyzes the audio data 212 and identifies the portions of the audio data spoken by different users….  The diarization models 234 may not be trained to identify speech from a particular person. The diarization module 218 applies the diarization models 234 to the audio data 212 to identify portions that are spoken by a common speaker even if the diarization model 234 does not include data for the same speaker. The diarization module 218 may identify patterns in portions spoken by the same person. For example, the diarization module 218 may identify portions with a common pitch.”  Kracun thus compares the pitch of different portions of the audio to determine whether the portions were spoken by the same person or by different persons, without identifying who the speakers are.  This teaches the “comparing” of audio data that is claimed.
Accordingly, while the speaker identification of Moniz teaches determining whether the two voices are from the same user, Kracun was added because it teaches the particular method of comparing the received audio:  “identifying whether a user of the first remote controller and a user of a second remote controller, configured to control operation of the electronic apparatus, are a same user by comparing the first audio data received from the first remote controller with a second audio data received from a second remote controller, based on receiving the second audio data received from the second remote controller in a state where the session is established.” 
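
(The distinction drawn above, between matching each utterance against stored speaker data as in Moniz and comparing portions of audio against each other as in Kracun, can be sketched as follows. This is a toy illustration under stated assumptions: each utterance is reduced to a single hypothetical mean-pitch value in Hz, and the function names, the tolerance, and the stored profiles are invented for the example and are not taken from either reference.)

```python
def identify_against_profiles(pitch_hz, profiles, tolerance_hz=15.0):
    """Profile matching: return the enrolled speaker ID nearest in pitch,
    or None if no enrolled speaker is within the tolerance."""
    best_id = min(profiles, key=lambda sid: abs(profiles[sid] - pitch_hz),
                  default=None)
    if best_id is None or abs(profiles[best_id] - pitch_hz) > tolerance_hz:
        return None
    return best_id


def same_speaker_direct(first_pitch_hz, second_pitch_hz, tolerance_hz=15.0):
    """Direct comparison: decide sameness from the two utterances alone,
    without any enrolled profile and without naming the speaker."""
    return abs(first_pitch_hz - second_pitch_hz) <= tolerance_hz


# Hypothetical enrolled profiles (mean pitch in Hz).
profiles = {"user_1": 120.0, "user_2": 210.0}

# Both utterances resolve to the same enrolled ID ...
same_by_id = (identify_against_profiles(118.0, profiles)
              == identify_against_profiles(124.0, profiles))
# ... and the direct comparison reaches the same answer with no profiles.
same_direct = same_speaker_direct(118.0, 124.0)
```

The second function works even for a speaker who was never enrolled, which is the property of diarization noted in the Kracun passage quoted above.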

3. Blocking and Maintaining the Session:
First, the passage of the Specification of the instant Application, that is cited by the Applicant, includes an admission of prior art:

[Image: passage of the Specification cited by the Applicant]

Response, 15.
Note also the following passage under the heading “Description of Related Art”:
[0004] That is, in the related art, in the case of a switching operation of attempting to recognize another external apparatus while receiving the audio data using the external apparatus, a session with the existing server is blocked and a new session is established. In this process, unnecessary processing time and waste of traffic for connecting the server occur.
(Published Application.)
This same sentence is repeated in the portion highlighted by the Applicant:  “[0141] … Here, blocking the session means that the session is closed (e.g., ended). Then, voice recognition was started with respect to the audio data received from the second external apparatus 200-2, a new session was established with the voice recognition server 300, and the voice recognition process was performed. That is, in the related art, when there is voice recognition switching from the first external apparatus 200-1 to the second external apparatus 200-2, the existing session was blocked and a new session was connected.”

This admission makes it conclusive that the limitation at issue, i.e. “based on the user of the first remote controller and the user of the second remote controller being the same user:  blocking the established session and establish a new session,” is taught by the prior art.

Further, even assuming there were no admission of prior art, the Specification would need to elaborate on the “blocking” and “establishing” and explain why and how these differ from the prior art, and it does not.

Finally, with respect to Moniz, the entire point of Moniz is to maintain continuity of the communication session when the user moves from device to device and room to room as shown in Figures 5A and 5B.  “For example, as shown in FIG. 5A, a user 15a in Room 1 speaks an utterance to device 110a and asks a question such as "How old is the President?" The system may then process the audio of the utterance, determine an answer, and send output audio data back to device 110a to respond "Barack Obama is fifty-five years old." The user may then walk from Room 1 to Room 2 (shown in FIG. 5B) and speak a new utterance to device 110b asking "when was he sworn in?" The system may be configured to recognize that the word "he" in the second utterance corresponds to the entity referred to in the first utterance.”  Col. 21, lines 32-43.  Nothing in the teachings of Moniz is contrary to “maintaining” the sessions.  Rather, they are directed to many different methods of maintaining continuity when the speaker changes device or even when two different speakers use the same device or different devices.  One of Moniz’s scenarios, i.e., Figures 5A and 5B, maps to the Claims.
Moniz has the speech controlled device (Alexa) directly communicating with the server or servers.  
Mozer is a 3-device reference that was cited for teaching an intermediary device such as a TV that has to receive the voice from the remote and transmit it to an ASR server.  “Systems and methods for improving the interaction between a user and a small electronic device such as a Bluetooth headset are described. The use of a voice user interface in electronic devices may be used. In one embodiment, recognition processing limitations of some devices are overcome by employing speech synthesizers and recognizers in series where one electronic device responds to simple audio commands and sends audio requests to a remote device with more significant recognition analysis capability….”  Mozer, Abstract.  
Moniz teaches continuity of communication between the input device and server when the speaker jumps from input device/ “first remote controller” to input device/ “second remote controller” and Mozer teaches an intermediary / “electronic apparatus” between the input devices / “first and second remote controllers” and the server.  
Voice control is taught by both Moniz and Mozer where either the speech controlled device (Alexa in Moniz) or the electronic device with access point (TV in Mozer) is controlled by the content of the input speech.
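
(The scenario of Figures 5A and 5B of Moniz discussed above, in which the context of the conversation follows the speaker rather than the device, can be sketched as follows. The class name, the pronoun list, and the simple word-substitution rule are assumptions made for illustration and do not reproduce the NLU components of Moniz.)

```python
class ConversationServer:
    """Toy server that remembers the last named entity per speaker ID."""

    PRONOUNS = {"he", "she", "it", "they"}

    def __init__(self, known_entities):
        self.known_entities = known_entities  # hypothetical entity list
        self.last_entity = {}                 # speaker_id -> entity

    def handle(self, speaker_id, device_id, text):
        # device_id is accepted but deliberately not used as the context
        # key, so the context survives a change of device.
        for entity in self.known_entities:
            if entity.lower() in text.lower():
                self.last_entity[speaker_id] = entity
                return text
        words = text.rstrip("?.!").split()
        resolved = [
            self.last_entity.get(speaker_id, w)
            if w.lower() in self.PRONOUNS else w
            for w in words
        ]
        return " ".join(resolved)


server = ConversationServer(["the President"])
server.handle("user_1", "110a", "How old is the President?")
second = server.handle("user_1", "110b", "when was he sworn in?")
```

Here the second utterance arrives from a different device but the same speaker ID, so the pronoun is resolved against the entity from the first utterance.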

Patentability of the other independent Claims is argued based on their similarity to Claim 1. Accordingly, the above provides a reply to those arguments as well.
Patentability of the dependent Claims is argued based on their dependence from their base independent Claims. Accordingly, the above provides a reply to those arguments as well.
Claim Objections
Claim 1 is objected to because of the following informality:
…
wherein the voice recognition command list includes a plurality of [[command]] commands corresponding to a plurality of functions provided by the electronic apparatus. 
	Appropriate correction is required.
	
Claim 11 is objected to because of informalities:
…
wherein the voice recognition command list includes a plurality of [[command]] commands corresponding to a plurality of functions provided by the electric apparatus.

	Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 11, 14-16, and 18-20 (depending from 11) are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Claim 11 includes the following language:
…
transmitting the first audio data from the first remote controller and the second audio data from the second remote controller to the voice recognition server re-using the account information and the voice recognition command list voice recognition parameter of the maintained session; and
	
It is not clear which is intended:
… re-using the account information and the voice recognition command list [[voice recognition parameter]] of the maintained session; and
OR
… re-using the account information and [[the voice recognition command list]] a voice recognition parameter of the maintained session; and
	
	There is no antecedent basis for a “voice recognition command list voice recognition parameter” in the Claim or the Specification.
	Dependent Claims inherit the indefiniteness and do not have language that might clarify the ambiguity.

(Note the following portions of the Published Application: [0069] The information about voice recognition may include, for example, and without limitation, at least one of: usage terms and conditions, account information, a network status, a voice recognition parameter, and a voice recognition command list.  [0070] The voice recognition parameter may include, for example, and without limitation, at least one of: currently input source information and an apparatus status. The voice recognition command list may include, for example, and without limitation, at least one of: application information used in the electronic apparatus 100, EPG data of a currently input source, and commands for functions provided by the electronic apparatus 100. The electronic apparatus 100 may maintain information about an existing session as it is, and thus stability of the session may be ensured.)
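
(The categories of information listed in paragraphs [0069] and [0070] can be laid out as a simple data structure, which also makes the two candidate readings of the disputed phrase easier to see. This is a sketch only: all class and field names are hypothetical, and the function reused_items merely illustrates the two alternatives set out above.)

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceRecognitionParameter:      # items listed in [0070]
    input_source: str
    apparatus_status: str


@dataclass(frozen=True)
class VoiceRecognitionCommandList:    # items listed in [0070]
    application_info: tuple
    epg_data: tuple
    function_commands: tuple


@dataclass(frozen=True)
class SessionInfo:                    # items listed in [0069]
    terms_and_conditions: str
    account_info: str
    network_status: str
    parameter: VoiceRecognitionParameter
    command_list: VoiceRecognitionCommandList


def reused_items(info, reading):
    """Return what is re-used under each candidate reading of Claim 11."""
    if reading == "command_list":
        return (info.account_info, info.command_list)
    if reading == "parameter":
        return (info.account_info, info.parameter)
    raise ValueError("unknown reading")
```

Because the parameter and the command list are distinct items in the Specification, the two readings re-use different data, which is why the run-together phrase in the Claim is ambiguous.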
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 11, 14-16, and 18-19 (depending from 11) and 1, 4-6, 8-9, and 21 (depending from 1) are rejected under 35 U.S.C. 103 as being unpatentable over Moniz (U.S. 10,482,885) (filed 16 November 2016) in view of Mozer (U.S. 2009/0204410) and further in view of Kracun (U.S. 2019/0115029).
Claims 1, 5-6, and 8-9 are device claims with limitations similar to the limitations of method Claims 11, 15-16, and 18-19 and are rejected under similar rationale.  The structural components are noted in the rejection of method claims to cover the device claims.
Moniz is directed to Amazon’s Alexa:

[Images: five figures of Moniz]
    Greyscale

Regarding Claim 11, Moniz:
11. A method of controlling an electronic apparatus, [Moniz is directed to “command execution” and the commands include the “control” of the “speech controlled devices 110” to perform the command:  “Depending on system configuration, a speech processing system may be capable of executing a number of different commands such as playing music, answering queries using an information source, opening communication connections, sending messages, shopping, etc….”  Col. 2, lines 19-39.  Figure 1A, 146:  “… The server 120 may then execute (146) a command corresponding to the second text using the first entity.”  Col. 6, lines 21-51.  Figure 9 shows that the hardware components of the “Server 120” / “electronic apparatus” include “communication circuitry” / “I/O Device Interfaces 902,” “Controllers/Processors 904” and “Memory 906.”  The “server 120” is not shown as having a display.]
the method comprising: [In the instant Application the “electronic apparatus” is a TV that is being controlled by commands given to two different remote control devices which are the “first and second remote controller” of the Claim.  The TV communicates with a speech recognition server.  In Moniz the device that is being controlled and the devices that get the spoken command are the same “speech controlled device 110” or the device that is controlled may be some other media player that Alexa controls.]
receiving a first audio data from a first remote controller; [Moniz, Figure 1A:  The “Server 120” receives the voice input/audio data from the “speech controlled device 110a”/ “first remote controller.”  Speech controlled devices of Moniz are not remote controllers per se but are at times used to control some other media player in the room and in that sense they become remote controllers:  “In addition, while not illustrated, each user profile 404 may include data regarding the locations of individual devices (including how close devices may be to each other in a home, if the device location is associated with a user bedroom, etc.), address data, or other such information. The user profile 404 may also link other devices that enable the system to track when other devices may be displaying/playing other media that may be consumed by a user using one device when an utterance is received by a different device.”  Col. 21, lines 10-20.  (Mozer is cited for teaching the “remote controllers” of the Claim.)  Figure 1A, also shows “First input audio 11a” which can be input after a “wakeword.”   “The wakeword detection module 220 works in conjunction with other components of the device 110, for example a microphone (not illustrated) to detect keywords in audio 11....”  Col. 10, lines 11-19.  “Thus, the wakeword detection module 220 may compare audio data to stored models or data to detect a wakeword…”  Col. 10, lines 56-57.  See Col. 23, lines 19-24, where the users begin with “Alexa” and continue with their command such as:  “Alexa, play some Weird Al.”  Further, the “speech controlled devices 110a, 110b” are communicating via the “networks 199” with the “Servers 120.”]
establishing a session using account information and a voice recognition command list with a voice recognition server based on the first audio data; [Moniz. The servers 120 perform speech recognition on the received audio data and also identify the speaker, including Speaker ID, which is mapped to “account information” of the Claim.  Figure 2, NLU storage 273 includes “Domain Grammar 276” which teaches the “voice recognition command list” of the Claim because it tells the system which phrases are recognized as commands in a particular Domain (calendar, email, navigation, playing music, etc.).  The communication is established based on the voice command/ “first audio data.”  In response to receiving a “wakeword” / “first audio data,” the Alexa device starts a communication session with the server.  As a part of this process the “speaker ID” / “account information” is also obtained and used and can identify the devices associated with the speaking user.  Figure 1A, 132: determine first audio data is associated with first speaker ID.  Figures 1A-1D, the “Speech controlled device 110a, 110b” establish a communication session with the “servers 120” that perform the “speech processing” functions including speech recognition:  Moniz claim 10,  “The server 120 may receive (130), from the first device 110a, first audio data corresponding to the first utterance. The server 120 may determine that the first audio data is associated with a first device ID (e.g., an ID associated with device 110a). The server 120 may also determine (132) that the first audio data is associated with a first speaker ID. For example, the server 120 may determine that the user 1 spoke the first utterance. This determination may be done by performing speaker identification on the first audio data to determine that the first audio data corresponds to user 1.”  Col. 5, lines 51-65.  
Figure 2 shows that the “Server(s) 120” includes the “Automatic Speech Recognition 250” and “NLU 260.”  If more than one “server” is included in the “Server(s) 120,” one server may relay the speech to another for voice recognition.  “The server(s) 120 (which may be one or more different physical devices) may be capable of performing traditional speech processing (e.g., ASR, NLU, command processing, etc.) as described herein. A single server 120 may perform all speech processing or multiple servers 120 may combine to perform all speech processing.”  Col. 5, lines 7-13.  “Once the wakeword is detected, the local device 110 may "wake" and begin transmitting audio data 111 corresponding to input audio 11 to the server(s) 120 for speech processing. Audio data corresponding to that audio may be sent to a server 120 for routing to a recipient device or may be sent to the server for speech processing for interpretation of the included speech (either for purposes of enabling voice-communications and/or for purposes of executing a command in the speech). The audio data 111 may include data corresponding to the wakeword, or the portion of the audio data corresponding to the wakeword may be removed by the local device 110 prior to sending.”  Col. 11, lines 16-27.  There is no intermediary device in Moniz.  (This Claim is written from the viewpoint of the TV of the instant Application which is an intermediary device between the devices that get the voice input and the server that does the speech recognition and other tasks.)  ]  
identifying whether a user of the first remote controller and a user of the second remote controller are a same user by comparing the first audio data received from the first remote controller with a second audio data received from a second remote controller, based on receiving the second audio data received from the second remote controller in a state where the session is established;  [Moniz “identifies whether a user of the first .. and … second … apparatus” are the same “based on … second audio data … from the second … apparatus … {during} … session” but does so by comparing the voices of the speakers to stored data of individual speakers and not to one another:  “One or more techniques may be used by the system to obtain the speaker ID associated with an utterance. In one technique, audio speaker identification may be performed, where audio data corresponding to the utterance may be compared to stored data corresponding to individual speakers. The system can then match the utterance audio data to the stored data (or some other data indicating how an individual speaker sounds in pitch, volume, speech rate, vocabulary, semantic structure, etc.) to determine who spoke the utterance and thus obtain the ID corresponding to that speaker.”  Col. 22, lines 45-55.  Figure 1A, the “Server 120” receives the second wakeword / part of the “second audio data” from the “speech controlled device 110b”/ “second remote controller” and this wakeword is associated with “Second input audio 11b”/ “second audio data.”  The session has been established.  The point of Moniz is not to lose continuity when the user moves from device to device and room to room as shown in Figures 5A and 5B.  As shown in various scenarios of Col. 23, the system (server) knows it is dealing with the same thread of speech inputs if either the same user ID (voice or wakeword indicating same person) or same user account (different devices in the same house) are used when the user moves from device to device.]
blocking the established session and establish a new session based on the user of the first remote controller and the user of the second remote controller being not the same user, [Moniz starts a new session when a user/speaker change is detected based on the input voice.  Col. 23, lines 35-48 teaches a scenario where both speaker identity (change in voice) and device receiving the command are taken into consideration in determining how to interpret the command.  Figure 5C shows two users using the same device and Figure 5D shows two users using two different devices, which is the scenario of this Claim.  Col. 21, line 62 to Col. 22, line 16.  “Further, following the question about the President, user 15a may begin a new conversation with the system using device 110a while user 25b continues the first conversation (about the President) with device 110b. The system may be configured to determine when certain conversations start or end so as to properly track entities and which anaphora are related to which entities.”  Col. 22, lines 9-16.]
the method further comprising:  
maintaining the established session based on the user of the first remote controller and the user of the second remote controller being the same user; [Moniz. The point of Moniz is not to lose continuity of NLP when the user moves from device to device and room to room as shown in Figures 5A and 5B where the devices change but the user is the same. See Col. 23, e.g.  “At a later point in time, a second speech-controlled device 110b may capture audio of a second spoken utterance (i.e., second input audio 11b) from first user 5a. The server 120 may receive (136), from the second device, second audio data corresponding to the second utterance. The server 120 may determine that the second audio data is associated with a second device ID (e.g., an ID associated with device 110b). The server 120 may also determine (138) that the second audio data is associated with the first speaker ID. This determination may also be done by performing speaker identification or may be performed using other techniques. The server 120 may process (140) the second audio data to determine second text (for example user ASR processing). The server 120 may then determine (142) that the second text includes a word corresponding to an entity, but the entity is not itself represented in the second audio data and therefore the word may constitute anaphora, exophora, or the like. This determination may be made using an NLU component such as a named entity recognition component 262 (discussed below) or other component. The server 120 may then determine (144), using the first speaker ID, the first and/or second device IDs and/or other information (such as the relative locations of devices 110a and 110b, the time between receipt of the first input audio data and second input audio data, or other information), that the word corresponds to the first entity from the first utterance. 
This may include determining that the first utterance and second utterance are part of the same conversation and thus the anaphora in the second utterance relates to the first utterance. The server 120 may then execute (146) a command corresponding to the second text using the first entity.”  Col. 6, lines 21-51.  (Moniz is silent regarding the communication related features.  However, so are the Claim and its supporting Specification. The Claim merely uses the term “maintaining” which is taught by the example of Figures 5A and 5B of Moniz.  Additionally, the supporting Specification includes no specifics regarding by what method the session is maintained.)]
transmitting the first audio data from the first remote controller and the second audio data from the second remote controller to the voice recognition server re-using the account information and the voice recognition command list voice recognition parameter of the maintained session; and [Moniz, The “account information” comes from the recognized user and his “User ID.”  The “voice recognition command list” comes from the “Domain grammar 276” of Figure 2 which is different for each “domain” and may be “personalized” for each User (Domain Lexicon 286 of Figure 2).  Col. 14, lines 15-30.  Figure 1A, the “speech controlled device 110b”/ “second remote controller” transmits the speech that it receives “Second input audio 11b” / “received audio data” to the “Server 120.”   Here, the server uses / “re-uses” the same User ID / “voice recognition parameter” of the first speaker to determine that it is the same user talking and also when the subject is about the same Domain, the same Grammar/ command list is used in the NLU processing:  “… The server 120 may determine that the second audio data is associated with a second device ID (e.g., an ID associated with device 110b). The server 120 may also determine (138) that the second audio data is associated with the first speaker ID….”  Col. 6, lines 21-51.  “In NLU processing, a domain may represent a discrete set of activities having a common theme, such as "shopping", "music", "calendaring", etc. As such, each domain may be associated with a particular recognizer 263, language model and/or grammar database (276a-276n), a particular set of intents/actions (278a-278n), and a particular personalized lexicon (286). Each gazetteer (284a-284n) may include domain-indexed lexical information associated with a particular user and/or device.”  Col. 14, lines 15-23.]
receiving a result data corresponding to the transmitted the first audio data and the second audio data from the voice recognition server re-using the maintained session, and [Moniz, Figure 1A, 134 and 140.  The ASR and NLU of the servers 120 operate on the audio data that is being received from the two “speech controlled devices 110a and 110b” in tandem and generate a result that is either executing a command or responding to a query such as “Alexa, where is the nearest Starbucks?” or “Alexa, play some Weird Al.”  Col. 23, lines 18-23.  The continuity of the sessions is maintained by “re-use” of the data associated with the user ID.  Additionally, the information regarding both the device and the user is stored in the user profile and is used for identifying the user and also executing the commands of the user.]
wherein the voice recognition command list includes a plurality of command corresponding to a plurality of functions provided by the electric apparatus. [Moniz refers to “commands” as “intents/actions” and each “intent domain” has a “word database” associated with it which can have a personalized lexicon for each user.  See Figure 2: “NLU storage 273” includes “Domain 1 intents 278a” and “Entity library 282” includes “Domain 1 Lexicon 286aa,” etc.  For example, “Play” is a command in the music domain:  “ … At this point in the process, "Play" is identified as a verb based on a word database associated with the music domain, which the IC module 264 will determine corresponds to the "play music" intent….”  Col. 15, lines 44-49.  “In addition, the entity library may include database entries about specific services on a specific device, either indexed by Device ID, Speaker ID, or Household ID, or some other indicator.”  Col. 14, lines 8-14.  “In NLU processing, a domain may represent a discrete set of activities having a common theme, such as "shopping", "music", "calendaring", etc. As such, each domain may be associated with a particular recognizer 263, language model and/or grammar database (276a-276n), a particular set of intents/actions (278a-278n), and a particular personalized lexicon (286)….”  Col. 14, lines 15-30.  For other commands see Col. 23, lines 15-25: Where is, Who is, Play.]

Moniz teaches that the voice input begins with the “wakeword” “Alexa” which teaches the “voice input start signal” of the Claim.  “"Alexa, where is the nearest Starbucks?"”  Col. 23, line 20.  See also Figure 2, “wakeword detection module 220” on the “device 110.”  However, the “voice input start signal” of the Claim appears to be just a Bluetooth signal from one device to another and is not a trigger or wake word.
Moniz shows two sets of devices: the speech controlled devices that the user talks to and the remote servers which perform the speech recognition.
Moniz teaches that more than one set of servers may be involved such that the received speech at one server is sent to another server with ASR capability.
However, the instant Application has the 3-player system where a remote control receives the speech and communicates it to the TV via a Bluetooth connection and the TV then communicates the message to the recognition server.
Accordingly, to teach the 3-player configuration of the Claim and an express teaching of a remote controller, a second reference is added.

Mozer, Figure 1:

[Figure images: media_image10.png and media_image11.png (greyscale)]

Mozer teaches:
11. A method of controlling an electronic apparatus, [Mozer, “[0003] …Other small electronic products such as television remote controls have become covered in buttons and capabilities that are overwhelming to non-technical users…”  “[0024] …The electronic device may be small and light enough to be worn like … or some other form of headgear or bodily apparel. It can also contain functions of a vehicle, a navigation device, a clock, a radio, a remote control such as used for controlling a television set, etc….”]
the method comprising: [Mozer, Figure 2, Claim viewed from viewpoint of the intermediary “Electronic device with access point 202” which is like the TV in Figure 1 of the instant Application.]
receiving a first audio data from a first remote controller; [Mozer, Figure 2, the “electronic device with voice user interface 201” teaches the “first remote controller” of the Claim and the voice input to this device is received at the “electronic device with access point 202” from whose point of view the Claim is drafted.  “[0024] … The electronic device may be small and light enough to be worn like jewelry or to be embedded in clothing, shoes, a cap or helmet, or some other form of headgear or bodily apparel. It can also contain functions of a vehicle, a navigation device, a clock, a radio, a remote control such as used for controlling a television set, etc. … Thus, the small electronic device associated with the first synthesizer and recognizer may contain a Bluetooth interface, a cell phone, an internet address, and the like.”  The “first voice input start signal” is “connection 209” “[0059] …In this example, connection 209 may be a Bluetooth wireless connection and connection 210 may be a cellular or 802.11 wireless connection….”  “[0105] …The connection to the remote units, which may be a radio frequency or infra-red signal, an ultrasonic device, Bluetooth connection, WiFI, Wimax, cable or other wired or wireless connection, allows the small electronic device to both control the operation of the remote units and to retrieve desired information from the remote units….”  Bluetooth sends a start/initiation signal.]
establishing a session using account information and a voice recognition command list with a voice recognition server based on the first audio data; [Mozer, Figure 2, the “electronic device with access point 202” is connected through the “network 203” with several “servers 204, 205, 206” all of which include a “speech recognition system 208, 214, 253.”  The “voice recognition parameter” is mapped to user identification (see Mozer Figure 4, 406 and [0087]) and user identification is a parameter that is taken into account in the establishing of the communication session.]
identifying whether a user of the first remote controller and a user of the second remote controller are a same user by comparing the first audio data received from the first remote controller with a second audio data received from a second remote controller, based on receiving the second audio data received from the second remote controller in a state where the session is established; [Mozer, Figures 2 and 3. The “Bluetooth headset 301” which is the same as the “Electronic Device with Voice User Interface 201” and is also “the headset 301, coupled through the Bluetooth network 326 to cell phone 302”, [0071] and keeps providing input voice of the user to the “cellular phone 302”/ “electronic apparatus” of the Claim.  This is not a “second” apparatus. Figure 4, 406.  At 406 context information corresponding to the input command is utilized in executing the command and this context information includes speaker/user identification.  “[0087] Other context information may be the identification of the user or identification of the electronic device. ….”  Continuity of identity (same person issuing a command) would translate into continuity of execution.]
blocking the established session and establish a new session based on the  user of the first remote controller and the user of the second remote controller being not the same user, [In Mozer each user has his own device.  So, a change of device means a change of user, and the scenario of this limitation is taught by Mozer.]
the method further comprising:  
maintaining the established session based on the user of the first remote controller and the user of the second remote controller being the same user; [Mozer, Figures 2 and 3. As long as speech is coming the Bluetooth session is maintained.  The features of establishing a session and maintaining the session and “input start signal” pertain to the communication aspects and some such as the “input start signal” are inherent in the operation of Bluetooth.  But Mozer does not involve going from one user device to another and therefore this limitation is not taught by Mozer.]
transmitting the first audio data from the first remote controller and the second audio data from the second remote controller to the voice recognition server re-using the account information and the voice recognition command list voice recognition parameter of the maintained session;  [Mozer, Figures 1, 2, and 3, the speech coming from the remote/remote controller is sent to the server for recognition.]
receiving a result data corresponding to the transmitted the first audio data and the second audio data from the voice recognition server re-using the maintained session, and [Mozer, Figure 3, “recognizer 319” in the “Cellular phone 302.”  Figure 4, 406 and [0087], the parameters pertaining to a particular user are used to optimize the voice recognition and also include his preferred voice synthesis: goal is to have a consistent interface thus “re-using” the data.  “[0069] … In one embodiment, the remaining utterances ("John Smith cell") may be sent to a recognizer 319 on the cellular phone. Recognizer 319 may be optional for cellular phone 302. Recognizer 319 may be used to recognize the utterances in the context of contact information 322 stored within the cellular phone 302….”  Speech recognition of Mozer is for command execution; the “result” is the executed command which may be providing information such as a phone number.  “[0051] During a voice user interface session, a user may speak to device 101, and the speech input may include one or more utterances (e.g., words or phrases) which may comprise a verbal request made by the user. The speech is converted into digital form and processed by recognizer 104. Recognizer 104 may be programmed with a recognition sets corresponding to commands to be performed by the voice interface (e.g., Command 1, . . . Command N). In one embodiment, the initial recognition set includes only one utterance (i.e., the initiation word or phrase), and the recognition set is reconfigured with a new set of utterances corresponding to different commands after the initiation utterance is recognized. For example, recognizer 104 may include utterances in the recognition set to recognize commands such as "Turn Up Speaker", "Turn Down Speaker", "Establish Bluetooth Connnection", "Dial Mary", or "Search Restaurants". 
The recognizer 104 may recognize the user's input speech and output a command to execute the desired function….For example, recognizer 104 may recognize the utterance "search" as one of the commands in the recognition set and notify the controller 103 that the command "search" has been recognized. Program 109 running on controller 103 may instruct the controller to send the remainder of the verbal request (i.e. "Bob's Restaurant in Fremont") to a remote electronic device 106 through transceiver 118 and communication medium 110. Electronic device 106 may utilize a more sophisticated recognizer 114 which may recognize the remainder of the request. Electronic device 106 may execute the request and return to the voice user interface data which may be converted to speech by speech synthesizer 102. The speech may comprise a result and/or a further prompt. For example, speaker 111 may output, "Bob's Restaurant is located at 122 Odessa Blvd. Would you like their phone number?"”]
wherein the voice recognition command list includes a plurality of command corresponding to a plurality of functions provided by the electric apparatus. [Mozer, Figure 4, 406.  At 406 context information corresponding to the input command is utilized in executing the command and this context information includes speaker/user identification. Continuity of identity (same person issuing a command) would translate into continuity of execution.]

Moniz and Mozer pertain to receiving the speech at a device with low or no speech recognition capability and forwarding the speech to a device with higher processing power and recognition capability and it would have been obvious to modify the one step hop of Moniz with the two step hop of Mozer considering that Moniz mentions having several servers including some working as intermediates and Mozer teaches that its configuration can pertain to a remote control contacting a television set contacting a speech recognition server.  (Mozer:  “[0024] … One embodiment of the present invention includes systems and methods for two or more speech synthesis and/or recognition devices to operate in series. A first synthesizer and recognizer, in a small electronic device, may provide both a first voice user interface and communication with the second, or third, etc., remote speech synthesizers and/or recognizers. In this document, the term "remote" refers to any electronic device that is not in physical contact with (or physically part of) the small electronic device. This communication may be, for example, through a Bluetooth interface, a cell phone network, the Internet, radio frequency (RF) waves, or any other wired or wireless network or some combination thereof or the like….”)  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

The feature of determining that the user has changed by comparing voices together (as opposed to comparing the voice with a stored profile) is not taught by Moniz-Mozer.
Kracun teaches:
identifying whether a user of the first remote controller and a user of the second remote controller are a same user by comparing the first audio data received from the first remote controller with a second audio data received from a second remote controller, based on receiving the second audio data received from the second remote controller in a state where the session is established; [Kracun, Figure 2, “Diarization Module 218,” and the diarization of the speech between Speaker 1 and Speaker 2 at 228, 230, 232 in “audio data 236.”  “[0005] … A speaker diarization module may analyze the portion of the audio data that includes the hotword to identify characteristics of the user's speech and identify subsequently received audio data that includes speech from the same user. The speaker diarization module may analyze other subsequently received speech audio and identify audio portions where the speaker is not the same speaker as the hotword speaker. …”  See [0034]-[0035] as well.  “[0034] The diarization module 218 analyzes the audio data 212 and identifies the portions of the audio data spoken by different users….  The diarization models 234 may not be trained to identify speech from a particular person. The diarization module 218 applies the diarization models 234 to the audio data 212 to identify portions that are spoken by a common speaker even if the diarization model 234 does not include data for the same speaker. The diarization module 218 may identify patterns in portions spoken by the same person. For example, the diarization module 218 may identify portions with a common pitch.”  “Pitch” is a characteristic of “audio.”  Then the “audio editor 238” removes the portions of the audio spoken by other users such as speaker 2.  See [0037].  In [0038], Kracun also teaches “stitching together audio data 246 and audio data 250” which teaches the “combining” of the audio data in the Claim.]
Moniz/Mozer and Kracun pertain to speech operated devices (Moniz to Amazon Alexa and Kracun to Ok Google) and it would have been obvious to combine the diarization of Kracun, which determines by comparison of voices (without identifying a speaker) whether the same speaker is still talking or it is someone else’s speech, with the system of the combination to determine whether the same first speaker is still talking or not in the situation where pre-stored speech profiles are not available.  (Moniz:  “ … The server 120 may also determine (138) that the second audio data is associated with the first speaker ID. This determination may also be done by performing speaker identification or may be performed using other techniques.…”  Col. 6, lines 21-51.)  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
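
For illustration only, the voice-to-voice comparison discussed above (comparing the first audio data with the second audio data rather than with a stored profile) can be sketched as follows.  This is a minimal sketch under stated assumptions and is not the implementation of Kracun, Moniz, Mozer, or the instant Application: the pitch estimate by zero-crossing counting, the 20 Hz tolerance, the synthetic tones standing in for speech, the dictionary layout of a session, and all function names are hypothetical.

```python
import math

SAMPLE_RATE = 16000  # assumed sampling rate in Hz (hypothetical)


def synth_tone(freq_hz, duration_s=0.5):
    """Stand-in for recorded speech: a pure tone at the given pitch."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def estimate_pitch_hz(samples):
    """Rough pitch estimate: upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * SAMPLE_RATE / len(samples)


def same_speaker(first_audio, second_audio, tol_hz=20.0):
    """Compare the two audio inputs to each other, not to a stored profile."""
    diff = abs(estimate_pitch_hz(first_audio) - estimate_pitch_hz(second_audio))
    return diff <= tol_hz


def on_second_audio(session, first_audio, second_audio):
    """Maintain the session for the same speaker; otherwise start a new one."""
    if same_speaker(first_audio, second_audio):
        return session  # session, account, and command list are re-used
    # Different speaker: the old session is dropped and a fresh one is made.
    return {
        "session_id": session["session_id"] + 1,
        "account": None,  # assumption: the new user is not yet identified
        "command_list": list(session["command_list"]),
    }
```

In this sketch a second utterance with a similar pitch returns the same session object, while a clearly different pitch yields a session with a new identifier.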

Moniz, Figures 4 and 6:

[Figure images: media_image12.png and media_image13.png (greyscale)]


Regarding Claim 14, Moniz teaches and suggests:
14. The method as claimed in claim 11, wherein the identifying of whether to maintain the established session includes identifying whether the user of the first remote controller and the user of the second remote controller are the same user by
comparing ID information of the first remote controller with ID information of the second remote controller. [Moniz, Figure 4, the “user profile storage 402” which is used for speaker identification includes the “Device ID” associated with the account of a particular speaker.  Additionally, the speaker identification is done based on speaker profile which includes “Device ID.” “…  For illustration, as shown in FIG. 4, the user profile storage 402 may include data regarding the devices associated with particular individual user accounts 404. In an example, the user profile storage 402 is a cloud-based storage. Each user profile 404 may include data such as device identifier (ID) data, speaker identifier (ID) data, voice profiles for users, internet protocol (IP) address data, name of device data, and location of device data for different devices. In addition, while not illustrated, each user profile 404 may include data regarding the locations of individual devices (including how close devices may be to each other in a home, if the device location is associated with a user bedroom, etc.), address data, or other such information….”  Col. 20, line 63 to Col. 21, line 20.]

Regarding Claim 15, Moniz teaches:
15. The method as claimed in claim 11, 
wherein the establishing of the session with the voice recognition server includes establishing a session using information about voice recognition of the electronic apparatus, and [Moniz continues the session and uses the information obtained from the first round of speech recognition, for example, the identity of Barack Obama as the subject of the question, for the continued session when the speaker asks the second question from the second device. Col. 4, lines 14-52. See rejection of Claim 1 and the example of anaphora resolution.  See also the use of other types of “information about the voice recognition” in “… The server 120 may then determine (144), using the first speaker ID, the first and/or second device IDs and/or other information (such as the relative locations of devices 110a and 110b, the time between receipt of the first input audio data and second input audio data, or other information), that the word corresponds to the first entity from the first utterance. This may include determining that the first utterance and second utterance are part of the same conversation and thus the anaphora in the second utterance relates to the first utterance. The server 120 may then execute (146) a command corresponding to the second text using the first entity.”  Col. 6, lines 21-51.]
wherein the maintaining of the established session includes maintaining the information about voice recognition, based on the second audio data being received from the second remote controller, and maintaining the established session. [Moniz, see the example of Barack Obama as the subject of the question for the continued session when the speaker asks the second question from the second device. Col. 4, lines 14-52.  The system uses the information from the first part, that “President” refers to Obama, to determine the response to the second part.]
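
For illustration only, the carrying over of information from the first utterance to the second (the Barack Obama example above) can be sketched as follows.  This is a minimal sketch under stated assumptions and is not Moniz’s implementation: the class name, the pronoun list, the 60-second gap limit, and the word-by-word substitution are all hypothetical simplifications of anaphora resolution.

```python
PRONOUNS = {"he", "she", "it", "they"}  # hypothetical, deliberately small list


class ConversationContext:
    """Keeps the last entity per speaker so a later utterance can re-use it."""

    def __init__(self, max_gap_s=60.0):
        self.max_gap_s = max_gap_s  # assumed limit on time between utterances
        self._last = {}  # speaker_id -> (entity, timestamp)

    def remember(self, speaker_id, entity, timestamp):
        """Store the entity resolved from the first utterance."""
        self._last[speaker_id] = (entity, timestamp)

    def resolve(self, speaker_id, text, timestamp):
        """Replace pronouns with the stored entity when the context applies."""
        entry = self._last.get(speaker_id)
        if entry is None:
            return text  # no earlier utterance from this speaker
        entity, seen_at = entry
        if timestamp - seen_at > self.max_gap_s:
            return text  # too much time has passed; treat as a new conversation
        words = [entity if w.lower() in PRONOUNS else w for w in text.split()]
        return " ".join(words)


ctx = ConversationContext()
ctx.remember("speaker-1", "Barack Obama", timestamp=0.0)
resolved = ctx.resolve("speaker-1", "When was he sworn in?", timestamp=10.0)
```

Here the device that captured each utterance plays no part in the lookup; only the speaker identity and the elapsed time do, which mirrors the point that the second question may arrive from a second device.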

Regarding Claim 16, Moniz teaches:
16. The method as claimed in claim 15, wherein the information about voice recognition includes at least one of: usage terms and conditions, account information, a network status and a voice recognition command list. [Moniz uses a variety of methods for speaker identification:  “One or more techniques may be used by the system to obtain the speaker ID associated with an utterance. In one technique, audio speaker identification may be performed, where audio data corresponding to the utterance may be compared to stored data corresponding to individual speakers. The system can then match the utterance audio data to the stored data (or some other data indicating how an individual speaker sounds in pitch, volume, speech rate, vocabulary, semantic structure, etc.) to determine who spoke the utterance and thus obtain the ID corresponding to that speaker. ….”  Col. 22, line 45 to Col. 23, line 3.  “Other techniques of identifying the user may include use of visual information (for example facial recognition using a camera communicable with the system 100), identifying the user based on a unique passphrase or wakeword uttered by the user, identifying the user based on an email address or other account information linked to the input to the system (which may not necessarily be voiced based) or the like.”  Col. 6, lines 3-10.]

Regarding Claim 18, Moniz teaches:
18. The method as claimed in claim 16, wherein the voice recognition command list includes at least one of: application information used in the electronic apparatus, EPG data of the broadcast, and a command for a function provided by the electronic apparatus. [Moniz teaches voice recognition is performed in order to execute a command:  “… Once the entity is determined, the system may then complete command processing of the utterance using the identified entity.”  Abstract.  Figure 2, “NLU Module 260” detects the “commands” such as “call” in “call Mom” or “play” in “play music” by parsing and tagging the text obtained from speech.  The “electronic apparatus” in this context could be one of the “servers 120” that may be contacted by the Alexa devices for responding to a particular query or one of the components that executes the commands:  “…. Once the entity is determined, the system may then complete command processing of the utterance using the identified entity.”  “A speech processing system may be configured as a relatively self-contained system where one device captures audio, performs speech processing, and executes a command corresponding to the input speech. Alternatively, a speech processing system may be configured as a distributed system where a number of different devices combine to capture audio of a spoken utterance, perform speech processing, and execute a command corresponding to the utterance. Although the present application describes a distributed system, the teachings of the present application may apply to any system configuration.”   Col. 2, lines 8-18.  “Depending on system configuration, a speech processing system may be capable of executing a number of different commands such as playing music, answering queries using an information source, opening communication connections, sending messages, shopping, etc.”  Col. 2, lines 19-23.  “The output data from the NLU processing (which may include tagged text, commands, etc.) 
may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The system 100 may include more than one command processor 290, and the destination command processor 290 may be determined based on the NLU output. For example, if the NLU output includes a command to play music, the destination command processor 290 may be a music playing application, such as one located on device 110 or in a music playing appliance, configured to execute a music playing command. If the NLU output includes a search utterance (e.g., requesting the return of search results), the command processor 290 selected may include a search engine processor, such as one located on a search server, configured to execute a search command and determine search results, which may include output text data to be processed by a TTS engine and output from a device as synthesized speech.”  Col. 16, lines 16-34.]

Regarding Claim 19, Moniz teaches:
19. The method as claimed in claim 11, further comprising: 
storing the first audio data received from the first remote controller; [Moniz, Figures 8 and 9 show the “Device 110” and “server 120,” respectively including “memory 806/906” and “storage 809/909.”]
transmitting the first audio data to the voice recognition server using the established session; [Moniz, see rejection of Claim 11.]
combining second audio data received from the second remote controller with the stored first audio data; and 
transmitting the combined audio data to the voice recognition server. 
Moniz does not teach the details of storing and combining for transmitting of Claim 19.
Mozer teaches:
transmitting the first audio data to the voice recognition server using the established session; [Mozer, Figure 2, speech input to the “electronic device with voice user interface 201” / “user device” is first sent to the “electronic device with access point 202” / the “electronic apparatus” of the Claims, which is connected through the “network 203” with several “servers 204, 205, 206,” all of which include a “speech recognition system 208, 214, 253.”]
combining second audio data received from the second remote controller with the stored first audio data; and 
transmitting the combined audio data to the voice recognition server. [Mozer, in Figures 1, 2, and 3, the speech coming from the remote/remote controller and received at the central device 202 is sent to the server(s) 204, 205, 206 for recognition.]
	The combination of Moniz and Mozer is warranted under the rationale provided for Claim 11 (Claim 1).  Moniz sends the speech directly to a server with recognition capabilities.  Mozer uses an intermediary which permits Mozer to select the most appropriate speech recognition server for the particular task.
	Neither reference expressly teaches “combining of audio data” before transmitting to the recognition server.
	Kracun teaches:
combining second audio data received from the second remote controller with the stored first audio data; and  [Kracun, Figure 2, 244.  Removing the “Speaker 2” portion 248 of the input audio and stitching/combining the portions 246 and 250 of “Speaker 1” speech.  “[0038] … In this instance, the audio editor 238 generates audio data 244 by stitching together audio data 246 and audio data 250.”  See also [0037] and [0050].]
transmitting the combined audio data to the voice recognition server. [Kracun may use a server for speech recognition.  “[0004] . . .  Accordingly, utterances directed at the system… can be speech recognized, parsed and acted on by the system, either alone or in conjunction with the server via the network.”  “[0036] In some implementations, the diarization module 218 may process audio data that the system 200 is going to transmit to another computing device, such as a server or mobile phone.”]
Moniz/Mozer and Kracun pertain to speech controlled devices for executing a command and it would have been obvious to combine the audio stitching/combining of Kracun with the combination for the same reason that Kracun stitches/combines related (from the same speaker) portions of speech before further processing (recognition, e.g.) to provide proper context for a speech recognition server.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
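
For illustration only, the stitching/combining of same-speaker portions before transmission, as discussed above with respect to Kracun [0038], can be sketched as follows.  This is a minimal sketch under stated assumptions and is not Kracun’s implementation: the segment representation as (speaker label, samples) pairs, the speaker labels, and the function name are hypothetical, and the speaker labels are assumed to have already been produced by a diarization step.

```python
def stitch_speaker_audio(segments, keep_speaker):
    """Drop segments from other speakers and join the rest in their original order."""
    combined = []
    for speaker, samples in segments:
        if speaker == keep_speaker:
            combined.extend(samples)
    return combined


# Hypothetical input: two portions from Speaker 1 with a Speaker 2 portion between.
segments = [
    ("speaker1", [1, 2]),
    ("speaker2", [9, 9]),
    ("speaker1", [3, 4]),
]
combined_audio = stitch_speaker_audio(segments, "speaker1")
```

The combined list is what an intermediary would forward to a recognition server in this sketch, so that the server receives only the related portions in sequence.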

Claims 1, 4-6, and 8-9 are device claims with limitations similar to the limitations of method Claims 11, 14-16, and 19 and are rejected under similar rationale.  

Regarding Claim 21, Moniz teaches:
21.	The electronic apparatus as claimed in claim 1, wherein the apparatus is further configured so that:
a user utters a spoken command to the first remote controller followed by the second remote controller; [Moniz, Figures 5A and 5B “User 1, 5a” moving from “Device 110a” to “Device 110b.”]
the spoken command corresponds to the first audio data received from the first remote controller and the second audio data received from the second remote controller; [Moniz, Figures 5A and 5B scenario.  Col. 21, lines 20-48.  The second command “When was he sworn in?” corresponds to both the first audio data, “How old is the President?”, and the second audio data because without the “first audio data” that was input in Figure 5A to the first “Device 110a” the system would not know who the user is asking about.]
the electronic apparatus further comprising a memory configured to store the first audio data received from the first remote controller; and [Moniz, Figure 1A, the “server 120” has memory.  Figure 9, “memory 906.”]
the processor is configured to 
transmit the first audio data to the voice recognition server using the established session, [There is no intermediary device in Moniz.  The Alexa/ Speech controlled devices send the speech to recognition server 120.  (This Claim is written from the viewpoint of the TV of the instant Application which is an intermediary device between the devices that get the voice input and the server that does the speech recognition and other tasks.)]
combine the second audio data received from the second remote controller with the stored first audio data, and 
transmit the combined audio data to the voice recognition server.

Claim 1 (or 11) is written from the viewpoint of the TV of the instant Application which is an intermediary device between the devices that get the voice input and the server that does the speech recognition and other tasks.  Moniz does not have an intermediary device to get the speech from the user devices and then send to a speech recognition server.
Mozer teaches:
transmit the first audio data to the voice recognition server using the established session, [Mozer, Figure 2, speech input to the “electronic device with voice user interface 201” / “user device” is first sent to the “electronic device with access point 202” / the “electronic apparatus” of the Claims, which is connected through the “network 203” with several “servers 204, 205, 206,” all of which include a “speech recognition system 208, 214, 253.”]
combine the second audio data received from the second remote controller with the stored first audio data, and 
transmit the combined audio data to the voice recognition server. [Mozer, in Figures 1, 2, and 3, the speech coming from the remote/remote controller and received at the central device 202 is sent to the server(s) 204, 205, 206 for recognition.]
	The combination of Moniz and Mozer is warranted under the rationale provided for Claim 11 (Claim 1).  Moniz sends the speech directly to a server with recognition capabilities.  Mozer uses an intermediary which permits Mozer to select the most appropriate speech recognition server for the particular task.
	Neither reference expressly teaches “combining of audio data” before transmitting to the recognition server.
	Kracun teaches:
combine the second audio data received from the second remote controller with the stored first audio data, and [Kracun, Figure 2, 244.  Removing the “Speaker 2” portion 248 of the input audio and stitching/combining the portions 246 and 250 of “Speaker 1” speech.  “[0038] … In this instance, the audio editor 238 generates audio data 244 by stitching together audio data 246 and audio data 250.”  See also [0037] and [0050].]
transmit the combined audio data to the voice recognition server. [Kracun may use a server for speech recognition.  “[0004] . . .  Accordingly, utterances directed at the system… can be speech recognized, parsed and acted on by the system, either alone or in conjunction with the server via the network.”  “[0036] In some implementations, the diarization module 218 may process audio data that the system 200 is going to transmit to another computing device, such as a server or mobile phone.”]
Moniz/Mozer and Kracun pertain to speech controlled devices for executing a command and it would have been obvious to combine the audio stitching/combining of Kracun with the combination for the same reason that Kracun stitches/combines related (from the same speaker) portions of speech before further processing (recognition, e.g.) to provide proper context for a speech recognition server.  This combination falls under combining prior art elements according to known methods to yield predictable results or use of known technique to improve similar devices (methods, or products) in the same way. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claims 20 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Moniz and Mozer and Kracun and further in view of An (U.S. 20180182396).

An, Figure 1: (image not reproduced)

Regarding Claim 20, Moniz and Mozer and Kracun teach the existence of a display.  In Moniz, the “Server 120,” which is the electronic device of Claim 11, does not include a display, but “Device 110” of Figure 8 includes a “Display 109.”  Kracun does not teach displaying the progress of the recognition process.
Regarding Claim 20, An teaches:
20. The method as claimed in claim 11, further comprising: 
displaying information about a progress of voice recognition based on the first audio data being received from the first remote controller. [An, Figures 4 and 6-8 showing the display which includes the waveform of the speech and also includes the recognized speech and its corresponding speaker.  (Background of An:  “[0005] … wherein the control unit controls the display unit to display the converted text file in the form of time-series dialog information between a plurality of speakers classified on the basis of the speaker information.”)]
Moniz, Mozer, Kracun and An pertain to receiving the speech at a device and it would have been obvious to combine the display of information about speech recognition from An with the system of the combination to provide for a visual indicator of the process and results.  This combination falls under combining prior art elements according to known methods to yield predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 10 is a device Claim with limitations similar to the limitations of Claim 20 and is rejected under similar rationale.
Allowable Subject Matter
Claim 22 is allowed.
The following is an examiner’s statement of reasons for allowance: In view of each of the particular limitations of the independent Claims when considered in the order established by the Claim language and in the context of the language of the independent Claims when each Claim is considered as a whole, the independent Claims of this Application were not found in the prior art that was viewed.
In particular, the structure of Claim 22 comports with the Disclosure, which shows that the electronic device is a television set that is receiving broadcast images (display) and that the first and second devices are remote control devices for the control of the television set.  The “voice recognition parameter” that is used to establish the “communication session” with the “recognition server,” and that is also used to maintain the same session or to terminate it and start a new session, is defined in the Claim as: “wherein the voice recognition parameter includes source information identifying a source of the broadcast signal being received by the electronic apparatus and a state of the electronic apparatus.”  This definition connects the speech recognition aspect of the Claim with the “source of the broadcast signal” whose content is being displayed.  This feature, in the context of the Claim as a whole and together with other features such as comparing the voices to each other (as opposed to a stored profile) and combining the received audio for sending to the recognition server, was not found in the art.  The user/speaker is talking about a particular television program (as opposed to his mother or his wife), and the content of his speech, the fact that the content pertains to the TV program that he is watching, and the fact that the voice has not changed together cause the continuity of the session.

Support for the definition of the “voice recognition parameter including source information identifying a source of the broadcast signal” is found in the published Application at “[0101] The tuner 150 may receive broadcast signals from various sources such as terrestrial broadcast, cable broadcast, satellite broadcast, or the like. The tuner 150 may receive broadcast signals from sources such as analog broadcast or digital broadcast from various sources.”  “[0182] Here, the voice recognition parameter may include at least one of currently input source information and an apparatus status.”  “[0183] Also, the voice recognition command list may include at least one of application information used in the electronic apparatus 100, EPG data of a currently input source, and commands for functions provided by the electronic apparatus 100.”
Support for the “voice recognition parameter including … a state of the electronic apparatus” is found at “[0173] Referring to FIG. 10, when the electronic apparatus 100 determines that a new apparatus is connected to the electronic apparatus 100, the electronic apparatus 100 may determine whether the apparatus is in a standby state to receive a voice recognition result (S1005). [0174] Here, when the apparatus is not in the standby state to receive the voice recognition result, the electronic apparatus 100 may immediately request the voice recognition server 300 to stop voice recognition (S1020).”  “[0184] The control method of the electronic apparatus 100 may further include a step (S1115) of receiving the voice input start signal from the second external apparatus 200-2 in a state where the session is established, a step (S1120) of maintaining the established session, and a step (S1125) of processing voice recognition of audio data received from the second external apparatus 200-2 using the maintained session.”

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Close Art of Record
In addition to the art applied to the Claims during the prosecution of the instant Application, note Edwards (U.S. 2019/0311721), which is mapped below to the limitations of Claim 22 to the degree that it applies.  

22.	An electronic apparatus comprising: (The TV 100 of Figure 1 of Jin is intended.)
a display; [Edwards, Figure 1, “controllable device/service 110, 112, 114” can be a TV.  See [0037]-[0038].]
a communicator comprising communication circuitry configured to communicate with a voice recognition server; and [Edwards, Figure 1, the “controllable devices 110, 112, 114” are in communication with the “hub 108” and in Figures 3A-3D, the “hub 310” is in communication with a “recognition engine 318.”  Figures 7 and 8 show the structure of the user device or Hub 702 which includes a “communication interface 719” and an antenna 722 and network entity 890 which can be a remote server and also includes a “communication interface 892.” ]
a processor configured to: [Edwards, Figure 7, “processor 718.” ]
control the communicator to receive a broadcast signal from an external source for display on the display by the electronic apparatus; [Edwards, Figure 1, the “controllable devices” or even the “hub 108” may be an Apple TV, which receives streaming media.  “[0042] Siri® may be available on many Apple Inc. devices and/or may be employed as an interaction modality, for example, of the Apple TV® (e.g., 4th generation) device. In the case of the Apple TV®, Siri® may allow control over media playback via voice commands (e.g., "Watch Fast and Furious 7") provided to a remote-control handset which may include internal microphones. Further, the Apple TV® device may serve as a smart home hub for devices using Apple's HomeKit® framework….”]
control the communicator to receive a first voice input start signal and first audio data from a first remote controller configured to control operation of the electronic apparatus; [Edwards, Figure 1, the “user devices 102, 104, 106” can be a “remote control handset” or a “mobile phone” [0042] and [0056].]
control the communicator to establish a session via a voice recognition parameter with the voice recognition server, based on the first voice input start signal received from the first remote controller; [Edwards, Figure 3D, the “hub 310” (which can be the Apple TV as per [0042]) is in communication with a “recognition engine 318.”  Figure 2: “[0060] … For example, the voice input module 202 may be included in the user device, 208, while the speech recognition module 204 may be implemented using a networked service….”  Edwards uses Siri, where “Siri” is the “voice input start signal” of the Claim.  [0034] and Figure 10.]
store the first audio data received from the first remote controller; [Edwards, Figure 2, “speech profile storage 206.”  “[0059] FIG. 2 illustrates a block diagram of an example system 200 for generating and/or tuning a speech profile of a user of a user device (e.g., of the user of the example user device 102 of FIG. 1) …”  “[0060] System 200 may include a voice input module 202, a recognition module 204, a speech profile storage 206, and a user device 208….”]
transmit the first audio data to the voice recognition server using the established session; [Edwards, Figure 2, and [0060] teaching that the “recognition module 204” may be a networked service.  Figures 7 and 8 showing the “communication interfaces 719, 892” of the user device 702 and the network entity 890 in wireless communication.]
after establishing the session with the voice recognition server, control the communicator to receive a second voice input start signal and second audio data from a second remote controller configured to control operation of the electronic apparatus, while the established session with the voice recognition server is maintained; [Edwards, Figure 3D shows the speech profiles for Alice and Bob and Figures 6A and 6B teach that when Alice and Bob issue voice commands the device can distinguish between the voices of the two.  Each user, Alice or Bob, is likely using his own “remote controller” which may be his phone or watch to contact the hub.  However, the aspect of the input device is not discussed in Edwards.  Focus is on the voice that arrives at the hub.] 
determine whether a user of the first remote controller and a user of the second remote controller are a same user by comparing the first voice input start signal received from the first remote controller with the second voice input start signal received from the second remote controller; [Edwards, Figures 5, 6A, and 6B, the speaker identity is determined from audio data (504) and the change in speaker identity is detected (Figure 6B, 626: set current profile=Bob).  However, the change in speaker identity is not detected by comparing the voices to each other but rather by comparison with the previously stored profiles.]
based on the user of the first remote controller and the user of the second remote controller being the same user, maintain the established session with the voice recognition server, combine the second audio data with the stored first audio data, transmit the combined audio data to the voice recognition server using the maintained session with the voice recognition server, and control the communicator to receive a first result data corresponding to the combined audio data from the voice recognition server re-using the voice recognition parameter of the maintained session; and [Edwards, Figures 5, 6A, and 6B.  This limitation is taught by scenarios of Figures 6A and 6B.  When the user is identified and remains the same user, the context and anaphora resolution are based on the identity of the user and what he said in the chain of commands.  “[0098] An instruction may include a speaker-relative signifier, such as a word, phrase or other combination of phonemes that refers to different data depending on the identity of the speaker. For example, in the instruction "Call my wife," the phrase "my wife" is a speaker-relative signifier because it refers to different sets of contact information depending on the identity of the speaker.   …”  “[0107] …Accordingly, terms that Alice is accustomed to using on her user device (e.g., names she uses for controllable devices/services (e.g., smart home devices), names of her playlists, an identity of her husband and/or information associated with her husband, where her work is located, etc.) are available for Alice's use when speaking with the hub 606.”]
based on the user of the first remote controller and the user of the second remote controller not being the same user, block the established session, establishing a new session with the voice recognition server, transmit the second audio data to the voice recognition server using the established new session with the voice recognition server, and control the communicator to receive a second result data corresponding to the second audio data from the voice recognition server; [Edwards, Figures 5, 6A, and 6B.  A command/instruction is interpreted and executed according to the identity of the speaker.]
wherein the voice recognition parameter includes source information identifying a source of the broadcast signal being received by the electronic apparatus and a state of the electronic apparatus.

Carson (U.S. 2014/0365885) is another very close reference that causes conversation persistence across two different instances of a personal digital assistant that are implemented on two devices.  (See Figure 7 of Carson.)  Context and conversation persist across two or more instances of a digital assistant as the user moves from device to device.  “[0079] … The context information includes user-specific data, vocabulary, and/or preferences relevant to the user input. In some embodiments, the context information also includes software and hardware states of the device (e.g., user device 104 in FIG. 1) at the time the user request is received, and/or information related to the surrounding environment of the user at the time that the user request was received….”  The context is not taught to be the state of the “server system 108” that connects the two devices. 

	DeMerchant (U.S. 2018/0204577) lets each person who is watching TV use a personalized keyword for addressing the TV, so that the TV knows who the person is and provides programming accordingly.  See Figure 3.
Gorodetski (U.S. 2016/0217792) is a diarization reference that has one known/identified speaker; the rest are “compared” to him, and the reference just determines that they are not him.  
Kracun (U.S. 2019/0115029), which is applied to the Claims, is another reference that does not proceed to identify each speaker and rather determines that the second voice is from the same speaker by comparison of input voice to input voice (not voice to voice profile).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Carson (U.S. 2014/0365885), Figure 7.
DeMerchant (U.S. 2018/0204577), Figure 3.
Jang (U.S. 2017/0264939)
Shoemake (U.S. 2015/0243163)
Efrati (U.S. 2014/0359139)
Lee (U.S. 2015/0379992)
Retter (U.S. 20180322868):  “A method for operating a server system that includes a plurality of servers for processing a voice command recorded by a recording device connected, via an interface, to the server system includes, in response to the recording of the voice command, reading in a session activation signal from the recording device; checking if there is an association between the session activation signal and a session ID; if it is established that there is the association between the session activation signal and the session ID, ascertaining an availability of a prior server that previously processed a session assigned to the session ID; and activating the session on the prior server if it is available ….”  Abstract.  “[0007] The session activation signal can be provided, for example, at the beginning or shortly after the beginning of a recording of the voice command, more or less in response to the manipulation of a corresponding switch of the recording device, or upon the speaking of a particular keyword for activating a recording function of the recording device.”

1.	An electronic apparatus comprising:
a communicator comprising communication circuitry configured to communicate with a voice recognition server; and
a processor configured to:
control the communicator to establish a session via account information and a voice recognition command list with the  voice recognition server, based on a first audio data received from a first remote controller configured to control operation of the electronic device, 
identify whether a user of the first remote controller and a user of a second remote controller, configured to control operation of the electronic apparatus, are a same user by comparing the first audio data received from the first remote controller with a second audio data received from the second remote controller, based on receiving the second audio data received from the second remote controller in a state where the session is established,
based on the user of the first remote controller and the user of the second remote controller being the same user:
	maintain the established session,
	combine the second audio data with the first audio data,
control the communicator to transmit the combined audio data to the voice recognition server re-using the account information and the voice recognition command list of the maintained session, and
control the communicator to receive a first result data corresponding to the combined audio data from the voice recognition server re-using of the maintained session, and
based on the user of the first remote controller and the user of the second remote controller being not the same user:
	block the established session and establish a new session,
	transmit the second audio data to the voice recognition server using the established new session with the voice recognition server, and
control the communicator to receive a second result data corresponding to the second audio data from the voice recognition server,
wherein the voice recognition command list includes a plurality of command corresponding to a plurality of functions provided by the electronic apparatus. 

21.	The electronic apparatus as claimed in claim 1, wherein the apparatus is further configured so that:
a user utters a spoken command to the first remote controller followed by the second remote controller;
the spoken command corresponds to the first audio data received from the first remote controller and the second audio data received from the second remote controller;
the electronic apparatus further comprising a memory configured to store the first audio data received from the first remote controller; and
the processor is configured to 
transmit the first audio data to the voice recognition server using the established session, 
combine the second audio data received from the second remote controller with the stored first audio data, and 
transmit the combined audio data to the voice recognition server.

11.	A method of controlling an electronic apparatus, the method comprising:
receiving a first audio data from a first remote controller configured to control operation of the electronic device; 
establishing a session using account information and a voice recognition command list with a voice recognition server based on the first audio data;
identifying whether a user of the first remote controller and a uuser of a second remote controller, configured to control operation of the electronic apparatus, are a same user by comparing the first audio data received from the first remote controller with a second audio data received from a second remote controller, based on receiving the second audio data received from the second remote controller in a state where the session is established;
based on the user of the first remote controller and the user of the second remote controller being the same user:
maintaining the established session,
combining the second audio data with the first audio data,
transmitting the combined audio data to the voice recognition server re-using the account information and the voice recognition command list voice recognition parameter of the maintained session; and
receiving a first result data corresponding to the transmitted the combined audio data from the voice recognition server re-using the maintained session, and
based on the user of the first remote controller and the user of the second remote controller being the same user:
	blocking the established session and establish a new session,
	transmitting the second audio data to the voice recognition server using the established new session with the voice recognition server, and
	receiving a second result data corresponding to the second audio data from the voice recognition server,
wherein the voice recognition command list includes a plurality of command corresponding to a plurality of functions provided by the electric apparatus.

22.	An electronic apparatus comprising: 
a display;
a communicator comprising communication circuitry configured to communicate with a voice recognition server; and 
a processor configured to:
control the communicator to receive a broadcast signal from an external source for display on the display by the electronic apparatus;
control the communicator to receive a first voice input start signal and first audio data from a first remote controller configured to control operation of the electronic apparatus;
control the communicator to establish a session via a voice recognition parameter with the voice recognition server, based on the first voice input start signal received from the first remote controller;
store the first audio data received from the first remote controller; 
transmit the first audio data to the voice recognition server using the established session;
after establishing the session with the voice recognition server, control the communicator to receive a second voice input start signal and second audio data from a second remote controller configured to control operation of the electronic apparatus, while the established session with the voice recognition server is maintained;
determine whether a user of the first remote controller and a user of the second remote controller are a same user by comparing the first voice input start signal received from the first remote controller with the second voice input start signal received from the second remote controller;
based on the user of the first remote controller and the user of the second remote controller being the same user, maintain the established session with the voice recognition server, combine the second audio data with the stored first audio data, transmit the combined audio data to the voice recognition server using the maintained session with the voice recognition server, and control the communicator to receive a first result data corresponding to the combined audio data from the voice recognition server re-using the voice recognition parameter of the maintained session; and
based on the user of the first remote controller and the user of the second remote controller not being the same user, block the established session, establishing a new session with the voice recognition server, transmit the second audio data to the voice recognition server using the established new session with the voice recognition server, and control the communicator to receive a second result data corresponding to the second audio data from the voice recognition server;
wherein the voice recognition parameter includes source information identifying a source of the broadcast signal being received by the electronic apparatus and a state of the electronic apparatus.
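
For illustration only, the same-user / different-user branching recited in Claims 1, 11, and 22 above may be sketched as follows.  This is a minimal sketch under stated assumptions and is not taken from the Disclosure or from any reference of record: the names Session, ElectronicApparatus, on_first_audio, on_second_audio, and the same_user comparator are all hypothetical, audio is modeled as raw bytes, “combining” is modeled as simple concatenation, and the voice recognition server is replaced by an in-memory stub.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Session:
    """Hypothetical stand-in for a session with the voice recognition server."""
    session_id: int
    account_info: str
    command_list: List[str]
    sent: List[bytes] = field(default_factory=list)
    blocked: bool = False

    def transmit(self, audio: bytes) -> str:
        # A blocked session must not be reused.
        if self.blocked:
            raise RuntimeError("session is blocked")
        self.sent.append(audio)
        return f"result-{self.session_id}"  # stub for the server's result data


class ElectronicApparatus:
    """Hypothetical intermediary (e.g., a TV) between remote controllers and server."""

    def __init__(self, account_info: str, command_list: List[str],
                 same_user: Callable[[bytes, bytes], bool]) -> None:
        self.account_info = account_info
        self.command_list = command_list
        self.same_user = same_user  # assumed comparator of two audio inputs
        self.session: Optional[Session] = None
        self.first_audio: Optional[bytes] = None
        self._next_id = 1

    def _establish(self) -> Session:
        session = Session(self._next_id, self.account_info, list(self.command_list))
        self._next_id += 1
        return session

    def on_first_audio(self, audio: bytes) -> str:
        # Establish a session based on the first audio data and store that audio.
        # Transmitting the first audio here follows Claims 21 and 22.
        self.session = self._establish()
        self.first_audio = audio
        return self.session.transmit(audio)

    def on_second_audio(self, audio: bytes) -> str:
        if self.session is None or self.first_audio is None:
            raise RuntimeError("no session has been established")
        if self.same_user(self.first_audio, audio):
            # Same user: maintain the session and send the combined audio.
            return self.session.transmit(self.first_audio + audio)
        # Different user: block the old session and establish a new one.
        self.session.blocked = True
        self.session = self._establish()
        self.first_audio = audio
        return self.session.transmit(audio)
```

In this sketch the comparison is voice input to voice input (first audio against second audio) rather than voice input to a stored profile, which is the distinction drawn above with respect to Edwards and Kracun.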

Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499.  The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached on 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659