Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claims 1-6, 8-16, and 18-20 are pending.  Claims 1 and 11 are independent and have been amended.  Claims 7 and 17 have been canceled.  Claims that included the phrase “external server” have been amended to recite “voice recognition server” instead.
This Application was published as U.S. 2019/0172460.
The earliest apparent priority date is 6 December 2017.
Claims 11-20 are method-claim equivalents of apparatus Claims 1-10.  
Applicant’s amendments and arguments are considered but are either unpersuasive or moot in view of the new grounds of rejection.
	
Please note that, as is, the Claim language is too close to the scenario in Moniz in which the user moves from one Alexa device to another Alexa device and the system maintains the continuity of the command execution by identifying the user and ascertaining that it is the same user who has moved to the second device.  The Claim language now includes a “voice recognition parameter,” which is used to maintain the continuity of the speech recognition sessions.  The “voice recognition parameter” is defined as “input source information.”  The current rejection maps the “Input Source Information” to User Identification because the user is the source of the voice inputs.  Please see the discussion below, particularly on pages 15-19, and further define this phrase and other Claim terms and phrases with more particularity.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 1/12/2021 has been entered.
Specification
The disclosure is objected to because of the following informalities: correct paragraph [0128] of the published Application (same in the Specification as filed) as follows:
[0128] The sensor 224 may include at least one sensor and measure a physical quantity or sense an operating state of the [[electronic]] external apparatus [[100]] 200 to convert measured or sensed data into electrical signals. The sensor 224 may include various sensors (e.g., a motion sensor, a gyro sensor, an acceleration sensor, a gravity sensor, etc.) for detecting motion.

[Image: media_image1.png]

Reason:  This paragraph is a Written Description of Figure 5, and Figure 5 is a drawing of the EXTERNAL apparatus 200, i.e., the remotes, not the TV 100.

[Image: media_image2.png]

See “[0123] Referring to FIG. 5, the external apparatus 200 for controlling the electronic apparatus 100 ….”  Therefore, the “sensor 224” is part of the “external apparatus 200.”  Additionally, having an “acceleration sensor” in a TV (electronic apparatus 100) does not make sense because the TV is not expected to move, whereas motion detectors in remote controls (external apparatus 200) are normal.

Appropriate correction is required.
Specification
The specification is objected to as failing to provide proper antecedent basis for the claimed subject matter.  See 37 CFR 1.75(d)(1) and MPEP § 608.01(o).  Correction of the following is required: 
Independent Claims are amended to include “wherein the voice recognition parameter includes at least one of input source information or an electronic apparatus state.”  
The Specification does not provide an antecedent basis for the phrase “input source information.”  

Suggestion:  define the phrase inside the Claim language with terminology that finds antecedence in the Specification.
For example, if this “input source information” pertains to the Broadcast Source for the TV (see Specification at [0101] and discussion below), then provide:
wherein the input source information identifies a source of broadcast signals being received by the electronic apparatus and displayed to a user,

wherein the user utters a spoken command to the first external electronic apparatus followed by the second electronic apparatus, and
wherein the spoken command is translated into the first voice input start signal and the second voice input start signal.

If the Source Information is the identification of the first external device, define it as the identification of the first external device.  (See Specification [0176] and discussion below.)
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-6, 8-16, and 18-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Independent Claims 1 and 11 are indefinite.  The dependent Claims inherit the indefiniteness and do not include language that would remove the ambiguity.
An antecedent basis issue, together with the terminology of the Claim, causes the ambiguity.
The added wherein clause states:  “wherein the voice recognition parameter includes at least one of input source information or an electronic apparatus state.”  The Claim includes “an electronic apparatus” in the preamble and then “a first external apparatus” and “a second external apparatus” in the body.  It is not clear to which of these the “state” refers.
If the “state” refers to the “state” of the “electronic apparatus” in the preamble, then the limitation should provide:  “wherein the voice recognition parameter includes at least one of input source information or [[an]] a state of the electronic apparatus [[state]].”  This ties the state to “the electronic apparatus” properly.  

While there is one “electronic apparatus” in the Claim and the other two are referred to as “external apparatus,” as provided in the Objection to the Specification above, the term “state,” as defined in the Specification, refers to the “sensed operating state of one of the external apparatuses 200.”  
On the other hand, the “state” could refer to the “Standby State” which could pertain to the “electronic apparatus” /TV or could equally pertain to the “external apparatuses” /Remotes because the Specification is not clear either: 
[0173] Referring to FIG. 10, when the electronic apparatus 100 determines that a new apparatus is connected to the electronic apparatus 100, the electronic apparatus 100 may determine whether the apparatus is in a standby state to receive a voice recognition result (S1005). 
[0174] Here, when the apparatus is not in the standby state to receive the voice recognition result, the electronic apparatus 100 may immediately request the voice recognition server 300 to stop voice recognition (S1020).

In general:  The “electronic apparatus 100” and the two “external apparatuses 200” are very different devices with different roles, and every time the Specification refers to “apparatus” alone, an ambiguity may arise, depending on context.  Claim language can be used to resolve such ambiguities by defining each term with particularity.

Further, because in the Claim the “voice recognition parameter” includes the “state,” and the “voice recognition parameter” is known before the “second external apparatus” even enters the picture, if the “state” belongs to one of the “external apparatuses,” it would belong to the “first external apparatus.”
1.	An electronic apparatus comprising:
a communicator comprising communication circuitry configured to communicate with a voice recognition server; and
a processor configured to:
control the communicator to establish a session via a voice recognition parameter with the  voice recognition server, based on a first voice input start signal received from a first external apparatus, 
maintain the established session based on a second voice input start signal and audio data received from a second external apparatus in a state where the session is established, 
control the communicator to transmit the received audio data from the second external apparatus to the voice recognition server using the maintained session, and
control the communicator to receive a result data corresponding to the audio data from the voice recognition server re-using the voice recognition parameter of the maintained session,
wherein the voice recognition parameter includes at least one of input source information or [[an electronic]] a state of the first external apparatus [[state]].
OR
1.	An electronic apparatus comprising:
a communicator comprising communication circuitry configured to communicate with a voice recognition server; and
a processor configured to:
control the communicator to establish a session via a voice recognition parameter with the  voice recognition server, based on a first voice input start signal received from a first external apparatus, 
maintain the established session based on a second voice input start signal and audio data received from a second external apparatus in a state where the session is established, 
control the communicator to transmit the received audio data from the second external apparatus to the voice recognition server using the maintained session, and
control the communicator to receive a result data corresponding to the audio data from the voice recognition server re-using the voice recognition parameter of the maintained session,
wherein the voice recognition parameter includes at least one of input source information or [[an ]] a state of the electronic apparatus [[state]].  (Or even “a standby state of the electronic apparatus” if the “standby state” is intended.)

The Claim language refers to the two options in the alternative.  Thus, mapping to the indefinite option is not necessary.
Response to Arguments
Parallel independent Claims 1 and 11 are amended to state:
1.	An electronic apparatus comprising:
a communicator comprising communication circuitry configured to communicate with a voice recognition server; and
a processor configured to:
control the communicator to establish a session via a voice recognition parameter with the  voice recognition server, based on a first voice input start signal received from a first external apparatus, 
maintain the established session based on a second voice input start signal and audio data received from a second external apparatus in a state where the session is established, 
control the communicator to transmit the received audio data from the second external apparatus to the voice recognition server using the maintained session, and
control the communicator to receive a result data corresponding to the audio data from the voice recognition server re-using the voice recognition parameter of the maintained session,
wherein the voice recognition parameter includes at least one of input source information or an electronic apparatus state.

11.	A method of controlling an electronic apparatus, the method comprising:
receiving a first voice input start signal from a first external apparatus; 
establishing a session using a voice recognition parameter with a voice recognition server based on the first voice input start signal;  
receiving a second voice input start signal and audio data from a second external apparatus in a state where the session is established;  
maintaining the established session based on the second voice input start signal; and 
transmitting the received audio data from the second external apparatus to the voice recognition server re-using the voice recognition parameter of the maintained session, and
receiving a result data corresponding to the audio data from the voice recognition server using the maintained session,
wherein the voice recognition parameter includes at least one of input source information or an electronic apparatus state.

Figures 7 and 8 and the corresponding Written Description best depict the main idea of the instant Application, which has been emphasized by Applicant’s Arguments:


[Image: media_image3.png]

[Image: media_image4.png]

Key appears to be the definition for “maintaining a communication session” which is accomplished through the “voice recognition parameter.”  Applicant refers to Figure 5, which is a diagram of an external apparatus 200, and Figure 6 as key. Figure 6 pertains to the “maintaining of a session” and provides:   


[Image: media_image5.png]

[Image: media_image6.png]

	
The paragraphs cited by the Applicant (see p. 11) and other pertinent paragraphs are provided below.  Paragraph [0141] characterizes the prior art from the viewpoint of the inventors, and paragraph [0143] enumerates the operations that the prior art presumably performs and that are omitted from the instant invention.
 [0003] An electronic apparatus that receives audio data and transmits the audio data to a voice recognition server may establish a session with the server. The electronic apparatus may use an external apparatus to receive the audio data. Here, in the case of a switching operation in which the external apparatus that receives the audio data is changed to another external apparatus, the existing session is blocked and a new session is connected. 
[0004] That is, in the related art, in the case of a switching operation of attempting to recognize another external apparatus while receiving the audio data using the external apparatus, a session with the existing server is blocked and a new session is established. In this process, unnecessary processing time and waste of traffic for connecting the server occur.
…
[0070] The voice recognition parameter may include, for example, and without limitation, at least one of: currently input source information and an apparatus status. The voice recognition command list may include, for example, and without limitation, at least one of: application information used in the electronic apparatus 100, EPG data of a currently input source, and commands for functions provided by the electronic apparatus 100. The electronic apparatus 100 may maintain information about an existing session as it is, and thus stability of the session may be ensured.
…
[0078] The voice recognition server 300 may perform only the STT (Speech To Text) function whereas a separate server may perform the search function. In this case, the server performing the STT (Speech To Text) function may convert the digital voice signal into the text information and transmit the converted text information to the separate server performing the search function. The electronic apparatus 100 according to an example embodiment of the present disclosure may maintain an established session using information about the existing established session without establishing a new session in the case of a switching operation, and thus an unnecessary processing time and traffic for a server connection may not be wasted.
…
[0139] FIG. 6 is a diagram illustrating an example operation of maintaining a session in a switching process and comparing it with related art that does not maintain a session.
[0140] Referring to FIG. 6, a switching process in the related art and a switching process in the present disclosure may be compared.
[0141] In the related art, when there is a voice recognition start command from the first external apparatus 200-1, a session with the voice recognition server 300 was established. Then, audio data was received from the first external apparatus 200-1 and a voice recognition process was performed. Here, it is assumed that there is switching in which a voice input start signal is received from the second external apparatus 200-2. In the related art, when there is switching, voice recognition was terminated with respect to the audio data received from the first external apparatus 200-1, and the session with the existing voice recognition server 300 was blocked. Here, blocking the session means that the session is closed (e.g., ended). Then, voice recognition was started with respect to the audio data received from the second external apparatus 200-2, a new session was established with the voice recognition server 300, and the voice recognition process was performed. That is, in the related art, when there is voice recognition switching from the first external apparatus 200-1 to the second external apparatus 200-2, the existing session was blocked and a new session was connected. 
[0142] On the other hand, the electronic apparatus 100 according to an example embodiment of the present disclosure may maintain an existing session. For example, since the electronic apparatus 100 may maintain the existing session even when there is switching while performing the voice recognition process on the audio data received from the first external apparatus 200-1, the electronic apparatus 100 may continuously perform voice recognition on the audio data received from the second external apparatus 200-2.
[0143] Referring to FIG. 6, the electronic apparatus 100 according to an example embodiment of the present disclosure may omit an operation of terminating voice recognition with the first external apparatus 200-1, an operation of blocking the existing session established for processing the audio data received from the first external apparatus 200-1, and an operation of establishing a new session for processing the audio data received from the second external apparatus 200-2, compared to the related art. 
[0144] Therefore, the electronic apparatus 100 may reduce a processing time of the entire voice recognition process by a time of the omitted operations. As described above, the electronic apparatus 100 according to an example embodiment of the present disclosure may maintain the established session using information on the established session without establishing a new session when there is a switching operation, and thus unnecessary processing time and the traffic for server connection may not be wasted.
…
[0175] Here, when the voice recognition result is not received, the electronic apparatus 100 may determine whether the voice recognition result is received (S1010), waits until the voice recognition result is received, and when the voice recognition result is received, may store the voice recognition result in the memory 140 (S1015). Then, the electronic apparatus 100 may request the voice recognition server 300 to stop voice recognition immediately (S1020). 

[0176] The electronic apparatus 100 may reuse voice recognition parameters used in an existing established session and change or process some parameters to use them (S1025). In this case, the processed parameters may be a current time or ID information of the apparatus. Also, the electronic apparatus 100 may reuse a recognition command used in the existing established session (S1030). The electronic apparatus 100 may maintain information about the existing session as it is, and thus stability of the session may be ensured.

(Published Application.)
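
The contrast drawn in paragraphs [0141]-[0143] can be illustrated with a minimal sketch.  It is offered only as a reading aid and is not taken from the Application: the names FakeServer, Session, switch_related_art, switch_maintained, and count_setups are hypothetical, the parameter values are placeholders, and the server is a stand-in that merely counts session setups and teardowns.

```python
from dataclasses import dataclass, field


@dataclass
class FakeServer:
    """Hypothetical stand-in for the voice recognition server 300."""
    sessions_opened: int = 0
    sessions_closed: int = 0

    def open_session(self, params: dict) -> "Session":
        self.sessions_opened += 1
        return Session(server=self, params=dict(params))

    def close_session(self, session: "Session") -> None:
        self.sessions_closed += 1
        session.is_open = False


@dataclass
class Session:
    server: FakeServer
    params: dict
    is_open: bool = True
    audio_log: list = field(default_factory=list)

    def send_audio(self, source: str, audio: bytes) -> None:
        if not self.is_open:
            raise RuntimeError("session is closed")
        self.audio_log.append((source, audio))


def switch_related_art(server, session, params):
    """Related art as characterized in [0141]: block the old session, connect a new one."""
    server.close_session(session)
    return server.open_session(params)


def switch_maintained(server, session, params):
    """Per [0142]-[0143]: the existing session is kept; nothing is closed or reopened."""
    return session


def count_setups(switch):
    """Run one switch from apparatus 200-1 to 200-2 and report (opened, closed)."""
    server = FakeServer()
    # Placeholder values only; the Claims do not say what these fields contain.
    params = {"input_source": "placeholder", "apparatus_state": "placeholder"}
    session = server.open_session(params)      # first voice input start signal
    session.send_audio("200-1", b"first")
    session = switch(server, session, params)  # second voice input start signal
    session.send_audio("200-2", b"second")
    return server.sessions_opened, server.sessions_closed
```

Under these assumptions, count_setups(switch_related_art) reports two session setups and one teardown, while count_setups(switch_maintained) reports one setup and no teardown, which corresponds to the saving described in [0144].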

Note also the following drawings of the instant Application:
   
[Image: media_image7.png]

[Image: media_image8.png]


Figure 10, S1025 pertains to “Reuse and Process Recognition Parameter.”
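
As a reading aid for step S1025 and paragraph [0176], the following sketch shows one way that “reuse … and change or process some parameters” might look.  It is an illustration under stated assumptions, not the Application's implementation: the dictionary keys and the function name reuse_and_process are hypothetical, and the sketch assumes that the “ID information of the apparatus” in [0176] refers to the external apparatus, a point on which the Specification is not clear.

```python
import time
from typing import Optional


def reuse_and_process(existing_params: dict,
                      new_apparatus_id: str,
                      now: Optional[float] = None) -> dict:
    """Copy the existing session's parameters and refresh only time and apparatus ID.

    Assumption: only these two fields are "processed"; [0176] says the processed
    parameters "may be a current time or ID information of the apparatus."
    """
    updated = dict(existing_params)  # reuse every other parameter unchanged
    updated["current_time"] = time.time() if now is None else now
    updated["apparatus_id"] = new_apparatus_id
    return updated
```

In this sketch the original dictionary is left untouched, so the parameters of the established session remain available as they were, and the fields other than the time and the apparatus ID carry over unchanged.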

Voice Recognition Parameter
The phrase “voice recognition parameter” is now key because it is used for “establishing a session using a voice recognition parameter with a voice recognition server based on the first voice input start signal;” and then re-used for “transmitting the received audio data from the second external apparatus to the voice recognition server re-using the voice recognition parameter of the maintained session,” and is defined as “wherein the voice recognition parameter includes at least one of input source information or an electronic apparatus state.”
Input Source Information
The Specification does not provide antecedent basis for the phrase “input source information.”  
In one place in the Specification, “Source” is defined as a broadcast source:  “[0101] The tuner 150 may receive broadcast signals from various sources such as terrestrial broadcast, cable broadcast, satellite broadcast, or the like. The tuner 150 may receive broadcast signals from sources such as analog broadcast or digital broadcast from various sources.”  “[0103] … The broadcast signal may include video, audio, and additional data (e.g., an EPG (Electronic Program Guide).”
On the other hand, the Specification also includes:  “[0176] The electronic apparatus 100 may reuse voice recognition parameters used in an existing established session and change or process some parameters to use them (S1025). In this case, the processed parameters may be a current time or ID information of the apparatus….”  In this case, the “source” is probably the “external apparatus 200.”  However, the above passage is not clear as to which “apparatus” it refers to: 100 or 200.
Electronic Apparatus State
 As for the “Electronic Apparatus State,” it is not clear whether the “State” pertains to the first or second external apparatuses (the Remotes) or the electronic apparatus which is the subject of the preamble (the TV).   
See the 112(b) rejection above.
The Specification provides:  “[0128] The sensor 224 may include at least one sensor and measure a physical quantity or sense an operating state of the electronic apparatus 100 {Examiner note: this should be “external electronic apparatus 200”} to convert measured or sensed data into electrical signals. The sensor 224 may include various sensors (e.g., a motion sensor, a gyro sensor, an acceleration sensor, a gravity sensor, etc.) for detecting motion.”  As provided in the Objections to the Specification and Claims, above, Examiner believes that paragraph [0128] of the Specification should be amended to reflect that the sensor 224 is part of the EXTERNAL electronic apparatus 200.  See the supporting material in the Objection.  
On the other hand, the “state” may refer to a state of the electronic apparatus 100 (TV) in which case it may be referring to a “standby state”:  “[0173] Referring to FIG. 10, when the electronic apparatus 100 determines that a new apparatus is connected to the electronic apparatus 100, the electronic apparatus 100 may determine whether the apparatus is in a standby state to receive a voice recognition result (S1005).”

Claim Language Clarification Needed
Under the current Claim language, maintaining the session is based on a “parameter” whose definition is either broad or indefinite: “establishing a session using a voice recognition parameter …maintaining the established session … re-using the voice recognition parameter of the maintained session … wherein the voice recognition parameter includes at least one of input source information or an electronic apparatus state.”

Two choices are provided for the “voice recognition parameter”: either “input source information” or “electronic apparatus state.”  As set forth above, each choice at best lacks particularity and at worst is indefinite.

What is this parameter?  Is it the identifier of the TV Channel (ABC or PBS) that is being broadcast from the TV/device 100?  Is it some gesture or motion of the remote/external devices 200?  Is it the Identifier of the TV?  Is it the identifier of the First Remote?  Is it the Identifier of the Second Remote?  Which is it?  Currently, the “input source information” in the last “wherein clause” is mapped to the User/Speaker information/identity because the user is the source of the input to the External Apparatuses.  Under this interpretation, the “voice recognition parameter” is merely a Speaker ID.

Please provide at least one narrow Claim with clear and particular claim language.  For example, provide a picture Claim that recites the TV, the first remote, and the second remote instead of the generic “apparatus,” and that recites a Speaker Identifier or whatever other particular aspect is intended by the “voice recognition parameter.”  Once an agreement about allowability is reached in concept, you can determine how much broader you can claim in view of the references that become available to you based on the more particular language.

Response to Arguments Regarding  “Maintaining the Session”
Applicant argues:

[Image: media_image9.png]

[Image: media_image10.png]
 (Applicant’s Response, p. 11.)
As provided in previous Office actions:  
With respect to Moniz, the entire point of Moniz is to maintain continuity when the user moves from device to device and room to room as shown in Figures 5A and 5B.  “For example, as shown in FIG. 5A, a user 15a in Room 1 speaks an utterance to device 110a and asks a question such as "How old is the President?" The system may then process the audio of the utterance, determine an answer, and send output audio data back to device 110a to respond "Barack Obama is fifty-five years old." The user may then walk from Room 1 to Room 2 (shown in FIG. 5B) and speak a new utterance to device 110b asking "when was he sworn in?" The system may be configured to recognize that the word "he" in the second utterance corresponds to the entity referred to in the first utterance.”  Col. 21, lines 32-43.  The teachings of Moniz contain nothing contrary to “maintaining” the sessions.  Rather, they are directed to many different methods of maintaining continuity when the speaker changes device or even when two different speakers use the same device or different devices.  One of Moniz’s scenarios, i.e., Figures 5A and 5B, maps to the Claims.
Moniz teaches continuity of communication between the input device and server when the speaker jumps from input device/ “first external apparatus” to input device/ “second external apparatus” and Mozer teaches an intermediary / “electronic apparatus” between the input devices / “first and second external apparatuses” and the server.  In the Claims, it appears that the TV is just sitting there as an intermediary when it comes to the communication session; the key is the handover between the remotes.  This is taught by Moniz.  (Additionally, Alexa devices are used for turning other appliances on and off.  Thus, a simple turning on and off of a TV would not be far from the teachings of Moniz.)
Moniz has the speech controlled device (Alexa) directly communicating with the server or servers.  Mozer is a 3-device reference that was cited for teaching an intermediary device such as a TV that has to receive the voice from the remote and transmit it to an ASR server.  “Systems and methods for improving the interaction between a user and a small electronic device such as a Bluetooth headset are described. The use of a voice user interface in electronic devices may be used. In one embodiment, recognition processing limitations of some devices are overcome by employing speech synthesizers and recognizers in series where one electronic device responds to simple audio commands and sends audio requests to a remote device with more significant recognition analysis capability….”  Mozer, Abstract.  Voice control is taught by both Moniz and Mozer where either the speech controlled device (Alexa in Moniz) or the electronic device with access point (TV in Mozer) is controlled by the content of the input speech.

Applicant’s Arguments Directed to New Claim Language
Applicant refers to the limitation added by amendment and certain passages of Moniz and in this respect, Applicant dismisses Moniz as follows:

[Image: media_image11.png]

(Applicant’s Response, p. 15.)
In Reply, first note the language of dependent Claim 13 (same as Claim 3), which further defines the method of the independent Claim for “determining whether to maintain the established session” and includes the same process as Moniz, which uses user identification to determine that the same user is speaking and that the same session must be continued:  “13. The method as claimed in claim 12, wherein the determining of whether to maintain the established session includes determining whether the user of the first external apparatus and the user of the second external apparatus are the same user by comparing the first voice input start signal received from the first external apparatus with the second voice input start signal received from the second external apparatus.”  Thus, Moniz not only teaches the broad language of the independent Claim, it also teaches the further limitation claimed in the dependents down the line, and it does so by the very user identification that Applicant characterizes disparagingly.
	Further, in reply, Moniz keeps the continuity of speech recognition and command execution when the user/speaker moves from one room to another and therefore goes from one Alexa device to another Alexa device.  See Figures 5A and 5B of Moniz.  The identification of the speaker through his voice is performed to assist in determining the intent of the command and in anaphora resolution.  The user asks in the first room, from the first Alexa device, "How old is the President?" and when the user walks from Room 1 to Room 2 and asks from a second Alexa device "when was he sworn in?" the central server knows that the “he” in this second question refers to the same “President” in the first question based on the identity of the user.  See Moniz, Col. 21, lines 20-50.  Continuity is maintained between the session of the first Alexa device with the server and the session of the second Alexa device with the server.
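
The reading of Moniz applied here can be sketched as follows.  This is a simplified illustration of the characterization above (conversational context keyed to the identified speaker rather than to the receiving device) and is not Moniz's disclosed implementation: the names ContextStore, remember, and resolve_pronoun are hypothetical, only the pronoun "he" is handled, and speaker identification is assumed to have already produced a user ID.

```python
class ContextStore:
    """Hypothetical server-side store of the last entity mentioned, kept per user."""

    def __init__(self):
        self._last_entity = {}

    def remember(self, user_id: str, entity: str) -> None:
        # The key is the user, not the device, so the context follows the speaker.
        self._last_entity[user_id] = entity

    def resolve_pronoun(self, user_id: str, utterance: str) -> str:
        """Replace the word "he" with the last entity this user referred to, if any."""
        entity = self._last_entity.get(user_id)
        if entity is None:
            return utterance
        words = [entity if w.lower() == "he" else w for w in utterance.split()]
        return " ".join(words)


store = ContextStore()
store.remember("user-15a", "the President")          # first utterance, device 110a
resolved = store.resolve_pronoun("user-15a", "when was he sworn in?")  # device 110b
```

Because the device never appears as a key in this sketch, the second utterance resolves the same way regardless of which device receives it, while an utterance from a different, unrecognized user is returned unchanged.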
	
Applicant further argues:

[Image: media_image12.png]

(Applicant’s Response, p. 15.)
In Reply, as provided in the Objections, 112 Rejection, and the Discussion of the amended language above, the “voice recognition parameter” is not defined with particularity or clarity.  As is, the “voice recognition parameter” is mapped to the user identity which corresponds to the “input source information” of the Claim.  The input is coming from the speaking user and the speaking user is an “input source.”  Under this interpretation, the “input source information” is user/speaker identity.  In Moniz, the identity of the user, as determined from his voice, is used as a parameter to maintain the continuity of sessions as the user moves from one room/device to another room/device.

Patentability of the other independent Claims is argued based on their similarity to Claim 1. Accordingly, the above provides a reply to those arguments as well.
Patentability of the dependent Claims is argued based on their dependence from their base independent Claims. Accordingly, the above provides a reply to those arguments as well.

Note Also:
Additionally, and as provided in the previous Office actions, the “maintaining” is defined in the Specification by what the instant Application “does not do” which allegedly the prior art “does.”  (See “[0143] Referring to FIG. 6, the electronic apparatus 100 according to an example embodiment of the present disclosure may omit an operation of terminating voice recognition with the first external apparatus 200-1, an operation of blocking the existing session established for processing the audio data received from the first external apparatus 200-1, and an operation of establishing a new session for processing the audio data received from the second external apparatus 200-2, compared to the related art.”  Published Application.)  Thus, even if such detail were included in the Claim language, this type of argument is premised on reading into the prior art (including the cited art) steps that are not present in the disclosure of the prior art.  In other words, when the prior art does not teach what steps its switching operation entails, we would have to assume that the switching operation of the prior art includes the above steps in order to distinguish the Claim from the prior art.  This we cannot do.  Even when an invention is directed to a simplification, normally it uses some superior method that can be claimed.  We look at what the Claim HAS that the prior art does not teach.  

The “server” in the Claims has been amended every time since the start of the prosecution:
Claims of 12/05/2018:

    [Image: media_image13.png]

Claims of 8/21/2020:

    [Image: media_image14.png]

Claims of 12/14/2020:

    [Image: media_image15.png]

Claims of 1/12/2021:

    [Image: media_image16.png]

The Disclosure that provides support for the Claims shows an external voice recognition server.  There are no embodiments in which the voice recognition server is internal to the device 100.  Additionally, the point of this server is to perform voice recognition.  

    [Image: media_image17.png]


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 11, 15-16 and 18 and 1, 5-6, and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Moniz (U.S. 10,482,885) (filed 16 November 2016) in view of Mozer (U.S. 2009/0204410).

Claims 1, 5-6, and 8 are device claims with limitations similar to the limitations of method Claims 11, 15-16, and 18 and are rejected under similar rationale.  The structural components are noted in the rejection of method claims to cover the device claims.

Moniz is directed to Amazon’s Alexa:

    [Image: media_image18.png]

    [Image: media_image19.png]



    [Image: media_image20.png]


    [Image: media_image21.png]


    [Image: media_image22.png]


Regarding Claim 11, Moniz teaches:
11. A method of controlling an electronic apparatus, [Moniz is directed to “command execution” and the commands include the “control” of the “speech controlled devices 110” to perform the command:  “Depending on system configuration, a speech processing system may be capable of executing a number of different commands such as playing music, answering queries using an information source, opening communication connections, sending messages, shopping, etc….”  Col. 2, lines 19-39.  Figure 1A, 146:  “… The server 120 may then execute (146) a command corresponding to the second text using the first entity.”  Col. 6, lines 21-51.  Figure 9 shows the hardware components of the “Server 120” / “electronic apparatus,” including “communication circuitry” / “I/O Device Interfaces 902,” “Controllers/Processors 904” and “Memory 906.”  The “server 120” is not shown as having a display.]
the method comprising: [In the instant Application the “electronic apparatus” is a TV that is being controlled by commands given to two different remote control devices which are the “first and second external apparatus” of the Claim.  The TV communicates with a speech recognition server.  In Moniz the device that is being controlled and the external apparatuses are the same “device 110” or the device that is controlled may be some other media player that Alexa controls.]
receiving a first voice input start signal from a first external apparatus; [Moniz, Figure 1A:  The “Server 120” receives the voice input from the “speech controlled device 110a”/ “first external apparatus.”  Figure 1A, also shows “First input audio 11a” which can be input after a “wakeword” / “First voice input start signal” of the Claim.   “The wakeword detection module 220 works in conjunction with other components of the device 110, for example a microphone (not illustrated) to detect keywords in audio 11....”  Col. 10, lines 11-19.  “Thus, the wakeword detection module 220 may compare audio data to stored models or data to detect a wakeword…”  Col. 10, lines 56-57.  See Col. 23, lines 19-24, where the users begin with “Alexa” and continue with their command such as:  “Alexa, play some Weird Al.”  Further, the “speech controlled devices 110a, 110b” are communicating via the “networks 199” with the “Servers 120.”]
establishing a session using a voice recognition parameter with a voice recognition server based on the first voice input start signal; [Moniz, the “first voice input start signal” is mapped to “wakeword” of Moniz and in response to receiving a “wakeword” / “first voice input start signal,” the Alexa device starts communicating with the server which means that it has to establish a communication session.  The servers 120 perform speech recognition on the received audio data and also identify the speaker which is mapped to “using a voice recognition parameter” of the Claim:  Figure 1A, 132: determine first audio data is associated with first speaker ID.  Figures 1A-1D, the “Speech controlled device 110a, 110b” establish a communication session with the “servers 120” that perform the “speech processing” functions including speech recognition:  “1 … performing automatic speech recognition on the first input audio data to obtain first text data; processing the first text data to determine that the first text data includes a name of a first person; storing association data associating between a first speaker identifier (ID) associated with the first speaker, a first device ID associated with the first speech-controlled device, and a first entity ID associated with the first person;…”  “The server 120 may receive (130), from the first device 110a, first audio data corresponding to the first utterance. The server 120 may determine that the first audio data is associated with a first device ID (e.g., an ID associated with device 110a). The server 120 may also determine (132) that the first audio data is associated with a first speaker ID. For example, the server 120 may determine that the user 1 spoke the first utterance. This determination may be done by performing speaker identification on the first audio data to determine that the first audio data corresponds to user 1.”  Col. 5, lines 51-65.  
Figure 2 shows that the “Server(s) 120” includes the “Automatic Speech Recognition 250” and “NLU 260.”  If more than one “server” is included in the “Server(s) 120,” one server may relay the speech to another for voice recognition.  “The server(s) 120 (which may be one or more different physical devices) may be capable of performing traditional speech processing (e.g., ASR, NLU, command processing, etc.) as described herein. A single server 120 may perform all speech processing or multiple servers 120 may combine to perform all speech processing.”  Col. 5, lines 7-13.  “Once the wakeword is detected, the local device 110 may "wake" and begin transmitting audio data 111 corresponding to input audio 11 to the server(s) 120 for speech processing. Audio data corresponding to that audio may be sent to a server 120 for routing to a recipient device or may be sent to the server for speech processing for interpretation of the included speech (either for purposes of enabling voice-communications and/or for purposes of executing a command in the speech). The audio data 111 may include data corresponding to the wakeword, or the portion of the audio data corresponding to the wakeword may be removed by the local device 110 prior to sending.”  Col. 11, lines 16-27.  There is no intermediary device in Moniz.  (This Claim is written from the viewpoint of the TV of the instant Application which is an intermediary device between the devices that get the voice input and the server that does the speech recognition and other tasks.)  ]  
receiving a second voice input start signal and audio data from a second external apparatus in a state where the session is established;  [Moniz, Figure 1A, the “Server 120” receives the second wakeword / “second voice input start signal” from the “speech controlled device 110b”/ “second external apparatus” and this wakeword is associated with “Second input audio 11b”/ “audio data.”  The session has been established.  The point of Moniz is not to lose continuity when the user moves from device to device and room to room as shown in Figures 5A and 5B.  As shown in various scenarios of Col. 23, the system (server) knows it is dealing with the same thread of speech inputs if either the same user ID (voice or wakeword indicating same person) or same user account (different devices in the same house) are used when the user moves from device to device.  Each wakeword/voice input start signal has its own audio data (command, query) following the wakeword.]
maintaining the established session based on the second voice input start signal; and [Moniz does not indicate that it drops the session.  The point of Moniz is not to lose continuity of NLP when the user moves from device to device and room to room as shown in Figures 5A and 5B and Col. 23, e.g.  “At a later point in time, a second speech-controlled device 110b may capture audio of a second spoken utterance (i.e., second input audio 11b) from first user 5a. The server 120 may receive (136), from the second device, second audio data corresponding to the second utterance. The server 120 may determine that the second audio data is associated with a second device ID (e.g., an ID associated with device 110b). The server 120 may also determine (138) that the second audio data is associated with the first speaker ID. This determination may also be done by performing speaker identification or may be performed using other techniques. The server 120 may process (140) the second audio data to determine second text (for example user ASR processing). The server 120 may then determine (142) that the second text includes a word corresponding to an entity, but the entity is not itself represented in the second audio data and therefore the word may constitute anaphora, exophora, or the like. This determination may be made using an NLU component such as a named entity recognition component 262 (discussed below) or other component. The server 120 may then determine (144), using the first speaker ID, the first and/or second device IDs and/or other information (such as the relative locations of devices 110a and 110b, the time between receipt of the first input audio data and second input audio data, or other information), that the word corresponds to the first entity from the first utterance. This may include determining that the first utterance and second utterance are part of the same conversation and thus the anaphora in the second utterance relates to the first utterance. 
The server 120 may then execute (146) a command corresponding to the second text using the first entity.”  Col. 6, lines 21-51.  (Moniz is silent regarding the communication related features.  However, so are the Claim and its supporting Specification. The Claim merely uses the term “maintaining” which is taught by the example of Figures 5A and 5B of Moniz.  Additionally, the supporting Specification includes no specifics regarding by what method the session is maintained.)]
transmitting the received audio data from the second external apparatus to the voice recognition server re-using the voice recognition parameter of the maintained session, and [Moniz, Figure 1A, the “speech controlled device 110b”/ “second external apparatus” transmits the speech that it receives “Second input audio 11b” / “received audio data” to the “Server 120.”   Here, the server uses / “re-uses” the same User ID / “voice recognition parameter” of the first speaker to determine that it is the same user talking and this information is used in the NLU processing:  “… The server 120 may determine that the second audio data is associated with a second device ID (e.g., an ID associated with device 110b). The server 120 may also determine (138) that the second audio data is associated with the first speaker ID….. The server 120 may then determine (144), using the first speaker ID, the first and/or second device IDs and/or other information (such as the relative locations of devices 110a and 110b, the time between receipt of the first input audio data and second input audio data, or other information), that the word corresponds to the first entity from the first utterance. ….”  Col. 6, lines 21-51.]
receiving a result data corresponding to the audio data from the voice recognition server using the maintained session, [Moniz, Figure 1A, 134 and 140.  The ASR and NLU of the servers 120 operate on the audio data that is being received from the two “speech controlled devices 110a and 110b” in tandem and generate a result that is either executing a command or responding to a query such as “Alexa, where is the nearest Starbucks?” or “Alexa, play some Weird Al.”  Col. 23, lines 18-23.]
wherein the voice recognition parameter includes at least one of input source information or an electronic apparatus state. [Moniz, Figure 1A and Figure 4: the information regarding the device 110 or the user 5a both constitute “input source information” and as provided above, the server 120 uses the speaker/user ID to determine that the same user is speaking from a different device and maintain the continuity of the session. Additionally, the information regarding both the device and the user are stored in the user profile and are used for identifying the user and can be considered part of the voice recognition parameters.  (Note the Objections above.  The Claim is quite broad with respect to what the “input source information” is and indefinite with respect to the “electronic apparatus state.”)]
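For illustration only, the speaker-ID-based continuity mapped above can be sketched as follows.  This is a minimal sketch under stated assumptions and is not code from Moniz or from the instant Application: the class and method names (SpeechServer, Conversation, handle) are hypothetical, and the anaphora resolution of Moniz is reduced here to substituting the stored entity for the single pronoun "he."

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Conversation:
    """State kept per speaker ID (hypothetical structure, for illustration)."""
    speaker_id: str
    last_entity: Optional[str] = None
    device_ids: List[str] = field(default_factory=list)


class SpeechServer:
    """Hypothetical stand-in for a server that tracks conversations by speaker ID."""

    def __init__(self) -> None:
        self.conversations: Dict[str, Conversation] = {}

    def handle(self, device_id: str, speaker_id: str, text: str,
               entity: Optional[str] = None) -> str:
        # Re-use the conversation of a known speaker, whichever device is used.
        conv = self.conversations.get(speaker_id)
        if conv is None:
            conv = Conversation(speaker_id)
            self.conversations[speaker_id] = conv
        conv.device_ids.append(device_id)

        if entity is not None:
            # The utterance names an entity: remember it for later utterances.
            conv.last_entity = entity
            return text

        if conv.last_entity is not None:
            # Simplified assumption: only the pronoun "he" is resolved.
            words = [conv.last_entity if w == "he" else w for w in text.split()]
            return " ".join(words)
        return text


server = SpeechServer()
server.handle("110a", "user1", "how old is the President", entity="Barack Obama")
resolved = server.handle("110b", "user1", "when was he sworn in")
print(resolved)
```

In this sketch the second utterance arrives from a different device ID but the same speaker ID, so the stored entity from the first utterance is re-used rather than starting over.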

Moniz teaches that the voice input begins with the “wakeword” “Alexa” which teaches the “voice input start signal” of the Claim.  “"Alexa, where is the nearest Starbucks?"”  Col. 23, line 20.  Figure 2, “wakeword detection module 220” on the “device 110.”  However, the “voice input start signal” of the Claim appears to be just a Bluetooth signal from one device to another and is not a trigger or wake word.
Moniz shows two sets of devices: the speech controlled devices that the user talks to and the remote servers which perform the speech recognition.
Moniz teaches that more than one set of servers may be involved such that the received speech at one server is sent to another server with ASR capability.
However, the instant Application has the 3-player system where a remote control receives the speech and communicates it to the TV via a Bluetooth connection and the TV then communicates the message to the recognition server.
Accordingly, while Moniz can teach the 3-player configuration of the Claim when two different servers are used, a second reference is added.

Mozer, Figure 1:

    [Image: media_image23.png]


    [Image: media_image24.png]

Mozer teaches:
11. A method of controlling an electronic apparatus, [Mozer, “[0003] …Other small electronic products such as television remote controls have become covered in buttons and capabilities that are overwhelming to non-technical users…”  “[0024] …The electronic device may be small and light enough to be worn like … or some other form of headgear or bodily apparel. It can also contain functions of a vehicle, a navigation device, a clock, a radio, a remote control such as used for controlling a television set, etc….”]
the method comprising: [Mozer, Figure 2, Claim viewed from viewpoint of the intermediary “Electronic device with access point 202” which is like the TV in Figure 1 of the instant Application.]
receiving a first voice input start signal from a first external apparatus; [Mozer, Figure 2, the “electronic device with voice user interface 201” teaches the “first external apparatus” of the Claim and the voice input to this device is received at the “electronic device with access point 202” from whose point of view the Claim is drafted.  “[0024] … The electronic device may be small and light enough to be worn like jewelry or to be embedded in clothing, shoes, a cap or helmet, or some other form of headgear or bodily apparel. It can also contain functions of a vehicle, a navigation device, a clock, a radio, a remote control such as used for controlling a television set, etc. … Thus, the small electronic device associated with the first synthesizer and recognizer may contain a Bluetooth interface, a cell phone, an internet address, and the like.”  The “first voice input start signal” is “connection 209” “[0059] …In this example, connection 209 may be a Bluetooth wireless connection and connection 210 may be a cellular or 802.11 wireless connection….”  “[0105] …The connection to the remote units, which may be a radio frequency or infra-red signal, an ultrasonic device, Bluetooth connection, WiFI, Wimax, cable or other wired or wireless connection, allows the small electronic device to both control the operation of the remote units and to retrieve desired information from the remote units….”  Bluetooth sends a start/initiation signal.]
establishing a session via a voice recognition parameter with a voice recognition server based on the first voice input start signal; [Mozer, Figure 2, the “electronic device with access point 202” is connected through the “network 203” with several “servers 204, 205, 206” all of which include a “speech recognition system 208, 214, 253.”  The “voice recognition parameter” is mapped to user identification (see Mozer Figure 4, 406 and [0087]) and user identification is a parameter that is taken into account in the establishing of the communication session. ]
receiving a second voice input start signal and audio data from a second external apparatus in a state where the session is established; [Mozer, Figures 2 and 3.  The “Bluetooth headset 301,” which is the same as the “Electronic Device with Voice User Interface 201” and is described as “the headset 301, coupled through the Bluetooth network 326 to cell phone 302” ([0071]), keeps providing the input voice of the user to the “cellular phone 302” / “electronic apparatus” of the Claim.  This is not a “second” apparatus.  In Mozer each user has his own device.]  
maintaining the established session based on the second voice input start signal; [Mozer, Figures 2 and 3. As long as speech is coming the Bluetooth session is maintained.  The features of establishing a session and maintaining the session and “input start signal” pertain to the communication aspects and some such as the “input start signal” are inherent in the operation of Bluetooth.  But Mozer does not involve going from one user device to another and therefore this limitation is not taught by Mozer.]
transmitting the received audio data from the second external apparatus to the voice recognition server re-using the voice recognition parameter of the maintained session, and [Mozer, Figures 1, 2, and 3, the speech coming from the remote/external apparatus is sent to the server for recognition.]
receiving a result data corresponding to the audio data from the voice recognition server using the maintained session,  [Mozer, Figure 3, “recognizer 319” in the “Cellular phone 302.”   “[0069] … In one embodiment, the remaining utterances ("John Smith cell") may be sent to a recognizer 319 on the cellular phone. Recognizer 319 may be optional for cellular phone 302. Recognizer 319 may be used to recognize the utterances in the context of contact information 322 stored within the cellular phone 302….”  Speech recognition of Mozer is for command execution; the “result” is the executed command which may be providing information such as a phone number.  “[0051] During a voice user interface session, a user may speak to device 101, and the speech input may include one or more utterances (e.g., words or phrases) which may comprise a verbal request made by the user. The speech is converted into digital form and processed by recognizer 104. Recognizer 104 may be programmed with a recognition sets corresponding to commands to be performed by the voice interface (e.g., Command 1, . . . Command N). In one embodiment, the initial recognition set includes only one utterance (i.e., the initiation word or phrase), and the recognition set is reconfigured with a new set of utterances corresponding to different commands after the initiation utterance is recognized. For example, recognizer 104 may include utterances in the recognition set to recognize commands such as "Turn Up Speaker", "Turn Down Speaker", "Establish Bluetooth Connnection", "Dial Mary", or "Search Restaurants". The recognizer 104 may recognize the user's input speech and output a command to execute the desired function….For example, recognizer 104 may recognize the utterance "search" as one of the commands in the recognition set and notify the controller 103 that the command "search" has been recognized. 
Program 109 running on controller 103 may instruct the controller to send the remainder of the verbal request (i.e. "Bob's Restaurant in Fremont") to a remote electronic device 106 through transceiver 118 and communication medium 110. Electronic device 106 may utilize a more sophisticated recognizer 114 which may recognize the remainder of the request. Electronic device 106 may execute the request and return to the voice user interface data which may be converted to speech by speech synthesizer 102. The speech may comprise a result and/or a further prompt. For example, speaker 111 may output, "Bob's Restaurant is located at 122 Odessa Blvd. Would you like their phone number?"”]
wherein the voice recognition parameter includes at least one of input source information or an electronic apparatus state. [Mozer, Figure 4, 406.  “Input Source Information” of the Claim is mapped to user identification because the User/Speaker is a source of the input voice to the external devices of the Claim.  At 406 context information corresponding to the input command is utilized in executing the command and this context information includes speaker/user identification.  “[0087] Other context information may be the identification of the user or identification of the electronic device. A remote device may use this information to access personal preferences, history, or other personal information located in a database, for example, or location accessible to the remote device in order to optimize the voice recognition process….”  Continuity of identity (same person issuing a command) would translate into continuity of execution.]

Moniz and Mozer both pertain to receiving speech at a device with little or no speech recognition capability and forwarding the speech to a device with greater processing power and recognition capability.  It would have been obvious to modify the one-hop arrangement of Moniz with the two-hop arrangement of Mozer, considering that Moniz mentions having several servers, including some working as intermediaries, and Mozer teaches that its configuration can pertain to a remote control contacting a television set contacting a speech recognition server.  (Mozer:  “[0024] … One embodiment of the present invention includes systems and methods for two or more speech synthesis and/or recognition devices to operate in series. A first synthesizer and recognizer, in a small electronic device, may provide both a first voice user interface and communication with the second, or third, etc., remote speech synthesizers and/or recognizers. In this document, the term "remote" refers to any electronic device that is not in physical contact with (or physically part of) the small electronic device. This communication may be, for example, through a Bluetooth interface, a cell phone network, the Internet, radio frequency (RF) waves, or any other wired or wireless network or some combination thereof or the like….”)  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
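For illustration only, the two-hop arrangement discussed above (external apparatus to intermediary to recognition server) can be sketched as follows.  This is a minimal sketch under stated assumptions, not an implementation taken from Moniz, Mozer, or the instant Application: the names RecognitionServer, Relay, on_start_signal, and on_audio are hypothetical, and the "session" is modeled as a plain dictionary.

```python
class RecognitionServer:
    """Hypothetical recognition server that counts how many sessions it opens."""

    def __init__(self):
        self.sessions_opened = 0

    def open_session(self, parameter):
        self.sessions_opened += 1
        return {"id": self.sessions_opened, "parameter": parameter}

    def recognize(self, session, audio):
        # Placeholder for recognition: echo the audio tagged with the session id.
        return f"result[{session['id']}]:{audio}"


class Relay:
    """Hypothetical intermediary device sitting between remotes and the server."""

    def __init__(self, server):
        self.server = server
        self.session = None

    def on_start_signal(self, apparatus_id, parameter):
        # First start signal: establish the session with the given parameter.
        if self.session is None:
            self.session = self.server.open_session(parameter)
        # Later start signals, from any apparatus: keep the existing session.
        return self.session

    def on_audio(self, apparatus_id, audio):
        if self.session is None:
            raise RuntimeError("no session established")
        # Forward audio over the maintained session, re-using its parameter.
        return self.server.recognize(self.session, audio)


server = RecognitionServer()
relay = Relay(server)
first = relay.on_start_signal("remote-1", {"input_source": "user-1"})
second = relay.on_start_signal("remote-2", {"input_source": "user-1"})
print(first is second, server.sessions_opened)
print(relay.on_audio("remote-2", "play music"))
```

The sketch shows only that a second start signal need not trigger a new session; it says nothing about how either reference handles the underlying transport.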

Regarding Claim 15, Moniz teaches:
15. The method as claimed in claim 11, 
wherein the establishing of the session with the voice recognition server includes establishing a session using information about voice recognition of the electronic apparatus, and [Moniz continues the session and uses the information obtained from the first round of speech recognition, for example, the identity of Barack Obama as the subject of the question for the continued session when the speaker asks the second question from the second device. Col. 4, lines 14-52. See rejection of Claim 1 and the example of anaphora resolution.  See also the use of other types of “information about the voice recognition” in “… The server 120 may then determine (144), using the first speaker ID, the first and/or second device IDs and/or other information (such as the relative locations of devices 110a and 110b, the time between receipt of the first input audio data and second input audio data, or other information), that the word corresponds to the first entity from the first utterance. This may include determining that the first utterance and second utterance are part of the same conversation and thus the anaphora in the second utterance relates to the first utterance. The server 120 may then execute (146) a command corresponding to the second text using the first entity.”  Col. 6, lines 21-51. ]
wherein the maintaining of the established session includes maintaining the information about voice recognition, based on the second voice input start signal being received from the second external apparatus, and maintaining the established session. [Moniz, see the example of Barack Obama as the subject of the question for the continued session when the speaker asks the second question from the second device. Col. 4, lines 14-52.  The system uses the information from the first part, that “the President” refers to Obama, to determine the response to the second part.]

Regarding Claim 16, Moniz teaches:
16. The method as claimed in claim 15, wherein the information about voice recognition includes at least one of: usage terms and conditions, account information, a network status and a voice recognition command list. [Moniz uses a variety of methods for speaker identification:  “One or more techniques may be used by the system to obtain the speaker ID associated with an utterance. In one technique, audio speaker identification may be performed, where audio data corresponding to the utterance may be compared to stored data corresponding to individual speakers. The system can then match the utterance audio data to the stored data (or some other data indicating how an individual speaker sounds in pitch, volume, speech rate, vocabulary, semantic structure, etc.) to determine who spoke the utterance and thus obtain the ID corresponding to that speaker. ….”  Col. 22, line 45 to Col. 23, line 3.  “Other techniques of identifying the user may include use of visual information (for example facial recognition using a camera communicable with the system 100), identifying the user based on a unique passphrase or wakeword uttered by the user, identifying the user based on an email address or other account information linked to the input to the system (which may not necessarily be voiced based) or the like.”  Col. 6, lines 3-10.]

Regarding Claim 18, Moniz teaches:
18. The method as claimed in claim 16, wherein the voice recognition command list includes at least one of: application information used in the electronic apparatus, EPG data of a currently input source, and a command for a function provided by the electronic apparatus. [Moniz teaches voice recognition is performed in order to execute a command:  “… Once the entity is determined, the system may then complete command processing of the utterance using the identified entity.”  Abstract.  Figure 2, “NLU Module 260” detects the “commands” such as “call” in “call Mom” or “play” in “play music” by parsing and tagging the text obtained from speech.  The “electronic apparatus” in this context could be one of the “servers 120” that may be contacted by the Alexa devices for responding to a particular query or one of the components that executes the commands:  “3…. Once the entity is determined, the system may then complete command processing of the utterance using the identified entity.”  “A speech processing system may be configured as a relatively self-contained system where one device captures audio, performs speech processing, and executes a command corresponding to the input speech. Alternatively, a speech processing system may be configured as a distributed system where a number of different devices combine to capture audio of a spoken utterance, perform speech processing, and execute a command corresponding to the utterance. Although the present application describes a distributed system, the teachings of the present application may apply to any system configuration.”   Col. 2, lines 8-18.  “Depending on system configuration, a speech processing system may be capable of executing a number of different commands such as playing music, answering queries using an information source, opening communication connections, sending messages, shopping, etc.”  Col. 2, lines 19-23.  “The output data from the NLU processing (which may include tagged text, commands, etc.) 
may then be sent to a command processor 290, which may be located on a same or separate server 120 as part of system 100. The system 100 may include more than one command processor 290, and the destination command processor 290 may be determined based on the NLU output. For example, if the NLU output includes a command to play music, the destination command processor 290 may be a music playing application, such as one located on device 110 or in a music playing appliance, configured to execute a music playing command. If the NLU output includes a search utterance (e.g., requesting the return of search results), the command processor 290 selected may include a search engine processor, such as one located on a search server, configured to execute a search command and determine search results, which may include output text data to be processed by a TTS engine and output from a device as synthesized speech.”  Col. 16, lines 16-34.]

Claims 12-14 and 2-4 are rejected under 35 U.S.C. 103 as being unpatentable over Moniz and Mozer and further in view of Mehta (U.S. 2017/0180486).
Regarding Claim 12, Moniz teaches:
12. The method as claimed in claim 11, further comprising 
determining whether to maintain the established session before maintaining the established session, 
determining whether a user of the first external apparatus and a user of the second external apparatus are a same user, [Moniz, Figure 1A, 138:  “Determine second audio data is associated with first speaker ID.”  The User ID is the basis of maintaining continuity of the session in Moniz as provided in rejection of Claim 1. ]
maintaining the established session based on the user of the first external apparatus and the user of the second external apparatus being the same user, and [Moniz, Figure 1A, determines whether the speech is coming from the same user and uses the identity of the speaker:   “At a later point in time, a second speech-controlled device 110b may capture audio of a second spoken utterance (i.e., second input audio 11b) from first user 5a. The server 120 may receive (136), from the second device, second audio data corresponding to the second utterance. The server 120 may determine that the second audio data is associated with a second device ID (e.g., an ID associated with device 110b). The server 120 may also determine (138) that the second audio data is associated with the first speaker ID. …”  Col. 6, lines 21-51.  Moniz teaches that determination of continuity of speaker is for the purpose of anaphora resolution which is another way of saying that the same NLU session is continued:  “Certain speech processing systems may be configured such that a user may have access to many different local devices that can capture the user's speech and/or output audio or video data in response to a command. Multiple different local devices may be linked to a single user account, such as a household account, that may include information used to process incoming utterances. For example, a user's home may be configured with many local devices that all communicate to the same back-end platform that performs the ASR, NLU, command execution, etc. In such systems, it may be possible for a conversation to take place between the user and the system using more than one local device. For example, the user may start a conversation while in one room, walk to another room, and desire to continue the conversation. 
In another example, one user may engage in one conversation with one local device in a home, while another user in the same home may engage in a different conversation with a different local device in the same home. For example, one user may be standing in the kitchen talking to a first device in the kitchen and ask "How old is the President?" After an answer is spoken aloud by the first device (e.g., "Barack Obama is fifty-five years old"), the same user walks into the living room and asks a second device in the living room "when was he sworn in?" In still another example, one user may engage in one conversation with one local device in a home and another user may wish to enter the same conversation while in a different room and proximate to a different local device. For example, one user may be standing in the kitchen talking to a first device in the kitchen and ask "How old is the President?" After an answer is spoken aloud by the first device (e.g., "Barack Obama is fifty-five years old"), a second user in the living room may overhear the answer and ask a second device in the living room "when was he sworn in?" In the above examples, in order to properly respond to the second question, the system needs to be configured to understand that the anaphora of the second question refers to the first question and therefore the two questions are part of the same conversation even if originating at different devices.”  Col. 4, lines 14-52.]
blocking the established session and establishing a new session when the user of the first external apparatus and the user of the second external apparatus are not the same user. 

Moniz recognizes when the speaker has changed but does not discuss blocking a continuing session.
In Mozer each user has his own device.

A third reference, Mehta, is cited that covers initiating and maintaining a communication session between a hub device and several participant devices.
Mehta:

[Image media_image25.png: figure from Mehta, not reproduced here.]

Mehta (See Figure 10 and claim 8 of Mehta) teaches:
12. The method as claimed in claim 11, further comprising 
determining whether to maintain the established session before maintaining the established session, [Mehta, Figure 9, which precedes Figure 10, includes an “is the session recovered and responsive? 932” decision step which, if the answer is Yes, proceeds to “continue with session 938.”  See the description of Figure 9 at [0120].]
determining whether a user of the first external apparatus and a user of the second external apparatus are a same user, [Mehta includes an “authentication step” and the users of both devices have to be the same.  It is the same call that is transferred to a new device because the connection with the first device is no longer acceptable.  “[0075] … After the second device is automatically authenticated (via USIN or user login), the EMS continues the communication session with the same information provided before without requiring the second device to establish a new communication session.”  “8…  d) associating, by the server application, a new communication device with the communication session when provided with authentication to join the communication session and detecting when the first communication device is dropped from the communication session;”]
maintaining the established session based on the user of the first external apparatus and the user of the second external apparatus being the same user, and [Mehta, the user remains the same.  See [0073] to [0077].  “[0077] … In further embodiments, a new communication device provides authentication comprising a user login associated with the previous communication device in the communication session.”]
blocking the established session and establishing a new session when the user of the first external apparatus and the user of the second external apparatus are not the same user. [Mehta, This is implied from the teachings.  In Figure 9, if the system of Mehta does not keep the session alive, then the session would be terminated.  Additionally, if the second device is not authenticated, there would be no transfer of the communication session to the new device.  “[0007] Another advantage of the systems, devices, methods, and media described herein is that they enable session persistence through periods of poor communication quality by managing session parameter values to extend a communication session when normal parameter values would result in termination of the communication session….”  “[0012] … In some embodiments, the first communication device is dropped from the emergency communication session due to poor signal, device shutdown, user switching devices, running out of batteries, or any combination thereof. In some embodiments, the new communication device obtains authentication by providing user login information identical to user login information for the first communication device, wherein the first and new communication devices are associated with a user account….”  “[0062] …In some embodiments, a flow evaluation channel detects communication session quality and whether a communication device is dropped or disconnected from a communication session.”]

[Image media_image26.png not reproduced here.]

Moniz, Mozer, and Mehta pertain to communications, and it would have been obvious to combine the system of the combination of Moniz and Mozer, which leaves out the details of establishing communications between devices, with the particular communication-technology-related steps that are present in Mehta for completeness.  Further, the system of Mehta can use the speaker recognition of Moniz as a criterion for dropping one device and connecting to another, considering that Mehta includes speaker authentication.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
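
For illustration only, the session handling recited in Claim 12 can be sketched as follows.  The sketch is not taken from Moniz, Mozer, or Mehta.  The names Session and handle_start_signal are hypothetical, and the user ID is assumed to have been determined already (for example, by speaker identification or authentication).

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional

_session_ids = count(1)  # hypothetical session-ID generator


@dataclass
class Session:
    session_id: int
    user_id: str
    device_id: str
    blocked: bool = False


def handle_start_signal(current: Optional[Session], user_id: str, device_id: str) -> Session:
    """Return the session to use for a voice input start signal from device_id."""
    if current is not None and not current.blocked and current.user_id == user_id:
        # Same user, possibly at a different apparatus: maintain the session.
        current.device_id = device_id
        return current
    if current is not None:
        # Different user (or an already blocked session): block the established session.
        current.blocked = True
    # Establish a new session for this user and apparatus.
    return Session(next(_session_ids), user_id, device_id)


# Usage: the same user moves from one apparatus to another, then a different user speaks.
first = handle_start_signal(None, "user-a", "kitchen")
same = handle_start_signal(first, "user-a", "living-room")
other = handle_start_signal(same, "user-b", "bedroom")
```

In this sketch the second call returns the original session object, while the third call blocks it and returns a newly established session.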

Regarding Claim 13, Moniz teaches:
13. The method as claimed in claim 12, wherein the determining of whether to maintain the established session includes determining whether the user of the first external apparatus and the user of the second external apparatus are the same user by comparing the first voice input start signal received from the first external apparatus with the second voice input start signal received from the second external apparatus. [Moniz uses a variety of methods for speaker identification.  Moniz teaches that one way of determining that the input is from the same user is by user ID, and that user identification may be done by user-specific wakewords.  The wakewords of Moniz are mapped to the first and second “voice input start signals” of the Claim.  “Other techniques of identifying the user may include …, identifying the user based on a unique passphrase or wakeword uttered by the user ….”  Col. 6, lines 2-10.  The wakewords are not subjected to speech recognition but rather are identified by comparing waveforms/signals:  “One or more techniques may be used by the system to obtain the speaker ID associated with an utterance. In one technique, audio speaker identification may be performed, where audio data corresponding to the utterance may be compared to stored data corresponding to individual speakers. ...”  Col. 22, line 45 to Col. 23, line 3.]
Moniz, Figures 4 and 6:

[Images media_image27.png and media_image28.png: Moniz, Figures 4 and 6, not reproduced here.]


Regarding Claim 14, Moniz teaches and suggests:
14. The method as claimed in claim 12, wherein the determining of whether to maintain the established session includes determining whether the user of the first external apparatus and the user of the second external apparatus are the same user by
comparing ID information of the first external apparatus with ID information of the second external apparatus. [Moniz, Figure 4, the “user profile storage 402” which is used for speaker identification includes the “Device ID” associated with the account of a particular speaker.  Additionally, the speaker identification is done based on speaker profile which includes “Device ID.” “…  For illustration, as shown in FIG. 4, the user profile storage 402 may include data regarding the devices associated with particular individual user accounts 404. In an example, the user profile storage 402 is a cloud-based storage. Each user profile 404 may include data such as device identifier (ID) data, speaker identifier (ID) data, voice profiles for users, internet protocol (IP) address data, name of device data, and location of device data for different devices. In addition, while not illustrated, each user profile 404 may include data regarding the locations of individual devices (including how close devices may be to each other in a home, if the device location is associated with a user bedroom, etc.), address data, or other such information….”  Col. 20, line 63 to Col. 21, line 20.]

Claims 2-4 are device claims with limitations similar to the limitations of method Claims 12-14 and are rejected under similar rationale.  

Claims 19-20 and 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Moniz and Mozer and further in view of An (U.S. 20180182396).
Regarding Claim 19, Moniz teaches:
19. The method as claimed in claim 11, further comprising: 
storing first audio data received from the first external apparatus; [Moniz, Figures 8 and 9 show the “Device 110” and “server 120,” respectively including “memory 806/906” and “storage 809/909.” ]
transmitting the first audio data to the voice recognition server using the established session; [Moniz, see rejection of Claim 11.]
combining second audio data received from the second external apparatus with the stored first audio data; and 
transmitting the combined audio data to the voice recognition server. 
Moniz does not teach the details of the storing and the combining for transmission recited in Claim 19.
Mozer does not address this feature.
An teaches:
storing first audio data received from the first external apparatus; [An, Figure 1, different speakers at different devices are providing the speech which is recorded/stored.  “[0005] … a control unit including a voice recording unit configured to record a specific part of an input voice, ….”]
transmitting the first audio data to the voice recognition server using the established session; [An, audio is transmitted in real time and also stored.]
combining second audio data received from the second external apparatus with the stored first audio data; and [An, Figure 1, “speech combiner 400.”  The speech is received from various devices at various times, and the speech segments are buffered and combined.]
transmitting the combined audio data to the voice recognition server. [An, “[0043] Meanwhile, the speech signal detector 100 may combine generated speech sessions according to an order of time points at which inputs of speech recognition signals are started and transmit a combined speech to the speech recognizer 200. For example, when there is a time point at which speech signals input from the plurality of microphones 1 overlap, the speech signal detector 100 may determine a priority thereof according to a time point at which input of each of the speech signals is started, combine the speech signals into a form of a single speech signal by attaching a subsequently input speech signal to an end of a previously input speech signal, and transmit the single speech signal to the speech recognizer 200.”]
Moniz, Mozer, and An pertain to speech recognition and processing and it would have been obvious to combine the express combining of the pieces of audio from different devices before transmission to a speech recognition server from An with the system of the combination to provide for a more complete picture of the received speech command, which is fragmented between the two portions of audio, before processing the combined speech and executing the command.  This combination falls under combining prior art elements according to known methods to yield predictable results or simple substitution of one known element for another to obtain predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.
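
For illustration only, the combining described in the quoted passage of An ([0043]), in which a subsequently input speech signal is attached to the end of a previously input one according to the time point at which each input started, can be sketched as follows.  The function name combine_segments and the representation of a segment as a (start time, audio bytes) pair are assumptions made for this sketch and do not appear in An.

```python
from typing import List, Tuple


def combine_segments(segments: List[Tuple[float, bytes]]) -> bytes:
    """Join audio segments into one signal, ordered by input start time."""
    # Earlier-started segments come first; later ones are attached to the end.
    ordered = sorted(segments, key=lambda segment: segment[0])
    return b"".join(audio for _, audio in ordered)


# Usage: stored first audio data (started at t=1.0) and second audio data
# (started at t=2.0) arrive out of order and are combined before transmission.
combined = combine_segments([(2.0, b"CD"), (1.0, b"AB")])
```

The single combined signal is what would then be transmitted to the speech recognizer.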
An, Figure 1:

[Image media_image29.png: An, Figure 1, not reproduced here.]



Regarding Claim 20, Moniz and Mozer teach the existence of a display.  In Moniz, the “Server 120,” which is the electronic device of Claim 11, does not include a display, but the “Device 110” of Figure 8 includes a “Display 109.”
Regarding Claim 20, An teaches:
20. The method as claimed in claim 11, further comprising: 
displaying information about a progress of voice recognition based on the first voice input start signal being received from the first external apparatus. [An, Figures 4 and 6-8 showing the display which includes the waveform of the speech and also includes the recognized speech and its corresponding speaker.  (Background of An:  “[0005] … wherein the control unit controls the display unit to display the converted text file in the form of time-series dialog information between a plurality of speakers classified on the basis of the speaker information.”)]
Moniz, Mozer, and An pertain to receiving the speech at a device and it would have been obvious to combine the display of information about speech recognition from An with the system of the combination to provide for a visual indicator of the process and results.  This combination falls under combining prior art elements according to known methods to yield predictable results. See MPEP 2141, KSR, 550 U.S. at 418, 82 USPQ2d at 1396.

Claim 9 is a device Claim with limitations similar to the limitations of Claim 19 and is rejected under similar rationale.  See also Moniz, Figure 9, “memory 906.”

Claim 10 is a device Claim with limitations similar to the limitations of Claim 20 and is rejected under similar rationale.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Jang (U.S. 20170264939)
Shoemake (U.S. 20150243163)
Efrati (U.S. 2014/0359139) 
Lee (U.S. 20150379992) 
Retter (U.S. 20180322868):  “A method for operating a server system that includes a plurality of servers for processing a voice command recorded by a recording device connected, via an interface, to the server system includes, in response to the recording of the voice command, reading in a session activation signal from the recording device; checking if there is an association between the session activation signal and a session ID; if it is established that there is the association between the session activation signal and the session ID, ascertaining an availability of a prior server that previously processed a session assigned to the session ID; and activating the session on the prior server if it is available ….”  Abstract.  “[0007] The session activation signal can be provided, for example, at the beginning or shortly after the beginning of a recording of the voice command, more or less in response to the manipulation of a corresponding switch of the recording device, or upon the speaking of a particular keyword for activating a recording function of the recording device.”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARIBA SIRJANI whose telephone number is (571)270-1499.  The examiner can normally be reached on 9 to 5, M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached on 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Fariba Sirjani/
Primary Examiner, Art Unit 2659