DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments with respect to claim(s) 1-4, 10-14 and 16-20 have been considered but are moot because of the new ground of rejection in view of Long, Gopalan and Ziraknejad.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-4, 10-14 and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Long et al. (US Patent 6539084; hereinafter “Long”) in view of Gopalan (US PG Pub 20170076720) and further in view of Ziraknejad et al. (US Patent 10248771; hereinafter “Ziraknejad”).

	As per claims 1, 19 and 20, Long discloses:
	A method, electronic device and non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device (Long; Col. 5, lines 5-28 - The software is loaded into the computer system 105 from the computer readable medium, and then executed by the computer system 105), provide an intercom service, via a first electronic device, a second electronic device, and a third electronic device (Long; Fig. 1, items 101-1 to 101-n; Col. 2, lines 39-43 – The first embodiment is illustrated in FIG. 1 and takes the form of an intercom system 100 including a plurality of intercom units 101-1 to 101-n being directly linked to a processor unit 104 of a computer system 105 via wired connections 102-1 to 102-n, respectively), by performing actions comprising:
	receiving, at the first electronic device, a first speech input including a message and a trigger phrase, wherein the message was spoken by a first speaker and the trigger phrase indicates a request to provide the intercom service (Long; Fig. 3, item 301; Col. 3, lines 38-67 and Col. 4, lines 1-25 - The intercom system 100 is configured so that a user can activate the system 100 by speaking a request phrase);
	in response to receiving the trigger phrase, causing each of the second electronic device and the third electronic device to provide an audible representation of the message (Long; Fig. 3, item 303; Col. 3, lines 38-67 and Col. 4, lines 1-25 - the request phrase is preferably received and recognized by the processor 505 which signals the remaining intercom units to re-broadcast a call phrase containing the name of the person being called);
	receiving, from the second electronic device, a second speech input including a first reply to the message (Long; Col. 3, lines 38-67 and Col. 4, lines 1-25 - The process continues at step 305, where after re-broadcasting the call phrase the remaining intercom units preferably listen for an answering response. At the next step 307, the intercom units 101-1 to 101-n relay any received response to the processor unit 104);
	receiving, from the third electronic device, a third speech input including a second reply to the message (Long; Col. 3, lines 38-67 and Col. 4, lines 1-25 - The process continues at step 305, where after re-broadcasting the call phrase the remaining intercom units preferably listen for an answering response. At the next step 307, the intercom units 101-1 to 101-n relay any received response to the processor unit 104); and
	in response to determining that a first acoustic transmission metric is greater than a second acoustic transmission metric, excluding the third electronic device from providing audible representations of subsequent messages (Long; Col. 3, lines 38-67 and Col. 4, lines 1-25 - the processor 505 decides which of the remaining intercom units can hear the response most clearly by comparing audio signals from the remaining intercom units. The process continues at step 309, where the selected intercom unit is signaled by the processor 505 and a private two-way audio connection is set up between the selected intercom unit and the intercom unit (hereinafter "originating intercom unit") which initially received the spoken request phrase (thereby excluding the other intercom units that were not selected)).
	Long, however, fails to disclose receiving a first acoustic fingerprint including a first acoustic transmission metric associated with a first spatial relationship between the second electronic device and a speaker of the first reply to the message; receiving a second acoustic fingerprint including a second acoustic transmission metric associated with a second spatial relationship between the third electronic device and a speaker of the second reply to the message; and in response to determining that the speaker of the first reply to the message and the speaker of the second reply to the message are a same speaker, determining that the first acoustic transmission metric is greater than the second acoustic transmission metric.
	Gopalan does teach receiving a first acoustic fingerprint including a first acoustic transmission metric associated with a first spatial relationship between the second electronic device and a speaker of the first reply to the message (Gopalan; Fig. 4, item 404; p. 0077 - At 404, one or more audio signal metric values may be received from each voice-enabled device. An audio signal metric value may be for a beamformed audio signal associated with audio input that is received at a voice-enabled device. An audio signal metric value may include a signal-to-noise ratio, a spectral centroid measure, a speech energy level (e.g., a 4 Hz modulation energy), a spectral flux, a particular percentile frequency (e.g., 90th percentile frequency), a periodicity, a clarity, a harmonicity, and so on); receiving a second acoustic fingerprint including a second acoustic transmission metric associated with a second spatial relationship between the third electronic device and a speaker of the second reply to the message (Gopalan; Fig. 4, item 404; p. 0077 - At 404, one or more audio signal metric values may be received from each voice-enabled device. An audio signal metric value may be for a beamformed audio signal associated with audio input that is received at a voice-enabled device. An audio signal metric value may include a signal-to-noise ratio, a spectral centroid measure, a speech energy level (e.g., a 4 Hz modulation energy), a spectral flux, a particular percentile frequency (e.g., 90th percentile frequency), a periodicity, a clarity, a harmonicity, and so on); and in response to determining that the speaker of the first reply to the message and the speaker of the second reply to the message are a same speaker, determining that the first acoustic transmission metric is greater than the second acoustic transmission metric (Gopalan; p. 0018 - the service provider 102 may arbitrate between multiple voice-enabled devices that detect audio input from a same audio source… the voice-enabled device 104(1) may send one or more audio signal metric values 110(1) to the service provider 102, while the voice-enabled device 104(N) may send one or more audio signal metric values 110(M). The service provider 102 may rank the voice-enabled devices 104(1) and 104(N) based on the audio signal metric values, as illustrated at 112 in FIG. 1. The service provider 102 may select a voice-enabled device from the ranking (e.g., a top ranked device)… Meanwhile, the service provider 102 may disregard (or refrain from processing) the audio signal from the non-selected device, the voice-enabled device 104(N)).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method and non-transitory computer-readable storage medium of Long to include receiving a first acoustic fingerprint including a first acoustic transmission metric associated with a first spatial relationship between the second electronic device and a speaker of the first reply to the message; receiving a second acoustic fingerprint including a second acoustic transmission metric associated with a second spatial relationship between the third electronic device and a speaker of the second reply to the message; and in response to determining that the speaker of the first reply to the message and the speaker of the second reply to the message are a same speaker, determining that the first acoustic transmission metric is greater than the second acoustic transmission metric, as taught by Gopalan, in order to analyze a variety of audio signal metric values for the voice-enabled devices to designate a voice-enabled device to handle processing of the audio input, which may enhance the user's experience by avoiding duplicate input processing (Gopalan; p. 0008).
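	For illustration only, the device arbitration described in the cited passages of Gopalan (receiving an audio signal metric value from each device, ranking the devices, and disregarding the non-selected device) might be sketched as follows. This sketch is not taken from Gopalan or Long; the function names, the device labels, and the use of a single signal-to-noise value per device are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def rank_devices(metric_by_device: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort (device, metric) pairs from highest to lowest metric value.

    The metric is assumed here to be a signal-to-noise ratio; the cited
    passage lists several other possible metric values.
    """
    return sorted(metric_by_device.items(), key=lambda item: item[1], reverse=True)


def arbitrate(metric_by_device: Dict[str, float]) -> Tuple[str, List[str]]:
    """Return the top-ranked device and the list of non-selected devices."""
    if not metric_by_device:
        raise ValueError("at least one device metric is required")
    ranking = rank_devices(metric_by_device)
    selected = ranking[0][0]
    excluded = [device for device, _ in ranking[1:]]
    return selected, excluded


# Hypothetical metric values reported by two devices that heard the same reply.
selected, excluded = arbitrate({"second_device": 18.5, "third_device": 9.2})
```

	In this hypothetical run the device reporting the greater metric value is selected and the other device is excluded, which corresponds to the comparison of the first and second acoustic transmission metrics discussed above.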
	Long in view of Gopalan, however, fail to disclose wherein the first acoustic fingerprint includes a first set of one or more embeddings that emphasize speaker specific characteristics of the speaker of the first reply, wherein the second acoustic fingerprint includes a second set of one or more embeddings that emphasize speaker specific characteristics of the speaker of the second reply, and comparing the first acoustic fingerprint to the second acoustic fingerprint to determine whether the speaker of the first reply and the speaker of the second reply are the same.
	Ziraknejad does teach wherein the first acoustic fingerprint includes a first set of one or more embeddings that emphasize speaker specific characteristics of the speaker of the first reply (Ziraknejad; Fig. 11, item 1104; Col. 24, lines 61-67, Col. 25, lines 1-18 – receiving first and second biometric identifiers for performing biometric identification (e.g., a feature vector or a voice print)), wherein the second acoustic fingerprint includes a second set of one or more embeddings that emphasize speaker specific characteristics of the speaker of the second reply (Ziraknejad; Fig. 11, item 1104; Col. 24, lines 61-67, Col. 25, lines 1-18 – receiving first and second biometric identifiers for performing biometric identification (e.g., a feature vector or a voice print)), and comparing the first acoustic fingerprint to the second acoustic fingerprint to determine whether the speaker of the first reply and the speaker of the second reply are the same (Ziraknejad; Fig. 11, item 1108; Col. 25, lines 37-54 - the server 1030 determines a first and a second match score based on a comparison between the first and second biometric identifiers and a first and a second enrollment biometric identifier, respectively).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method and non-transitory computer-readable storage medium of Long to include wherein the first acoustic fingerprint includes a first set of one or more embeddings that emphasize speaker specific characteristics of the speaker of the first reply, wherein the second acoustic fingerprint includes a second set of one or more embeddings that emphasize speaker specific characteristics of the speaker of the second reply, and comparing the first acoustic fingerprint to the second acoustic fingerprint to determine whether the speaker of the first reply and the speaker of the second reply are the same, as taught by Ziraknejad, in order to accurately verify the identity of the user based on an authentication score (Ziraknejad; Col. 1, lines 54-57).

	As per claim 2, Long in view of Gopalan disclose:
	The method of claim 1, upon which claim 2 depends.
	And further, Gopalan discloses determining that the speaker of the first reply and the second speaker of the second reply are the same speaker based on a comparison of the first acoustic fingerprint and the second acoustic fingerprint (Gopalan; p. 0033 - In a further example, the initial processing may select voice-enabled devices that determined audio signals that have a threshold amount of similarity to each other (e.g., indicating that the devices heard the same utterance). An amount of similarity between audio signals may be determined through, for instance, statistical analysis using techniques, such as Kullback-Leibler (KL) distance/divergence, dynamic time warping, intra/inter cluster differences based on Euclidian distance (e.g., intra/inter cluster correlation), and so on).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long to include determining that the speaker of the first reply and the second speaker of the second reply are the same speaker based on a comparison of the first acoustic fingerprint and the second acoustic fingerprint, as taught by Gopalan, in order to analyze a variety of audio signal metric values for the voice-enabled devices to designate a voice-enabled device to handle processing of the audio input, which may enhance the user's experience by avoiding duplicate input processing (Gopalan; p. 0008).

	As per claim 3, Long in view of Gopalan disclose:
	The method of claim 1, upon which claim 3 depends.
	And further, Gopalan teaches generating a first user-device correspondence that is between the first speaker and the first electronic device; and in response to determining that the speaker of the first reply to the message and the speaker of the second reply to the message are a same speaker, generating a second user-device correspondence that is between the same user and the second electronic device and generating a third user-device correspondence that is between the same user and the third electronic device (Gopalan; p. 0031 - the initial processing may select voice-enabled devices that determined audio signals at substantially the same time (e.g., within a window of time). To illustrate, two voice-enabled devices may be selected if the devices each generated an audio signal within a threshold amount of time of each other (e.g., within a same span of time - window of time). The selection may be based on time-stamps for the audio signals. Each time-stamp may indicate a time that the audio signal was generated. If the audio signals are generated close to each other in time, this may indicate, for example, that the devices heard the same utterance from a user; also see p. 0033).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long to include generating a first user-device correspondence that is between the first speaker and the first electronic device; and in response to determining that the speaker of the first reply to the message and the speaker of the second reply to the message are a same speaker, generating a second user-device correspondence that is between the same user and the second electronic device and generating a third user-device correspondence that is between the same user and the third electronic device, as taught by Gopalan, in order to analyze a variety of audio signal metric values for the voice-enabled devices to designate a voice-enabled device to handle processing of the audio input, which may enhance the user's experience by avoiding duplicate input processing (Gopalan; p. 0008).

	As per claim 4, Long in view of Gopalan disclose:
	The method of claim 3, upon which claim 4 depends.
	And further, Gopalan discloses in response to determining that the first acoustic transmission metric is greater than the second acoustic transmission metric, terminating the third user-device correspondence that is between the same user and the third electronic device, while maintaining a second user-device correspondence that is between the same user and the second electronic device (Gopalan; p. 0018 - the service provider 102 may arbitrate between multiple voice-enabled devices that detect audio input from a same audio source… the voice-enabled device 104(1) may send one or more audio signal metric values 110(1) to the service provider 102, while the voice-enabled device 104(N) may send one or more audio signal metric values 110(M). The service provider 102 may rank the voice-enabled devices 104(1) and 104(N) based on the audio signal metric values, as illustrated at 112 in FIG. 1. The service provider 102 may select a voice-enabled device from the ranking (e.g., a top ranked device)… Meanwhile, the service provider 102 may disregard (or refrain from processing) the audio signal from the non-selected device, the voice-enabled device 104(N)).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long to include in response to determining that the first acoustic transmission metric is greater than the second acoustic transmission metric, terminating the third user-device correspondence that is between the same user and the third electronic device, while maintaining a second user-device correspondence that is between the same user and the second electronic device, as taught by Gopalan, in order to analyze a variety of audio signal metric values for the voice-enabled devices to designate a voice-enabled device to handle processing of the audio input, which may enhance the user's experience by avoiding duplicate input processing (Gopalan; p. 0008).

	As per claim 10, Long in view of Gopalan disclose:
	The method of claim 1, wherein the same speaker is a second user that is different from the first user and excluding the third electronic device from providing audible representations of subsequent messages includes excluding the third electronic device from providing audible representations of subsequent messages spoken by the same user (Long; Col. 3, lines 38-67 and Col. 4, lines 1-25 - the processor 505 decides which of the remaining intercom units can hear the response most clearly by comparing audio signals from the remaining intercom units. The process continues at step 309, where the selected intercom unit is signaled by the processor 505 and a private two-way audio connection is set up between the selected intercom unit and the intercom unit (hereinafter "originating intercom unit") which initially received the spoken request phrase (thereby excluding the other intercom units that were not selected)).
	And further, Gopalan teaches enabling the third electronic device from providing other audible representations of additional subsequent messages spoken from a third user that is different from the first user and the second user (Gopalan; p. 0018 - the service provider 102 may arbitrate between multiple voice-enabled devices that detect audio input from a same audio source… the voice-enabled device 104(1) may send one or more audio signal metric values 110(1) to the service provider 102, while the voice-enabled device 104(N) may send one or more audio signal metric values 110(M). The service provider 102 may rank the voice-enabled devices 104(1) and 104(N) based on the audio signal metric values, as illustrated at 112 in FIG. 1. The service provider 102 may select a voice-enabled device from the ranking (e.g., a top ranked device)… Meanwhile, the service provider 102 may disregard (or refrain from processing) the audio signal from the non-selected device, the voice-enabled device 104(N)).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long to include enabling the third electronic device from providing other audible representations of additional subsequent messages spoken from a third user that is different from the first user and the second user, as taught by Gopalan, in order to analyze a variety of audio signal metric values for the voice-enabled devices to designate a voice-enabled device to handle processing of the audio input, which may enhance the user's experience by avoiding duplicate input processing (Gopalan; p. 0008).

	As per claim 11, Long in view of Gopalan disclose:
	The method of claim 10, upon which claim 11 depends.
	And further, Gopalan teaches generating a device-user mapping, wherein the first user is mapped to the first device based on third acoustic fingerprint included the first speech input, the second user is mapped to the second device based on the first acoustic fingerprint, and the third user is mapped to the third device based on a fourth acoustic fingerprint included in a fourth speech input, received at the third device, and the fourth speech input includes a third reply to the message spoken by the third user (Gopalan; p. 0029 - if multiple voice-enabled devices are located within a home, the arbitration module 214 may perform initial processing to identify a sub-set of the multiple devices that may potentially be best for interacting with a user. The arbitration module 214 may perform the initial processing at runtime (e.g., in real-time when an arbitration process is to be performed) and/or beforehand; p. 0077 - At 404, one or more audio signal metric values may be received from each voice-enabled device. An audio signal metric value may be for a beamformed audio signal associated with audio input that is received at a voice-enabled device. An audio signal metric value may include a signal-to-noise ratio, a spectral centroid measure, a speech energy level (e.g., a 4 Hz modulation energy), a spectral flux, a particular percentile frequency (e.g., 90th percentile frequency), a periodicity, a clarity, a harmonicity, and so on).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long to include generating a device-user mapping, wherein the first user is mapped to the first device based on third acoustic fingerprint included the first speech input, the second user is mapped to the second device based on the first acoustic fingerprint, and the third user is mapped to the third device based on a fourth acoustic fingerprint included in a fourth speech input, received at the third device, and the fourth speech input includes a third reply to the message spoken by the third user, as taught by Gopalan, in order to analyze a variety of audio signal metric values for the voice-enabled devices to designate a voice-enabled device to handle processing of the audio input, which may enhance the user's experience by avoiding duplicate input processing (Gopalan; p. 0008).

	As per claim 12, Long in view of Gopalan disclose:
	The method of claim 11, further comprising: causing the first device to provide an audible representation of the first reply to the message; inhibiting the first device from providing an audible representation of the second reply to the message; and causing the first device to provide an audible representation of the third reply to the message (Long; Col. 3, lines 38-67 and Col. 4, lines 1-25 - the processor 505 decides which of the remaining intercom units can hear the response most clearly by comparing audio signals from the remaining intercom units. The process continues at step 309, where the selected intercom unit is signaled by the processor 505 and a private two-way audio connection is set up between the selected intercom unit and the intercom unit (hereinafter "originating intercom unit") which initially received the spoken request phrase (thereby excluding the other intercom units that were not selected)).

	As per claim 13, Long in view of Gopalan disclose:
	The method of claim 11, wherein in the device-user mapping, a fourth user is mapped to the second device based on a fifth acoustic fingerprint included in a fifth speech input, received at the second device, and the fifth speech input includes a fourth reply to the message spoken by the fourth user (Gopalan; p. 0029 - if multiple voice-enabled devices are located within a home, the arbitration module 214 may perform initial processing to identify a sub-set of the multiple devices that may potentially be best for interacting with a user. The arbitration module 214 may perform the initial processing at runtime (e.g., in real-time when an arbitration process is to be performed) and/or beforehand).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long to include generating a device-user mapping, wherein the first user is mapped to the first device based on third acoustic fingerprint included the first speech input, the second user is mapped to the second device based on the first acoustic fingerprint, and the third user is mapped to the third device based on a fourth acoustic fingerprint included in a fourth speech input, received at the third device, and the fourth speech input includes a third reply to the message spoken by the third user, as taught by Gopalan, in order to analyze a variety of audio signal metric values for the voice-enabled devices to designate a voice-enabled device to handle processing of the audio input, which may enhance the user's experience by avoiding duplicate input processing (Gopalan; p. 0008).

	As per claim 14, Long in view of Gopalan disclose:
	The method of claim 11, upon which claim 14 depends.
	And further, Gopalan teaches employing the device-user map to associate additional speech inputs received at the first with the first user; employing the device-user map to associate additional speech inputs received at the second with the second user; and employing the device-user map to associate additional speech inputs received at the second with the third user (Gopalan; p. 0029 - if multiple voice-enabled devices are located within a home, the arbitration module 214 may perform initial processing to identify a sub-set of the multiple devices that may potentially be best for interacting with a user. The arbitration module 214 may perform the initial processing at runtime (e.g., in real-time when an arbitration process is to be performed) and/or beforehand; p. 0077 - At 404, one or more audio signal metric values may be received from each voice-enabled device. An audio signal metric value may be for a beamformed audio signal associated with audio input that is received at a voice-enabled device. An audio signal metric value may include a signal-to-noise ratio, a spectral centroid measure, a speech energy level (e.g., a 4 Hz modulation energy), a spectral flux, a particular percentile frequency (e.g., 90th percentile frequency), a periodicity, a clarity, a harmonicity, and so on).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long to include employing the device-user map to associate additional speech inputs received at the first with the first user; employing the device-user map to associate additional speech inputs received at the second with the second user; and employing the device-user map to associate additional speech inputs received at the second with the third user, as taught by Gopalan, in order to analyze a variety of audio signal metric values for the voice-enabled devices to designate a voice-enabled device to handle processing of the audio input, which may enhance the user's experience by avoiding duplicate input processing (Gopalan; p. 0008).

	As per claim 16, Long in view of Gopalan disclose:
	The method of claim 1, upon which claim 16 depends.
	And further, Gopalan teaches wherein each of the first electronic device, the second electronic device, and the third electronic device is a smart speaker device (Gopalan; p. 0015 - The architecture 100 includes a service provider 102 configured to communicate with a plurality of voice-enabled devices 104(1)-(N) (collectively “the voice-enabled devices 104”) to facilitate various processing).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long to include wherein each of the first electronic device, the second electronic device, and the third electronic device is a smart speaker device, as taught by Gopalan, in order to analyze a variety of audio signal metric values for the voice-enabled devices to designate a voice-enabled device to handle processing of the audio input, which may enhance the user's experience by avoiding duplicate input processing (Gopalan; p. 0008).

	As per claim 17, Long in view of Gopalan disclose:
	The method of claim 1, further comprising: in response to determining that the first acoustic transmission metric is greater than the second acoustic transmission metric, causing the first electronic device to provide an audible representation of the first reply to the message and inhibiting the first electronic device from providing an audible representation of the second reply to the message (Long; Col. 3, lines 38-67 and Col. 4, lines 1-25 - the processor 505 decides which of the remaining intercom units can hear the response most clearly by comparing audio signals from the remaining intercom units. The process continues at step 309, where the selected intercom unit is signaled by the processor 505 and a private two-way audio connection is set up between the selected intercom unit and the intercom unit (hereinafter "originating intercom unit") which initially received the spoken request phrase (thereby excluding the other intercom units that were not selected)).

	As per claim 18, Long in view of Gopalan disclose:
	The method of claim 1, upon which claim 18 depends.
	And further, Gopalan teaches wherein a comparison between the first acoustic transmission metric and the second acoustic transmission metric indicates that the same speaker is closer to the second electronic device than to the third electronic device (Gopalan; p. 0030 - In one example, the initial processing may select voice-enabled devices that are located within a predetermined distance/proximity to each other and/or an audio source… The predetermined distance/proximity may be set to any value, such as an average distance (determined over time) at which a user can be heard by a voice-enabled device when speaking at a particular decibel level).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long to include wherein a comparison between the first acoustic transmission metric and the second acoustic transmission metric indicates that the same speaker is closer to the second electronic device than to the third electronic device, as taught by Gopalan, in order to analyze a variety of audio signal metric values for the voice-enabled devices to designate a voice-enabled device to handle processing of the audio input, which may enhance the user's experience by avoiding duplicate input processing (Gopalan; p. 0008).

	Claims 5-9 are rejected under 35 U.S.C. 103 as being unpatentable over Long in view of Gopalan and Ziraknejad and further in view of Chen (US PG Pub 20220122615).

	As per claim 5, Long in view of Gopalan disclose:
	The method of claim 1, upon which claim 5 depends.
	Long in view of Gopalan, however, fail to disclose wherein the first acoustic fingerprint is a first vector that embeds first acoustic features of the first reply to the message in an acoustic-feature vector space and the second acoustic fingerprint is a second vector that embeds the second reply to the message in an acoustic-feature vector space.
	Chen does teach wherein the first acoustic fingerprint is a first vector that embeds first acoustic features of the first reply to the message in an acoustic-feature vector space and the second acoustic fingerprint is a second vector that embeds the second reply to the message in an acoustic-feature vector space (Chen; p. 0024 - At 130, the speech segments obtained at 120 may be clustered into a plurality of clusters. Through the clustering operation at 130, the speech segments may be merged based on similarity, such that there is a one-to-one correspondence between the resulted clusters and the speakers. Firstly, speaker feature vectors or embedding vectors may be obtained for the speech segments, e.g., i-vectors, x-vectors, etc. Then speech similarity scoring may be performed with the speaker feature vectors among the speech segments; also see p. 0034).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long and Gopalan to include wherein the first acoustic fingerprint is a first vector that embeds first acoustic features of the first reply to the message in an acoustic-feature vector space and the second acoustic fingerprint is a second vector that embeds the second reply to the message in an acoustic-feature vector space, as taught by Chen, in order to determine “who spoke when?” within an audio stream, and time intervals during which each speaker is active (Chen; p. 0001).

	As per claim 6, Long in view of Gopalan and Chen disclose:
	The method of claim 5, upon which claim 6 depends.
	And further, Chen discloses generating one or more fingerprint clusters within the acoustic-feature cluster space, wherein the one or more fingerprint clusters include at least the first acoustic fingerprint, the second acoustic fingerprint, and a third acoustic fingerprint that is a third vector that embeds third acoustic features of the message in the acoustic-feature vector space; and determining that the speaker of the first reply to the message and the speaker of the second reply to the message are a same speaker based on identifying that each of the first acoustic fingerprint and the second acoustic fingerprint are included in a first cluster of the one or more clusters (Chen; p. 0024 - At 130, the speech segments obtained at 120 may be clustered into a plurality of clusters. Through the clustering operation at 130, the speech segments may be merged based on similarity, such that there is a one-to-one correspondence between the resulted clusters and the speakers. Firstly, speaker feature vectors or embedding vectors may be obtained for the speech segments, e.g., i-vectors, x-vectors, etc. Then speech similarity scoring may be performed with the speaker feature vectors among the speech segments; also see p. 0034).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long and Gopalan to include generating one or more fingerprint clusters within the acoustic-feature cluster space, wherein the one or more fingerprint clusters include at least the first acoustic fingerprint, the second acoustic fingerprint, and a third acoustic fingerprint that is a third vector that embeds third acoustic features of the message in the acoustic-feature vector space; and determining that the speaker of the first reply to the message and the speaker of the second reply to the message are a same speaker based on identifying that each of the first acoustic fingerprint and the second acoustic fingerprint are included in a first cluster of the one or more clusters, as taught by Chen, in order to determine “who spoke when?” within an audio stream, and time intervals during which each speaker is active (Chen; p. 0001).

	As per claim 7, Long in view of Gopalan and Chen disclose:
	The method of claim 6, upon which claim 7 depends.
	And further, Chen discloses determining a distance metric that encodes a distance between the first vector and the second vector in the acoustic-feature vector space; and determining that the first acoustic transmission metric is greater than the second acoustic transmission metric based on the distance metric (Chen; p. 0024 - the speech similarity scoring may be based on, e.g., probabilistic linear discriminant analysis (PLDA), Bayesian information criterion (BIC), generalized likelihood ratio (GLR), Kullback-Leibler divergence (KLD), etc. Thereafter, the speech segments may be merged based on similarity scores under a predetermined clustering strategy, e.g., agglomerative hierarchical clustering (AHC), etc. For example, those speech segments having high similarity scores among each other may be merged into a cluster; also see p. 0034).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long and Gopalan to include determining a distance metric that encodes a distance between the first vector and the second vector in the acoustic-feature vector space; and determining that the first acoustic transmission metric is greater than the second acoustic transmission metric based on the distance metric, as taught by Chen, in order to determine “who spoke when?” within an audio stream, and time intervals during which each speaker is active (Chen; p. 0001).

	As per claim 8, Long in view of Gopalan and Chen disclose:
	The method of claim 6, upon which claim 8 depends.
	And further, Chen discloses employing a neural network (NN) to generate the first vector based on the second speech input; employing the NN to generate the second vector based on the second speech input; employing the NN to generate the third vector based on the third speech input; and employing an unsupervised clustering algorithm to generate the one or more fingerprint clusters (Chen; p. 0034 - According to an exemplary process of extracting speaker bottleneck features, for each speech frame in the audio stream, a speaker acoustic feature of the speech frame may be extracted, and then a speaker bottleneck feature of the speech frame may be generated based on the speaker acoustic feature through a neural network, e.g., deep neural network (DNN). The DNN may be trained to classify among a number of N speakers with the loss function to be cross-entropy).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long and Gopalan to include employing a neural network (NN) to generate the first vector based on the second speech input; employing the NN to generate the second vector based on the second speech input; employing the NN to generate the third vector based on the third speech input; and employing an unsupervised clustering algorithm to generate the one or more fingerprint clusters, as taught by Chen, in order to determine “who spoke when?” within an audio stream, and time intervals during which each speaker is active (Chen; p. 0001).

	As per claim 9, Long in view of Gopalan and Chen disclose:
	The method of claim 8, upon which claim 9 depends.
	And further, Chen discloses wherein a supervised deep learning (DL) algorithm was employed to train the NN (Chen; p. 0034 - According to an exemplary process of extracting speaker bottleneck features, for each speech frame in the audio stream, a speaker acoustic feature of the speech frame may be extracted, and then a speaker bottleneck feature of the speech frame may be generated based on the speaker acoustic feature through a neural network, e.g., deep neural network (DNN). The DNN may be trained to classify among a number of N speakers with the loss function to be cross-entropy).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long and Gopalan to include wherein a supervised deep learning (DL) algorithm was employed to train the NN, as taught by Chen, in order to determine “who spoke when?” within an audio stream, and time intervals during which each speaker is active (Chen; p. 0001).

	Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Long in view of Gopalan and Ziraknejad and further in view of Kummer (US PG Pub 20150162006).

	As per claim 15, Long in view of Gopalan disclose:
	The method of claim 10, upon which claim 15 depends.
	Long in view of Gopalan, however, fail to disclose receiving, at the third electronic device, a fifth speech input; in response to determining that the fifth speech input was spoken by the second user, not providing the fifth speech input to the first device; receiving, at the third electronic device, a sixth speech input; and in response to determining that the sixth speech input was spoken by the third user, providing the sixth speech input to the first electronic device.
	Kummer does teach receiving, at the third electronic device, a fifth speech input; in response to determining that the fifth speech input was spoken by the second user, not providing the fifth speech input to the first device; receiving, at the third electronic device, a sixth speech input; and in response to determining that the sixth speech input was spoken by the third user, providing the sixth speech input to the first electronic device (Kummer; p. 0118 - Referring again to FIG. 4, the method 400 may include determining a permission of the speaker to control the identified device(s) 408. In some cases, the voice command engine 370 may determine a permission status to control the identified device(s). The permission status may be based on the determined speaker identity and/or the identified device(s) to control).
	Therefore, it would have been obvious to one of ordinary skill in the art to modify the method of Long and Gopalan to include receiving, at the third electronic device, a fifth speech input; in response to determining that the fifth speech input was spoken by the second user, not providing the fifth speech input to the first device; receiving, at the third electronic device, a sixth speech input; and in response to determining that the sixth speech input was spoken by the third user, providing the sixth speech input to the first electronic device, as taught by Kummer, in order to enhance safety and security of the home automation system by having the voice command engine prohibit otherwise undesirable controls from being implemented (Kummer; p. 0118).

	Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The prior art made of record and not relied upon includes:
Feuz (US PG Pub 20190079724) which discloses techniques related to improved intercom-style communication using a plurality of computing devices distributed about an environment. In various implementations, voice input may be received, e.g., at a microphone of a first computing device of multiple computing devices, from a first user. The voice input may be analyzed and, based on the analyzing, it may be determined that the first user intends to convey a message to a second user. A location of the second user relative to the multiple computing devices may be determined, so that, based on the location of the second user, a second computing device may be selected from the multiple computing devices that is capable of providing audio or visual output that is perceptible to the second user. The second computing device may then be operated to provide audio or visual output that conveys the message to the second user (Feuz; Abstract).
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Rodrigo A Chavez whose telephone number is (571)270-0139. The examiner can normally be reached Monday - Friday 9-6 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/RODRIGO A CHAVEZ/
Examiner, Art Unit 2658

/RICHEMOND DORVIL/
Supervisory Patent Examiner, Art Unit 2658