DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
In response to the Office Action mailed 5/10/2022, applicant has submitted an amendment filed 8/11/2022.
Claims 1, 10, 11, 12, and 19 have been amended.
Response to Arguments
Applicant argues that “A-G explicitly teaches that when a device identifies, with a received model, the user of the model, the device takes no action” and “Thus, if A-G allegedly performs the claimed ‘identifying…’ step, A-G cannot perform the claimed ‘initiating…’ and ‘providing…’ steps” (Amendment, page 13).
Applicant’s statements are true assuming that the initiating of a task based on the second speech input and the providing of a result based on the initiated task follow the identification of the speaker of the second speech input as the user.  As claimed, however, there is no requirement that the “initiating…” and “providing…” steps follow the “identifying…” step.
More specifically:
Listing a series of steps in a sequence does not mean that the series of steps must be performed in that sequence, and the language of the steps does not imply that the “initiating…” and “providing…” steps necessarily follow the “identifying…” step.  By contrast, a limitation such as “initiating a task based on the identifying, with the adjusted user-specific acoustic model, of the speaker of the second speech input as the user” would necessarily follow the identifying because the initiating could not possibly be done until after the speaker has been identified as the user.  As currently claimed, “initiating a task based on the second speech input” and “providing a result based on the initiated task” do not need to follow the “identifying…” step because no relationship to the “identifying…” step is claimed.
The steps listed as performed at the another electronic device are a series of steps that the another electronic device must perform, and the claim is met as long as the prior art performs steps that read on the claimed steps (even if the steps are performed in a different order).  Put another way, if the claim recites that the another electronic device performs steps A, B, C, D, and E (with no particular order for C, D, and E implied by the claim language), then if the prior art teaches/suggests the sequence A, B, D, E, C, this still meets the claim limitations because the sequence A, B, D, E, C means that the device performs steps A, B, C, D, and E.
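For illustration only, the reasoning above can be sketched as a check that every claimed step appears somewhere in the prior art sequence, with ordering enforced only where the claim language requires it.  The function name performs_all_steps and the required_order parameter are hypothetical and appear in neither the claims nor the references; this is a sketch of the logic, not a statement of the legal test.

```python
def performs_all_steps(claimed_steps, observed_sequence, required_order=()):
    """Return True if every claimed step occurs in observed_sequence and
    every (earlier, later) pair in required_order is respected.

    required_order is a hypothetical way to express orderings that the
    claim language itself imposes; steps not named there may occur in
    any order.
    """
    # Record the first position at which each step is observed.
    positions = {}
    for index, step in enumerate(observed_sequence):
        positions.setdefault(step, index)

    # Every claimed step must be performed at least once.
    if any(step not in positions for step in claimed_steps):
        return False

    # Only orderings actually required by the claim are checked.
    return all(positions[a] < positions[b] for a, b in required_order)


claimed = ["A", "B", "C", "D", "E"]
prior_art = ["A", "B", "D", "E", "C"]
meets_claim = performs_all_steps(claimed, prior_art)
```

If the claim did tie step D to the outcome of step C, passing required_order=[("C", "D")] would cause the same prior art sequence to fail the check, which mirrors the distinction drawn above.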
In this case, A-G describes where a user D (not the user of the device A) can speak an utterance and device A’s microphone can capture a signal representing the utterance (i.e. “receive a… speech input from a speaker”) and encode the microphone-captured utterance audio signal in an audio signal (paragraph 41) and also where a microphone may record an utterance and provide the recording to a feature extraction module that generates the audio signal which the speaker verification module uses to generate a first score (i.e. a first score that indicates a likelihood that the utterance was spoken by a first user).  The initiating of a feature extraction that generates an encoded version of a microphone-captured/recorded utterance can be interpreted as “initiating a task based on the second speech input” (initiating a feature extraction task based on the second speech input which is recorded/captured by a microphone and which is encoded so that speaker verification can be performed) and providing the encoded utterance audio signal to the speaker verification module can be interpreted as “providing a result based on the initiated task” (i.e. providing a result of the feature extraction to the speaker verification module based on the initiated feature extraction).  The “identifying…” (i.e. the speaker verification) is performed based on the “result” (the audio signal), thereby “identifying, with the adjusted user-specific acoustic model, the speaker of the second speech input as the user” (i.e. identifying the encoded utterance audio signal as being spoken by the user B, thereby identifying user B as the speaker of the microphone-captured utterance).
Therefore, the previous prior art references still suggest the independent claims, as amended, and so adjusted prior art rejections (adjusted as needed based on the amendments to the claims) based on the same references are provided below.
Claim Interpretation
	“the electronic device” in line 3 of claim 19 is interpreted as referring to “an electronic device” in line 3 of claim 11, and not to “[the] another electronic device”.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 5, 6, 7, 10, 11, 12, 16, 17, 18, 24, 25, and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Alvarez Guevara et al. (US 2016/0019896), hereafter A-G, in view of Kim et al. (US 2016/0093304), hereafter Kim.

As per Claims 1 and 11-12, A-G suggests (along with its medium and device equivalents) A method, comprising: at an electronic device having one or more processors:… providing… user-specific acoustic model to another electronic device; and at the another electronic device: receiving the… user-specific acoustic model; receiving a second speech input from a speaker; identifying, with the… user-specific acoustic model, the speaker of the second speech input as… user; initiating a task based on the second speech input; and providing a result based on the initiated task (paragraphs 39-41, 47-49, 52-53, 55-57, 73-74; [all paragraphs and Figures are cited for each limitation with “key” paragraphs and Figures pertaining to each limitation identified below, i.e. all other paragraphs and Figures not specifically referenced for any particular limitation are eligible to provide context and additional support]
“A method, comprising: at an electronic device having one or more processors:… providing… to another electronic device;”: paragraphs 47-49, 55; user device B [“an electronic device” that is suggested to have “one or more processors” since it is a device] provides speaker model B to user device A [“another electronic device”] [see paragraph 55], where the speaker model B is used to determine whether an utterance received by user device A is spoken by user B [see paragraphs 48, 49] where a speaker model may be generated by any appropriate method including training/registration [paragraph 47].  Speaker identification information, such as a speaker model, can be interpreted as a “user-specific acoustic model” in the sense that it “models” the “acoustics” of a “specific” “user’s” voice.
“and at the another electronic device: receiving the…; receiving a second speech input from a speaker; identifying, with the… the speaker of the second speech input as… user”: paragraphs 41, 47-49, 52-53, 55-57, 73-74; user device A [the “another electronic device”] receives speaker model B [“the…” speaker identification information], receives/records/captures, via a microphone [see paragraphs 41 and 73-74], an audio signal/utterance [“speech input” which is “second” relative to some other speech input such as a registration phrase described in paragraph 47] and identifies speaker model B as the model with a highest score, thereby identifying user B [“user”] as the speaker of the audio signal/utterance [paragraphs 48-49, 52-53, 55-57]
“initiating a task based on the second speech input; and providing a result based on the initiated task”: paragraphs 41, 47-49, 52-53, 55-57, 73-74; “initiating a” feature extraction “task” based on the microphone-captured/recorded audio signal/utterance being received, and providing a feature extraction result [i.e. an encoded version of the microphone-captured/recorded utterance which is produced by the “initiated” feature extraction “task”] to the speaker verification process that identifies the speaker of the microphone-captured/recorded utterance as user B)
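For illustration only, the mapping relied upon above (a feature extraction task initiated on the captured utterance, the extraction result provided to speaker verification, and the speaker model with the highest score determining the identified user) can be sketched as follows.  The L2 normalization used as “feature extraction” and the dot-product “score” are assumptions chosen to keep the sketch runnable; they are not taken from A-G, and the names extract_features, score, and identify_speaker are hypothetical.

```python
import math


def extract_features(samples):
    """Hypothetical feature extraction: L2-normalize the raw samples."""
    norm = math.sqrt(sum(s * s for s in samples))
    return [s / norm for s in samples] if norm else list(samples)


def score(features, speaker_model):
    """Hypothetical likelihood score: dot product with a model vector."""
    return sum(f * m for f, m in zip(features, speaker_model))


def identify_speaker(samples, speaker_models):
    # Task initiated on the speech input: feature extraction.
    features = extract_features(samples)
    # Result provided to verification: one score per available speaker model.
    scores = {user: score(features, model)
              for user, model in speaker_models.items()}
    # The model with the highest score determines the identified speaker.
    return max(scores, key=scores.get), scores


# Assumed toy models held by user device A, including the received model B.
speaker_models = {"user_a": [1.0, 0.0], "user_b": [0.0, 1.0]}
identified, scores = identify_speaker([0.1, 0.9], speaker_models)
```

In this sketch the extraction necessarily runs before the scoring, which corresponds to the order D, E, C discussed in the Response to Arguments.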
A-G does not, but Kim suggests A method, comprising: at an electronic device having one or more processors: initiating a user-specific acoustic model on the electronic device; receiving a plurality of speech inputs including a first speech input, each of the plurality of speech inputs associated with a user of the electronic device; adjusting the user-specific acoustic model based on the plurality of speech inputs; providing the adjusted user-specific acoustic model to another electronic device; and at the another electronic device: receiving the adjusted user-specific acoustic model; receiving a second speech input from a speaker; identifying, with the adjusted user-specific acoustic model, the speaker of the second speech input as the user; initiating a task based on the second speech input; and providing a result based on the initiated task (Figures 3-5; paragraphs 42, 45-47, 49-56; 
Kim describes where a speaker profile [speaker identification data which can be interpreted as a “user-specific acoustic model”] used to verify that a speaker of an utterance is a particular user is updated with newly received speech from a user’s natural interaction with a virtual assistant in order to adapt to changes in the user’s voice over time [paragraph 55], where plural changes suggests that multiple updates of the speaker profile occur such that the most recent updated speaker profile can be interpreted as an “adjusted user-specific acoustic model” [an adjusted/updated “model” of the “acoustics” of the voice of a “specific” “user”], where adapting to changes in the user’s voice over time suggests where multiple speech inputs [“a plurality of speech inputs”] are used to update the speaker profile [“adjust the user-specific acoustic model”] over a period of time.  Paragraph 45 describes where a speaker profile can be built using “utterances of the trigger phrase” and where generating a speaker profile is done using speech from the user’s natural interaction with the virtual assistant.  Paragraph 47 describes where speaker identification is performed using a speaker profile by comparing audio input to “voice prints of the speaker profile”.  Paragraph 49 describes where an identity of a speaker is a user associated with a speaker profile [among speaker profiles for multiple users] that most closely matches the audio input [similar to A-G].  
Paragraphs 50-51 describe where identifying the user represented by a speaker profile as the speaker of an audio input leads to the audio input being added to the speaker profile [suggesting where “voice prints of the speaker profile” in paragraph 47 are previous audio inputs from the speaker profile’s corresponding user], where the adding of the most recently received audio input replaces the oldest received audio input in the speaker profile [suggesting that the adding of the most recently received audio input “updates”/“adjusts” the speaker profile to reflect any changes to the speaker profile user’s voice] and where, after completing block 506, the user device returns to block 302 [suggesting that, over time, as a result of running the same audio input processing over and over again, the speaker profile is updated/“adjusted” using multiple/“a plurality of” audio/“speech” inputs]
Kim suggests “A method, comprising: at an electronic device having one or more processors: initiating a user-specific acoustic model on the electronic device; receiving a plurality of speech inputs including a first speech input, each of the plurality of speech inputs associated with a user of the electronic device; adjusting the user-specific acoustic model based on the plurality of speech inputs; providing the adjusted user-specific acoustic model to another electronic device; and wherein at the another electronic device: receiving the adjusted user-specific acoustic model; receiving a second speech input from a speaker; and identifying, with the adjusted user-specific acoustic model, the speaker of the second speech input as the user”: where the speaker models of A-G are, instead, speaker profiles made of audio inputs that are determined to be spoken by a respective speaker profile’s corresponding user, where the speaker profile “models” the “acoustics” of a “specific” “user’s” voice, where the user device B [the “electronic device”] “initiates” the speaker profile for user B [“user-specific acoustic model”] in order to perform speaker identification on audio inputs, “receives”, over time, “a plurality of” audio-inputs/“speech inputs” spoken by user B [i.e. the audio inputs are “associated with a user of the electronic device”], identifies the audio inputs spoken by user B as audio inputs spoken by user B, adjusts/updates the speaker profile for user B by adding the audio/speech inputs identified as spoken by user B to the speaker profile [“adjusting the user-specific acoustic model based on the plurality of speech inputs”], and where the speaker profile for user B is the “adjusted user-specific acoustic model” which is provided to the user device A [the “another electronic device”] so that the user device A can use the speaker profile for user B to identify whether an audio input received by user device A [“second speech input” relative to the audio inputs received by user device B which are used to update the speaker profile for user B] is spoken by user B, and where user B is identified as the speaker of the audio input received by user device A)
	Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of speaker identification data with another because the prior art teaches the claimed invention except for the substitution of speaker identification data which is not necessarily updated based on a plurality of speech inputs identified, by speaker recognition, as spoken by a particular user with speaker identification data which is.  Kim teaches that speaker identification data which is updated based on a plurality of speech inputs identified, by speaker recognition, as spoken by a particular user was known in the art.  One of ordinary skill in the art could have substituted one type of speaker identification data with another to obtain the predictable results of a system where a user device B sends speaker identification data B to a user device A so that the user device A can determine whether a speech input was spoken by a user B corresponding to the speaker identification B (as per A-G) where the speaker identification data B is a speaker profile for user B that is updated, by the user device B, by adding audio inputs identified as spoken by user B to the speaker profile for user B (as per Kim).
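For illustration only, the profile behavior attributed to Kim above (a newly identified audio input is added to the matching speaker profile and replaces the oldest stored input, and a speaker is identified as the user whose profile most closely matches) can be sketched with a bounded queue.  The class name SpeakerProfile, the max_inputs limit, the scalar stand-ins for audio inputs, and the absolute-difference distance are all assumptions made for the sketch and are not taken from Kim.

```python
from collections import deque


class SpeakerProfile:
    """Hypothetical speaker profile holding a fixed number of voice prints."""

    def __init__(self, user, max_inputs=3):
        self.user = user
        # A full deque with maxlen discards its oldest entry on append.
        self.voice_prints = deque(maxlen=max_inputs)

    def add_input(self, audio_input):
        # Adding the newest input adjusts the profile over time.
        self.voice_prints.append(audio_input)


def closest_profile(audio_input, profiles, distance):
    """Return the non-empty profile whose voice prints best match the input."""
    candidates = [p for p in profiles if p.voice_prints]
    return min(
        candidates,
        key=lambda p: min(distance(audio_input, v) for v in p.voice_prints),
    )


def scalar_distance(a, b):
    # Assumed toy distance between scalar stand-ins for audio inputs.
    return abs(a - b)


profile_a = SpeakerProfile("user_a")
profile_a.add_input(5.0)

profile_b = SpeakerProfile("user_b")
for utterance in (1.0, 1.1, 1.2, 1.3):
    profile_b.add_input(utterance)  # the fourth input evicts 1.0

match = closest_profile(1.25, [profile_a, profile_b], scalar_distance)
```

In the combination proposed above, the adjusted profile_b would be the data that user device B provides to user device A in place of the speaker model of A-G.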
	
As per Claims 5, 16, and 24, A-G does not, but Kim suggests (along with its medium and device equivalents) wherein receiving the plurality of speech inputs comprises: receiving one or more speech inputs of the plurality of speech inputs at the electronic device (Figures 3-5; paragraphs 42, 45-47, 49-56; 
Kim describes where a speaker profile [speaker identification data which can be interpreted as a “user-specific acoustic model”] used to verify that a speaker of an utterance is a particular user is updated with newly received speech from a user’s natural interaction with a virtual assistant in order to adapt to changes in the user’s voice over time [paragraph 55], where plural changes suggests that multiple updates of the speaker profile occur such that the most recent updated speaker profile can be interpreted as an “adjusted user-specific acoustic model” [an adjusted/updated “model” of the “acoustics” of the voice of a “specific” “user”], where adapting to changes in the user’s voice over time suggests where multiple speech inputs [“a plurality of speech inputs”] are used to update the speaker profile [“adjust the user-specific acoustic model”] over a period of time.  Paragraph 45 describes where a speaker profile can be built using “utterances of the trigger phrase” and where generating a speaker profile is done using speech from the user’s natural interaction with the virtual assistant.  Paragraph 47 describes where speaker identification is performed using a speaker profile by comparing audio input to “voice prints of the speaker profile”.  Paragraph 49 describes where an identity of a speaker is a user associated with a speaker profile [among speaker profiles for multiple users] that most closely matches the audio input [similar to A-G].  
Paragraphs 50-51 describe where identifying the user represented by a speaker profile as the speaker of an audio input leads to the audio input being added to the speaker profile [suggesting where “voice prints of the speaker profile” in paragraph 47 are previous audio inputs from the speaker profile’s corresponding user], where the adding of the most recently received audio input replaces the oldest received audio input in the speaker profile [suggesting that the adding of the most recently received audio input “updates”/“adjusts” the speaker profile to reflect any changes to the speaker profile user’s voice] and where, after completing block 506, the user device returns to block 302 [suggesting that, over time, as a result of running the same audio input processing over and over again, the speaker profile is updated/“adjusted” using multiple/“a plurality of” audio/“speech” inputs]
Kim suggests “wherein receiving the plurality of speech inputs comprises: receiving one or more speech inputs of the plurality of speech inputs at the electronic device”: where the speaker models of A-G are, instead, speaker profiles made of audio inputs that are determined to be spoken by a respective speaker profile’s corresponding user, where the speaker profile “models” the “acoustics” of a “specific” “user’s” voice, where the user device B [the “electronic device”] “receives” [“at the electronic device” user device B], over time, “a plurality of” audio-inputs/“speech inputs” [any one or more of which can be the “one or more speech inputs” of claims 5-7] spoken by user B [i.e. the audio inputs are “associated with a user of the electronic device”])
	Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of speaker identification data with another because the prior art teaches the claimed invention except for the substitution of speaker identification data which is not necessarily updated based on a plurality of speech inputs identified, by speaker recognition, as spoken by a particular user with speaker identification data which is.  Kim teaches that speaker identification data which is updated based on a plurality of speech inputs identified, by speaker recognition, as spoken by a particular user was known in the art.  One of ordinary skill in the art could have substituted one type of speaker identification data with another to obtain the predictable results of a system where a user device B sends speaker identification data B to a user device A so that the user device A can determine whether a speech input was spoken by a user B corresponding to the speaker identification B (as per A-G) where the speaker identification data B is a speaker profile for user B that is updated, by the user device B, by adding audio inputs identified as spoken by user B to the speaker profile for user B (as per Kim).

	As per Claims 6, 17, and 25, A-G does not, but Kim suggests (along with its medium and device equivalents) wherein receiving the one or more speech inputs of the plurality of speech inputs at the electronic device comprises: obtaining the one or more speech inputs of the plurality of speech inputs from a user utterance corresponding to a phone call (Figures 3-5; paragraphs 16-17, 37, 42, 45-47, 49-56;
Kim describes where a speaker profile [speaker identification data which can be interpreted as a “user-specific acoustic model”] used to verify that a speaker of an utterance is a particular user is updated with newly received speech from a user’s natural interaction with a virtual assistant in order to adapt to changes in the user’s voice over time [paragraph 55], where plural changes suggests that multiple updates of the speaker profile occur such that the most recent updated speaker profile can be interpreted as an “adjusted user-specific acoustic model” [an adjusted/updated “model” of the “acoustics” of the voice of a “specific” “user”], where adapting to changes in the user’s voice over time suggests where multiple speech inputs [“a plurality of speech inputs”] are used to update the speaker profile [“adjust the user-specific acoustic model”] over a period of time.  Paragraph 45 describes where a speaker profile can be built using “utterances of the trigger phrase” and where generating a speaker profile is done using speech from the user’s natural interaction with the virtual assistant.  
Paragraph 37 of Kim further suggests an example of an utterance that includes/contains a trigger phrase [where the trigger phrase is an utterance that can be used to build a speaker profile as per paragraph 45] followed by a command “Call Mom”.
Kim thus suggests “wherein receiving the one or more speech inputs of the plurality of speech inputs at the electronic device comprises: obtaining the one or more speech inputs of the plurality of speech inputs from a user utterance corresponding to a phone call”: where receiving, at user device B, one of the speech inputs that are used to update the speaker profile for user B includes obtaining one of the speech inputs [e.g. the trigger phrase component of a trigger phrase-command audio input] from a “Hey Siri, Call Mom” trigger phrase-command utterance, where “Hey Siri, Call Mom” “correspond[s] to a phone call” in the sense that it is used to initiate a phone call to Mom.)
Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of speaker identification data with another because the prior art teaches the claimed invention except for the substitution of speaker identification data which is not necessarily updated based on a plurality of speech inputs (where one of the speech inputs is part of an utterance of “Hey Siri, Call Mom”) identified, by speaker recognition, as spoken by a particular user with speaker identification data which is.  Kim teaches that speaker identification data which is updated based on a plurality of speech inputs (where one of the speech inputs is part of an utterance of “Hey Siri, Call Mom”) identified, by speaker recognition, as spoken by a particular user was known in the art.  One of ordinary skill in the art could have substituted one type of speaker identification data with another to obtain the predictable results of a system where a user device B sends speaker identification data B to a user device A so that the user device A can determine whether a speech input was spoken by a user B corresponding to the speaker identification B (as per A-G) where the speaker identification data B is a speaker profile for user B that is updated, by the user device B, by adding audio inputs (where one of the audio inputs is part of an utterance of “Hey Siri, Call Mom”) identified as spoken by user B to the speaker profile for user B (as per Kim).

As per Claims 7, 18, and 26, A-G does not, but Kim suggests (along with its medium and device equivalents) wherein receiving the one or more speech inputs of the plurality of speech inputs at the electronic device comprises: obtaining the one or more speech inputs of the plurality of speech inputs from a user utterance corresponding to a request for a digital assistant (Figures 3-5; paragraphs 16-17, 37, 42, 45-47, 49-56;
Kim describes where a speaker profile [speaker identification data which can be interpreted as a “user-specific acoustic model”] used to verify that a speaker of an utterance is a particular user is updated with newly received speech from a user’s natural interaction with a virtual assistant in order to adapt to changes in the user’s voice over time [paragraph 55], where plural changes suggests that multiple updates of the speaker profile occur such that the most recent updated speaker profile can be interpreted as an “adjusted user-specific acoustic model” [an adjusted/updated “model” of the “acoustics” of the voice of a “specific” “user”], where adapting to changes in the user’s voice over time suggests where multiple speech inputs [“a plurality of speech inputs”] are used to update the speaker profile [“adjust the user-specific acoustic model”] over a period of time.  Paragraph 45 describes where a speaker profile can be built using “utterances of the trigger phrase” and where generating a speaker profile is done using speech from the user’s natural interaction with the virtual assistant.  
Paragraph 37 of Kim further suggests an example of an utterance that includes/contains a trigger phrase [where the trigger phrase is an utterance that can be used to build a speaker profile as per paragraph 45] followed by a command “Call Mom”, where “Hey Siri, Call Mom” is suggested to be a request for a virtual assistant to initiate a phone call to Mom.  Paragraphs 16-17 describe where virtual assistant and digital assistant can refer to the same thing.
Kim thus suggests “wherein receiving the one or more speech inputs of the plurality of speech inputs at the electronic device comprises: obtaining the one or more speech inputs of the plurality of speech inputs from a user utterance corresponding to a request for a digital assistant”: where receiving, at user device B, one of the speech inputs that are used to update the speaker profile for user B includes obtaining one of the speech inputs [e.g. the trigger phrase component of a trigger phrase-command audio input] from a “Hey Siri, Call Mom” trigger phrase-command utterance, where “Hey Siri, Call Mom” “correspond[s] to a request for a digital assistant” in the sense that it is used to request a virtual assistant [“digital assistant”] to initiate a phone call to Mom.)
Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of speaker identification data with another because the prior art teaches the claimed invention except for the substitution of speaker identification data which is not necessarily updated based on a plurality of speech inputs (where one of the speech inputs is part of an utterance of “Hey Siri, Call Mom”) identified, by speaker recognition, as spoken by a particular user with speaker identification data which is.  Kim teaches that speaker identification data which is updated based on a plurality of speech inputs (where one of the speech inputs is part of an utterance of “Hey Siri, Call Mom”) identified, by speaker recognition, as spoken by a particular user was known in the art.  One of ordinary skill in the art could have substituted one type of speaker identification data with another to obtain the predictable results of a system where a user device B sends speaker identification data B to a user device A so that the user device A can determine whether a speech input was spoken by a user B corresponding to the speaker identification B (as per A-G) where the speaker identification data B is a speaker profile for user B that is updated, by the user device B, by adding audio inputs (where one of the audio inputs is part of an utterance of “Hey Siri, Call Mom”) identified as spoken by user B to the speaker profile for user B (as per Kim).

	As per Claim 10, A-G suggests at the another electronic device, failing to identify, with the… , a second speaker of a third speech input as the user (paragraphs 39-41; 47-49, 52-53, 55-57; 73-74; user device A [the “another electronic device”] receives speaker model B [“the…” speaker identification information], receives an audio signal/utterance [“speech input” which is “third” relative to some other speech input such as a registration phrase described in paragraph 47] and identifies another speaker model other than speaker model B as the model with a highest score, thereby failing to identify user B [“user”] as the speaker of the “third” audio signal/utterance [paragraphs 48-49, 52-53, 55-57] because the user corresponding to the another speaker model is identified as the speaker of the audio signal/utterance.  Paragraphs 39-41 suggest where multiple utterances are spoken by respective different speakers.).
A-G does not, but Kim suggests at the another electronic device, failing to identify, with the adjusted user-specific acoustic model, the speaker of the second speech input as the user (Figures 3-5; paragraphs 42, 45-47, 49-56; 
Kim describes where a speaker profile [speaker identification data which can be interpreted as a “user-specific acoustic model”] used to verify that a speaker of an utterance is a particular user is updated with newly received speech from a user’s natural interaction with a virtual assistant in order to adapt to changes in the user’s voice over time [paragraph 55], where plural changes suggests that multiple updates of the speaker profile occur such that the most recent updated speaker profile can be interpreted as an “adjusted user-specific acoustic model” [an adjusted/updated “model” of the “acoustics” of the voice of a “specific” “user”], where adapting to changes in the user’s voice over time suggests where multiple speech inputs [“a plurality of speech inputs”] are used to update the speaker profile [“adjust the user-specific acoustic model”] over a period of time.  Paragraph 45 describes where a speaker profile can be built using “utterances of the trigger phrase” and where generating a speaker profile is done using speech from the user’s natural interaction with the virtual assistant.  Paragraph 47 describes where speaker identification is performed using a speaker profile by comparing audio input to “voice prints of the speaker profile”.  Paragraph 49 describes where an identity of a speaker is a user associated with a speaker profile [among speaker profiles for multiple users] that most closely matches the audio input [similar to A-G].  
Paragraphs 50-51 describe where identifying the user represented by a speaker profile as the speaker of an audio input leads to the audio input being added to the speaker profile [suggesting where “voice prints of the speaker profile” in paragraph 47 are previous audio inputs from the speaker profile’s corresponding user], where the adding of the most recently received audio input replaces the oldest received audio input in the speaker profile [suggesting that the adding of the most recently received audio input “updates”/“adjusts” the speaker profile to reflect any changes to the speaker profile user’s voice] and where, after completing block 506, the user device returns to block 302 [suggesting that, over time, as a result of running the same audio input processing over and over again, the speaker profile is updated/“adjusted” using multiple/“a plurality of” audio/“speech” inputs]
Kim suggests “at the another electronic device, failing to identify, with the adjusted user-specific acoustic model, the speaker of the second speech input as the user”: where the speaker models of A-G are, instead, speaker profiles made of audio inputs that are determined to be spoken by a respective speaker profile’s corresponding user, where the speaker profile “models” the “acoustics” of a “specific” “user’s” voice, where the user device B [the “electronic device”] adjusts/updates the speaker profile for user B by adding the audio/speech inputs identified as spoken by user B to the speaker profile  [“adjusting the user-specific acoustic model based on the plurality of speech inputs”], and where the speaker profile for user B is the “adjusted user-specific acoustic model” which is provided to the user device A [the “another electronic device”] so that the user device A can use the speaker profile for user B to identify whether an audio input received by user device A [“second speech input” relative to the audio inputs received by user device B which are used to update the speaker profile for user B] is spoken by user B [and for claim 10, the audio input is identified as being spoken by someone other than user B and thus user device A fails to identify the speaker of the audio input received by user device A as user B])
	Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing to perform a simple substitution of one type of speaker identification data with another because the prior art teaches the claimed invention except for the substitution of speaker identification data which is not necessarily updated based on a plurality of speech inputs spoken by a particular user with speaker identification data which is.  Kim teaches that speaker identification data which is updated based on a plurality of speech inputs spoken by a particular user was known in the art.  One of ordinary skill in the art could have substituted one type of speaker identification data with another to obtain the predictable results of a system where a user device B sends speaker identification data B to a user device A so that the user device A can determine whether a speech input was spoken by a user B corresponding to the speaker identification data B (as per A-G) where the speaker identification data B is a speaker profile for user B that is updated, by the user device B, by adding audio inputs identified as spoken by user B to the speaker profile for user B (as per Kim).
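For illustration only, the profile update relied upon from Kim (the most recently received audio input replacing the oldest one in the speaker profile) can be sketched as below.  The sketch is not taken from Kim or A-G; the class name `SpeakerProfile`, the `capacity` parameter, and the string placeholders standing in for audio inputs are all hypothetical.

```python
from collections import deque


class SpeakerProfile:
    """Hypothetical speaker profile that keeps only the most recent audio inputs."""

    def __init__(self, user_id, capacity=3):
        self.user_id = user_id
        # A bounded deque discards its oldest entry when a new one is
        # appended at full capacity (assumed stand-in for the replacement
        # of the oldest received audio input).
        self.voice_prints = deque(maxlen=capacity)

    def add_input(self, audio_input):
        # Called only after the input has been identified as spoken by this user.
        self.voice_prints.append(audio_input)


profile = SpeakerProfile("user_b", capacity=3)
for utterance in ["u1", "u2", "u3", "u4"]:
    profile.add_input(utterance)

print(list(profile.voice_prints))
```

After the fourth input is added, the oldest input is no longer in the profile, which is the sense in which repeated processing “adjusts” the profile using a plurality of inputs.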
Allowable Subject Matter
Claims 2-4, 8-9, 13-15, 19-23, and 27-28 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:  
	As per Claim(s) 2 (and similarly claim[s] 13 and 21), the prior art of record does not teach or suggest the combination of all limitations in claim(s) 1 and 2 together, including (i.e. in combination with the remaining limitations in claim[s] 1 and 2) wherein providing the adjusted user-specific acoustic model to the another electronic device comprises: determining whether the adjusted user-specific acoustic model has been trained on a threshold number of speech inputs; in accordance with a determination that the user-specific acoustic model has been trained on the threshold number of speech inputs, providing the adjusted user-specific acoustic model to the another electronic device; and in accordance with a determination that the adjusted user-specific acoustic model has not been trained on the threshold number of speech inputs: further adjusting the adjusted user-specific model based on a second plurality of speech inputs and a plurality of speech results; and providing the further adjusted user-specific acoustic model to the another electronic device.
	Upon further search (in response to the amendment filed 8/12/2022):
	2003/0088414 teaches “during the enrolment it is tested whether the model has already been trained sufficiently. If not, some further utterances are received and the partially completed model is adapted to the new utterances. In this way enrolment is also quick compared to having to start from scratch” (paragraph 18).  Paragraph 76 describes where a number of input utterances required depends on data that has already been collected (suggesting that “enough” training data is based on a number of input utterances).  This reference describes continuing to adapt a partially completed model based on whether the model has been trained sufficiently, but does not appear to specifically describe providing the model to another device based on whether a threshold number of utterances has been used to train the model.
9990926 teaches passive enrollment in a speaker ID system when a sufficient amount of speech is accumulated for a particular speaker (col. 2, line 46 – col. 3, line 7).  This reference does not appear to adjust speaker ID information until a sufficient number of samples have been accumulated (the reference appears to wait until enough samples are accumulated before enrolling a speaker).
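For illustration only, the threshold-based providing recited in claim 2 can be sketched as below.  This is a minimal sketch under stated assumptions and not applicant's implementation: the model is reduced to a count of speech inputs it has been trained on, and the names `adjust` and `model_to_provide` are hypothetical.

```python
def adjust(model, speech_inputs):
    """Hypothetical adjustment: only tracks how many inputs the model has seen."""
    return {"trained_on": model["trained_on"] + len(speech_inputs)}


def model_to_provide(model, threshold, second_inputs):
    """Return the model that would be provided to the other device."""
    if model["trained_on"] >= threshold:
        # Trained on the threshold number of inputs: provide as-is.
        return model
    # Otherwise further adjust on a second plurality of inputs, then provide.
    return adjust(model, second_inputs)


ready = model_to_provide({"trained_on": 10}, threshold=5, second_inputs=["a", "b"])
not_ready = model_to_provide({"trained_on": 3}, threshold=5, second_inputs=["a", "b"])
print(ready, not_ready)
```

In both branches a model is ultimately provided to the other device; the branches differ only in whether a further adjustment occurs first, which is the distinction the cited references do not appear to describe.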

As per Claim(s) 3 (and similarly claim[s] 14 and 22), the prior art of record does not teach or suggest the combination of all limitations in claim(s) 1 and 3 together, including (i.e. in combination with the remaining limitations in claim[s] 1 and 3) wherein identifying, with the adjusted user-specific acoustic model, the speaker of the second speech input comprises: providing the second speech input to the adjusted user-specific acoustic model to provide a first speech result and a first accuracy score corresponding to the first speech result; providing the second speech input to another user-specific acoustic model to provide a second speech result and a second accuracy score corresponding to the second speech result; and identifying the speaker of the second speech input based on the first accuracy score and the second accuracy score.
2004/0111261 teaches “The score can be used to recognize the identity of the user either by comparing the score of a given speaker model to scores obtained against other speaker models (also known as identification) or thresholding the score to make an acceptance/rejection decision (also known as verification)” (paragraph 29).  This reference appears to describe a “speech result” (an acceptance/rejection decision) based on an “accuracy score” (thresholding a score of a given speaker model).  This reference does not specifically teach where speech input is provided to the speaker models (comparing the speech input model to the speaker models does not provide the speech input to the speaker models, see paragraphs 25 and 28).
2016/0293167 teaches “Speech data corresponding to a particular utterance is input to the input layer of the neural network (1304). This may, for instance, correspond to recorded audio data 110 being provided to the input layer of the neural network 120 that is stored and run on client device 104. This may also correspond to verification utterance 1102 being provided to input layer 711 of speaker verification model 910” (paragraph 161).  Paragraph 52 describes where a speaker identifier module can be configured to verify that a voice input corresponds to a particular predetermined speaker, or may be used to determine which speaker, from among multiple speaker identities, spoke the utterance.  Paragraph 151 describes where a decision about an identity can be made based on comparing a score to a threshold.  This reference suggests providing speech input to a user-specific speaker verification model and obtaining a score and verifying a user as a speaker if the score exceeds a threshold.  This reference does not appear to describe where input speech is provided to each of a plurality of user-specific models and where the scores are used to identify the speaker.
2022/0122615 (LATE filing date) teaches “At 260, frame alignment may be performed through the speaker classification model established at 250. Speech frames in the audio stream may be provided to the speaker classification model, e.g., HMM, to align to respective HMM states of the HMM, and accordingly to align to respective speakers” (paragraph 42, see also paragraph 28).  The speaker classification model does not appear to be described as user-specific (as opposed to one model for a plurality of speakers).
6519563 teaches “The model which is used for comparison with the features extracted from the speech utterance is known as a "speaker dependent" model, since it is generated from training speech of a particular, single speaker. Models which are derived from training speech of a plurality of different speakers are known as "speaker independent" models, and are commonly used, for example, in speech recognition tasks. In its simplest form, speaker verification may be performed by merely comparing the test utterance features against those of the speaker dependent model, determining a "score" representing the quality of the match therebetween, and then making the decision to verify (or not) the claimed identity of the speaker based on a comparison of the score to a predetermined threshold. One common difficulty with this approach is that it is particularly difficult to set the threshold in a manner which results in a reasonably high quality of verification accuracy (i.e., the infrequency with which misverification--either false positive or false negative results--occurs). In particular, the predetermined threshold must be set in a speaker dependent manner--the same threshold that works well for one speaker is not likely to work well for another” (col. 2, lines 12-33).  This reference teaches away from making a decision to verify a speaker identity based on comparison of a score to a threshold.
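For illustration only, the distinction drawn above between thresholding a single score and the claim 3 arrangement (the same speech input provided to two user-specific models, with the speaker identified from both accuracy scores) can be sketched as below.  The sketch is not drawn from any cited reference; `make_toy_model`, `identify_speaker`, and the pitch-based score are hypothetical stand-ins for real acoustic models.

```python
def make_toy_model(reference_pitch):
    """Hypothetical user-specific model returning (speech result, accuracy score)."""
    def model(speech_input):
        text, pitch = speech_input
        # Assumed toy score: higher when the input pitch is near the user's pitch.
        score = 1.0 / (1.0 + abs(pitch - reference_pitch))
        return text, score
    return model


def identify_speaker(speech_input, models):
    """Provide the same input to every model and pick the highest-scoring user."""
    results = {user: model(speech_input) for user, model in models.items()}
    best_user = max(results, key=lambda user: results[user][1])
    return best_user, results


models = {"user_a": make_toy_model(120.0), "user_b": make_toy_model(210.0)}
speaker, results = identify_speaker(("call mom", 205.0), models)
print(speaker)
```

Here the identification depends on the relative scores of the two models rather than on comparing any one score to a predetermined threshold.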

As per Claim(s) 4 (and similarly claim[s] 15 and 23), the prior art of record does not teach or suggest the combination of all limitations in claim(s) 1 and 4 together, including (i.e. in combination with the remaining limitations in claim[s] 1 and 4) wherein receiving the plurality of speech inputs comprises: receiving one or more speech inputs of the plurality of speech inputs from the another electronic device.
Upon further search (in response to the amendment filed 8/12/2022):
2013/0132094 teaches “where the captured voice input data is to be forwarded to another device” (paragraph 27).
10515623 teaches “While a device may be operable for certain processing (e.g., detecting motion, playing music, etc.) it may not be configured to capture and send audio to a remote device for speech processing. To enable a device to capture and send audio to a remote device for speech processing (or otherwise process audio for speech processing), a wake command may be executed. A wake command is a command for a device of the speech-controlled system to capture audio of a spoken utterance for purposes of processing and execution of a command included in the utterance. In traditional speech-controlled systems the wake command may be a wakeword which is spoken to, and recognized by, a local device, which then captures the audio for an utterance and either processes it or forwards audio data of the utterance to another device for processing. The local device may continually listen for the wakeword and may disregard any audio detected that does not include the wakeword or is not preceded by the wakeword” (col. 3, line 63 – col. 4, line 13).  In the context of the rejection based on A-G and Kim, it appears to be illogical for speech inputs from a device’s user to be received from another device for the purposes of training a speaker model (especially when the device is within “hearing” distance of the device’s user).

	As per Claim(s) 8 (and similarly claim[s] 19 and 27, and consequently claim[s] 9, 20, and 28 which depend on claim[s] 8, 19, and 27), the prior art of record does not teach or suggest the combination of all limitations in claim(s) 1 and 8 together, including (i.e. in combination with the remaining limitations in claim[s] 1 and 8) providing the plurality of speech inputs to a user-independent acoustic model, the user-independent acoustic model providing a plurality of speech results based on a first predetermined portion of the plurality of speech inputs, wherein the user-independent acoustic model is based on a dataset, and wherein initializing the user-specific acoustic model comprises: initializing the user-specific acoustic model using the dataset.
Upon further search (in response to the amendment filed 8/12/2022):
2016/0234206 teaches “A speaker-independent acoustic model can recognize speech (more specifically can recognize a sound or a spoken word or phrase) from any person, including a person who has not submitted any speech audio for training the acoustic model. If the user speaks a predetermined password or pass code and the acoustic model recognizes it as the correct predetermined password or pass code, then the user is authenticated. Generally, more speech audio training data is required to create a speaker-independent model than a speaker-dependent model. This embodiment presents a “what you know” test” (paragraph 95; paragraph 94 also describes a speaker-dependent model).  This reference suggests where multiple speech results (multiple words of a phrase) can be provided by a user-independent acoustic model based on one spoken phrase (one speech input).
	2013/0311184 teaches “Herein, if the speaker identification can be identified by the speaker identification module 11, the speech recognition module 12 receives the speaker identification from the speaker identification module 11 and uses an acoustic model corresponding to the speaker identification to recognize a speech in the speech data (step S306). Otherwise, if the speaker identification can not be identified by the speaker identification module 11, a new speaker identification is created, and when the new speaker identification is received from the speaker identification module 11, the speech recognition module 12 uses a speaker independent acoustic model to recognize the speech in the speech data (step S308)” (paragraph 29).  In this reference, a speaker independent model is used if speaker identification can not be identified by the speaker identification module.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC YEN whose telephone number is (571)272-4249. The examiner can normally be reached M-F 12:00PM - 8:30PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, RICHEMOND DORVIL, can be reached at (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

EY 8/15/2022
/ERIC YEN/Primary Examiner, Art Unit 2658