DETAILED ACTION
1.	This communication is in response to the Amendments and Arguments filed on 6/1/2022. Claims 1-13 and 20-23 are pending and have been examined. Claims 14-19 are cancelled.
Response to Amendments and Arguments

2.	Applicant's arguments with respect to claim rejections under 35 USC 103 have been fully considered, but they are not persuasive. 
In particular, the applicant argues that the references do not teach “processing the frame features for the audio data frame using the trained RNN model to generate direct output, of the trained RNN model, that includes a corresponding probability for each of a plurality of permutation invariant speaker labels ..” and “assigning an unknown label to the audio data frame in response to the corresponding probabilities, generated as direct output of the trained RNN model for the audio data frame, all failing to satisfy the threshold ..” In response, the examiner respectfully disagrees.
Note that YU teaches: [0003] “an audio indexing system that receives audio signals and indexes various characteristics of the signal, such as a speaker identity;” [0024] “a recurrent neural network” which reads on a trained RNN model; Fig. 3, which shows output layer of a DNN/RNN model; Fig. 5 (318), which shows state posterior probability for the current frame; [0021] “classification component 154 receives a set of observed features for each frame of the input data or signal, and outputs classification result 150 comprising a state label for each frame based on the set of observed features for that frame” where the classification component is a part of the audio indexing system (based on the RNN model), and the citations read on “assigning a corresponding one of the plurality of speaker labels to the audio data frame in response to the corresponding probability, for the audio data frame, satisfying a threshold” and [0054] “verification measure 356 comprises a likelihood measure, and can be in the form of a numerical likelihood score that indicates how likely hypothesis 352 is an accurate prediction of the observation” where likelihood reads on probability and hypothesis reads on speaker labeling.
GERL teaches: [0074] “Then, the likelihoods <read on probability> are fed to this neural network to perform the speaker recognition based on these input variables” and [0027] “The detecting step may further comprise comparing the likelihood functions for the received speech input with a predetermined threshold .. if a likelihood function .. is below the predetermined threshold, it may be determined that the speech input does not match the corresponding speaker model. If no match with any of the speaker models is determined, it is determined that the speech input corresponds to an unknown speaker.”  
DYU teaches: [0026] “these techniques can be implemented into a neural network's structure itself, solving the label permutation problem;” [0007-0008] “solutions to the label ambiguity or label permutation problem .. compensate for permutations in the training label,” [0020] “to conduct permutation invariant training (“PIT”) of deep learning models for talker-independent multi-talker scenarios” and [0025] “employ permutation invariant training .. of deep learning models for speech separation that functions for independent talkers in a multi-talker signal.” DYU clearly teaches a ready mechanism (by training) to make the speaker labels permutation invariant. Once the RNN model is trained to be speaker label permutation invariant, the system (based on the RNN model) will not have a label permutation problem in actual use.
Claim Rejections - 35 USC § 103
3.	Claims 1, 4-13, 20-23 are rejected under 35 U.S.C. 103 as being unpatentable over Yu, et al. (US 20160140956; hereinafter YU) in view of Dong Yu (US 20170337924; hereinafter DYU), and further in view of Gerl, et al. (EP 2048656B1; hereinafter GERL).
As per claim 1, YU (Title: Prediction-based sequence recognition) discloses “A method of speaker diarization, the method implemented by one or more processors (YU, [0019], processor; [0003], an audio indexing system that receives audio signals and indexes various characteristics of the signal, such as a speaker identity .. a speaker recognition system that receives an audio input stream and identifies the various speakers that are speaking in the audio stream .. Another function often performed is speaker segmentation and tracking, also known as speaker diarization) and comprising:
generating a sequence of audio data frames for corresponding audio data (YU, [0019], recognition result 104 comprises phonemes for an utterance that is provided to a language model to identify a word sequence; [0021], each frame of the input data or signal);
for each of the audio data frames, and in the sequence: applying frame features for the audio data frame as input to a trained recurrent neural network (RNN) model, and processing the frame features for the audio data frame using the trained RNN model to generate direct output, of the trained RNN model, that includes a corresponding probability for [ each of a plurality of permutation invariant speaker labels ]; for each of a plurality of the audio data frames, assigning a corresponding one of the plurality of speaker labels to the audio data frame in response to the corresponding probability, for the audio data frame, satisfying a threshold (YU, [0024], sequence recognizer 110 comprises a recurrent neural network <read on trained RNN>; Fig. 3 <showing output layer>; [0021], classification component 154 receives a set of observed features for each frame of the input data or signal, and outputs classification result 150 comprising a state label for each frame based on the set of observed features for that frame <read on ‘assigning a corresponding one of the plurality of speaker labels to the audio data frame in response to the corresponding probability, for the audio data frame, satisfying a threshold’>; [0054], verification measure 356 comprises a likelihood measure, and can be in the form of a numerical likelihood score <read on probability> that indicates how likely hypothesis 352 is an accurate prediction of the observation); and 
for each of a second plurality of the audio data frames, [ assigning an unknown label to the audio data frame in response to the corresponding probabilities, generated as direct output of the trained RNN model for the audio data frame, all failing to satisfy the threshold ] (YU, [0024], a recurrent neural network <read on trained RNN>; Fig. 3 <showing output layer of a DNN/RNN>; Fig. 5, 318); and 
transmitting an indication of the speaker labels, the unknown labels, and their assignments to at least one additional component for further processing of the audio data based on the speaker labels and the unknown labels (YU, [0003], A speech processing system can also include .. an audio indexing system that receives audio signals and indexes various characteristics of the signal, such as a speaker identity, subject matter, emotion, etc. <where speaker identity reads on speaker label which can be fed/sent/transmitted to any other system components for any further processing per system design choice>; DYU, [Abstract], automatic speech recognition (“ASR”); [0007], talker-dependent models by assuming that the talker is known during the training time, which results in a closed set of target speakers at evaluation time; Fig. 2; [0089], PDA 59 can include an internal antenna and an infrared transmitter/receiver that allow for wireless communication with other computers as well as connection ports that allow for hardware connections to other computing devices).” 
YU does not expressly disclose “each of a plurality of permutation invariant speaker labels ..” However, this feature is taught by DYU (Title: Permutation invariant training for talker-independent multi-talker speech separation).
In the same field of endeavor, DYU teaches: [0026] “these techniques can be implemented into a neural network's structure itself, solving the label permutation problem;” [0007-0008] “solutions to the label ambiguity or label permutation problem .. compensate for permutations in the training label,” [0020] “to conduct permutation invariant training (“PIT”) of deep learning models for talker-independent multi-talker scenarios” and [0025] “employ permutation invariant training (“PIT”, also permutation invariant trained, in some syntactic contexts) of deep learning models for speech separation that functions for independent talkers in a multi-talker signal.”
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of DYU in the system taught by YU to enable permutation invariant training (“PIT”) of deep learning models for speaker diarization.
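By way of illustration only, the general idea of a permutation invariant training objective can be sketched as follows. This is a minimal sketch and not DYU's implementation: the function name `pit_loss`, the use of mean squared error, and the list-of-lists data layout are all assumptions made for the example. The loss is evaluated under every assignment of output streams to target streams and the smallest value is kept, so the order in which the training labels are listed does not matter.

```python
from itertools import permutations
from typing import Sequence


def pit_loss(outputs: Sequence[Sequence[float]],
             targets: Sequence[Sequence[float]]) -> float:
    """Return the minimum average MSE over all output-to-target assignments.

    `outputs` and `targets` each hold one sequence per talker (hypothetical
    layout). MSE is an assumed error measure chosen for illustration.
    """
    n = len(outputs)

    def mse(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

    # Try every permutation of the target order and keep the best match.
    return min(
        sum(mse(outputs[i], targets[p[i]]) for i in range(n)) / n
        for p in permutations(range(n))
    )
```

Under this sketch, swapping the order of the targets leaves the loss unchanged, which is the property the label permutation discussion relies on.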
YU in view of DYU does not expressly disclose “assigning an unknown label to the audio data frame in response to the corresponding probabilities, for the audio data frame, all failing to satisfy the threshold ..” However, this feature is taught by GERL (Title: Speaker recognition).
In the same field of endeavor, GERL teaches: [0074] “Then, the likelihoods <read on probability> are fed to this neural network to perform the speaker recognition based on these input variables” and [0027] “The detecting step may further comprise comparing the likelihood functions for the received speech input with a predetermined threshold .. if a likelihood function .. is below the predetermined threshold, it may be determined that the speech input does not match the corresponding speaker model. If no match with any of the speaker models is determined, it is determined that the speech input corresponds to an unknown speaker.”  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of GERL in the system taught by YU and DYU for assigning a speaker label (a known or an unknown speaker label), such as for selecting the subsequent speech recognizer: a speaker-dependent recognizer (if the speaker is known) or a speaker-independent recognizer (if the speaker is unknown).
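By way of illustration only, the threshold comparison described in GERL [0027], as applied to per-frame probabilities, can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from any cited reference: the names `assign_labels` and `UNKNOWN`, the default threshold of 0.5, the choice of the highest-probability label, and the reading of "satisfying" as greater than or equal to the threshold are all hypothetical.

```python
from typing import List, Sequence, Union

UNKNOWN = "unknown"  # hypothetical marker for the unknown label


def assign_labels(frame_probs: Sequence[Sequence[float]],
                  threshold: float = 0.5) -> List[Union[int, str]]:
    """Assign one label per frame from per-label probabilities.

    A frame receives the index of its highest-probability speaker label if
    that probability meets the threshold (assumed to mean >=). If every
    probability for the frame falls below the threshold, the frame receives
    the UNKNOWN label.
    """
    labels: List[Union[int, str]] = []
    for probs in frame_probs:
        best = max(range(len(probs)), key=lambda i: probs[i])
        labels.append(best if probs[best] >= threshold else UNKNOWN)
    return labels
```

For example, with a threshold of 0.5, a frame with probabilities (0.2, 0.3) has no label that meets the threshold and is therefore marked unknown, while a frame with probabilities (0.9, 0.1) is assigned label 0.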
As per Claim 4 (dependent on claim 1), YU in view of DYU and GERL further discloses “wherein the trained RNN model is trained to enable detection of different human speakers and to enable detection of a lack of any human speakers (YU, [0057], hypothesis 352 and verification 356 can collectively indicate that there is a high likelihood that the input signal includes speech from two different speakers; [0035], sequence recognizer 110 can be used for predicting .. clean speech frames, and/or noise frames <read on lack of human speaker>; [0056], prediction component 152 generates a second prediction from the noisy speech indicative of the noise without the speech).”  
As per Claim 5 (dependent on claim 4), YU in view of DYU and GERL further discloses “determining that a given speaker label of the plurality of speaker labels corresponds to the lack of any human speakers, wherein determining that the given speaker label corresponds to the lack of any human speakers comprises: performing further processing of one or more of the audio data frames having the assigned given speaker label to determine that the one or more of the audio data frames each include silence or background noise (YU, [0003], indexes various characteristics of the signal, such as a speaker identity <where index reads on label>; [0035], sequence recognizer 110 can be used for predicting .. clean speech frames, and/or noise frames <read on lack of human speaker>; [0056], prediction component 152 generates a second prediction from the noisy speech indicative of the noise without the speech).” 
As per Claim 6 (dependent on claim 5), YU in view of DYU and GERL further discloses “wherein transmitting the indication of the speaker labels and their assignments to the at least one additional component for further processing of the audio data based on the speaker labels comprises: identifying, in the indication of the speaker labels and their assignments, portions of the audio data that include silence or background noise (YU, [0003], indexes various characteristics of the signal, such as a speaker identity <where index reads on label>; [0056], prediction component 152 generates a second prediction from the noisy speech indicative of the noise without the speech).” 
As per Claim 7 (dependent on claim 1), YU in view of DYU and GERL further discloses “wherein the frame features for each of the audio data frames comprise Mel-frequency cepstral coefficients of the audio data frame (YU, [0019], the feature vectors 108 can be Mel Cepstrum features (e.g., Mel-frequency cepstral coefficients (MFCCs)), linear predictive Cepstral coefficients (LPCCs), among a wide variety of other acoustic or non-acoustic features).”  
As per Claim 8 (dependent on claim 1), YU in view of DYU and GERL further discloses “receiving, via one or more network interfaces, the audio data as part of a speech processing request transmitted utilizing an application programming interface; wherein generating the sequence of audio data frames, applying the frame features of the audio data frames, processing the frame features of the audio data frames, and assigning the speaker labels and the unknown labels to the audio data frames are performed in response to receiving the speech processing request; and wherein transmitting the indication of the speaker labels, the unknown labels and their assignments is via one or more of the network interfaces, and is in response to the speech processing request (see Claim 1 rejections. YU, [0098], A user may enter commands <read on request> and information into the computer 810 through input devices such as .. a microphone 863 .. user input interface; [0100], When used in a LAN networking environment, the computer 810 is connected to the LAN 871 through a network interface).” 
As per Claim 9 (dependent on claim 1), YU in view of DYU and GERL further discloses “wherein the audio data is streaming audio data that is based on output from one or more microphones of a client device, wherein the client device includes an automated assistant interface for interfacing with an automated assistant, and wherein the streaming audio data is received in response to invocation of the automated assistant via the client device (YU, [0003], the microphone that captured the audio input stream; [0098], user input interface; [0100], a network interface; [0067], They are functional parts of the systems or devices to which they belong and are activated <read on invocation> by, and facilitate the functionality of the other components or items in those systems); and
wherein transmitting the indication of the speaker labels, the unknown labels, and their assignments to at least one additional component for further processing of the audio data based on the speaker labels and the unknown labels comprises transmitting the indication of the speaker labels, the unknown labels, and their assignments to an automated assistant component of the automated assistant (see Claim 1 rejections. YU, [0003], indexes various characteristics of the signal, such as a speaker identity <where index reads on label which can be fed/sent/transmitted to any other system components for any further processing per system design choice>; [0057], hypothesis 352 and verification 356 can collectively indicate that there is a high likelihood that the input signal includes speech from two different speakers. Using this information, classification component 154 can break the input into two different speech streams for processing <read on speaker-dependent speech processing>).”  
As per Claim 10 (dependent on claim 9), YU in view of DYU and GERL further discloses “wherein the automated assistant component of the automated assistant is an automatic speech recognition (ASR) component that processes the audio data to generate text corresponding to the audio data (YU, [0018], speech recognition; [0003], a speech recognizer receives an audio input signal and, in general, recognizes speech in the audio signal, and may transcribe the speech into text).” 
As per Claim 11 (dependent on claim 10), YU in view of DYU and GERL further discloses “wherein the ASR component utilizes the speaker labels to identify a transition between speakers in the audio data and, based on the transition, alters processing of the audio data that follows the transition (YU, [0003], indexes various characteristics of the signal, such as a speaker identity <where index reads on label>; [0057], hypothesis 352 and verification 356 can collectively indicate that there is a high likelihood that the input signal includes speech from two different speakers. Using this information, classification component 154 can break the input into two different speech streams for processing <read on speaker-dependent speech recognition>; [0003], speech recognition).”  
As per Claim 12 (dependent on claim 9), YU in view of DYU and GERL further discloses “wherein the at least one additional component of the automated assistant includes a natural language understanding component (YU, [0003], A speech processing system can also include speech understanding (or natural language understanding) systems, that receive an audio signal, identify the speech in the signal, and identify an interpretation of the content of that speech).” 
As per Claim 13 (dependent on claim 9), YU in view of DYU and GERL further discloses “wherein the automated assistant generates a response based on the further processing of the audio data based on the speaker labels, and causes the response to be rendered at the client device (YU, [0003], indexes various characteristics of the signal, such as a speaker identity <where index reads on label>; [0057], hypothesis 352 and verification 356 can collectively indicate that there is a high likelihood that the input signal includes speech from two different speakers. Using this information, classification component 154 can break the input into two different speech streams for processing <read on speaker-dependent speech processing for a response that can be rendered anywhere at a client device or elsewhere as system design choice>).”
Claim 20 (similar in scope to claim 1) is rejected under the same rationale as applied above for claim 1. YU also teaches: [0098] “A user may enter commands and information into the computer 810 through input devices such as .. a microphone 863.”
As per Claim 21 (dependent on claim 20), YU in view of DYU and GERL further discloses “wherein using, by the automated assistant, the assigned speaker labels and the assigned unknown labels in processing of the stream of audio data comprises using the assigned speaker labels in performing automatic speech recognition (DYU, [Abstract], automatic speech recognition (“ASR”); [0007], talker-dependent models by assuming that the talker is known during the training time, which results in a closed set of target speakers at evaluation time).”     
As per Claim 22 (dependent on claim 20), YU in view of DYU and GERL further discloses “wherein using, by the automated assistant, the assigned speaker labels and the assigned unknown labels in processing of the stream of audio data comprises using the assigned speaker labels and the assigned unknown labels in performing natural language understanding (DYU, [Abstract], automatic speech recognition (“ASR”) <also read on the well-known natural language understanding typically following ASR>; [0007], talker-dependent models by assuming that the talker is known during the training time, which results in a closed set of target speakers at evaluation time; [0011], talker-independent multi-talker scenarios).”
Claim 23 (similar in scope to claim 1) is rejected under the same rationale as applied above for claim 1. YU also teaches: [0098] “A user may enter commands and information into the computer 810 through input devices such as .. a microphone 863” and [0067] “processors .. memory.” 
4.	Claims 2-3 are rejected under 35 U.S.C. 103 as being unpatentable over YU in view of DYU and GERL, and further in view of Catanzaro, et al. (US 20170148433; hereinafter CATANZARO).
As per Claim 2 (dependent on claim 1), YU in view of DYU and GERL further discloses “wherein the trained RNN model comprises [ a long short-term memory (LSTM) layer ].”
YU in view of DYU and GERL does not expressly disclose “a long short-term memory (LSTM) layer ..” However, this feature is taught by CATANZARO (Title: Deployed end-to-end speech recognition).
In the same field of endeavor, CATANZARO teaches: [0007] “a deep learning method involving long short term memory (LSTM) and recurrent neural network (RNN).”
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of CATANZARO in the system taught by YU, DYU and GERL to enable using an LSTM RNN for speaker identification.
As per Claim 3 (dependent on claim 2), YU in view of DYU, GERL and CATANZARO further discloses “wherein the trained RNN model further comprises an affine layer as a final layer, the affine layer having an output dimension that conforms to the plurality of speaker labels (CATANZARO, [0071], a typical feed-forward layer containing an affine transformation followed by a non-linearity; [0007], a deep learning method involving long short term memory (LSTM) and recurrent neural network (RNN) <where neural network reads on the affine layer component, and as a speaker identification classifier reads on ‘an output dimension that conforms to the plurality of speaker labels’>).”
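By way of illustration only, an affine final layer whose output dimension matches the number of speaker labels can be sketched as follows. This is a minimal sketch and not CATANZARO's or the applicant's implementation: the helper names `affine` and `softmax`, the hidden-state size of 3, the four speaker labels, the weight values, and the use of a softmax non-linearity after the affine transformation are all assumptions made for the example. The hidden vector stands in for the output of a preceding recurrent (e.g., LSTM) layer.

```python
import math
from typing import List, Sequence


def affine(h: Sequence[float],
           weights: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Compute W*h + b, producing one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(weights, bias)]


def softmax(z: Sequence[float]) -> List[float]:
    """Map scores to probabilities; subtracting the max avoids overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values: a 3-dimensional hidden state and 4 speaker labels,
# so the weight matrix has 4 rows (one per label) and 3 columns.
hidden = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, 0.3],
           [0.0, 0.0, 0.0],
           [-0.1, 0.4, 0.2],
           [0.3, -0.2, 0.1]]
bias = [0.0, 0.1, -0.1, 0.0]

probs = softmax(affine(hidden, weights, bias))  # one probability per label
```

In this sketch the number of rows in `weights` fixes the output dimension, so adding or removing a speaker label changes only the shape of the final layer.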
Conclusion
5.	THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).   
	A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 		
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FENG-TZER TZENG whose telephone number is (571)272-4609. The examiner can normally be reached on M-F (8:30-5:00). The fax phone number where this application or proceeding is assigned is 571-273-4609.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir (SPE) can be reached on 571-272-7799. 
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/FENG-TZER TZENG/		6/7/2022
Primary Examiner, Art Unit 2659