DETAILED ACTION
1.	This communication is in response to the Examiner’s Amendments confirmed on 8/25/2022. Claims 1, 4-13, 20-23 are pending and have been examined. Claims 2-3, 14-19 are cancelled.
2.	All previous objections and rejections directed to the applicant’s disclosure and claims not discussed in this Office action have been withdrawn by the examiner.
Examiner’s Amendments
3.	Authorization for this examiner’s amendment was confirmed by Scott W. Higdon on 8/25/2022. This listing of claims will replace all prior versions, and listings, of claims in the application:
1. (Currently Amended) A method of speaker diarization, the method implemented by one or more processors and comprising:
generating a sequence of audio data frames for corresponding audio data; 
for each of the audio data frames, and in the sequence:
applying frame features for the audio data frame as input to a trained recurrent neural network (RNN) model, and
processing the frame features for the audio data frame using the trained RNN model to generate direct output, of the trained RNN model, that includes a corresponding probability for each of a plurality of permutation invariant speaker labels, wherein the trained RNN model comprises: 
a long short-term memory (LSTM) layer, and
an affine layer as a final layer, the affine layer having an output dimension that conforms to the plurality of speaker labels; 
for each of a first plurality of the audio data frames, assigning a corresponding one of the plurality of speaker labels to the audio data frame in response to the corresponding probability, for the audio data frame, satisfying a threshold; 
for each of a second plurality of the audio data frames, assigning an unknown label to the audio data frame in response to the corresponding probabilities, generated as direct output of the trained RNN model for the audio data frame, all failing to satisfy the threshold; and
transmitting an indication of the speaker labels, the unknown labels, and their assignments to at least one additional component for further processing of the audio data based on the speaker labels and the unknown labels.
2. (Canceled)
3. (Canceled)
4. (Previously Presented) The method of claim 1, wherein the trained RNN model is trained to enable detection of different human speakers and to enable detection of a lack of any human speakers.
5. (Original) The method of claim 4, further comprising:
determining that a given speaker label of the plurality of speaker labels corresponds to the lack of any human speakers, wherein determining that the given speaker label corresponds to the lack of any human speakers comprises:
performing further processing of one or more of the audio data frames having the assigned given speaker label to determine that the one or more of the audio data frames each include silence or background noise.
6. (Original) The method of claim 5, wherein transmitting the indication of the speaker labels and their assignments to the at least one additional component for further processing of the audio data based on the speaker labels comprises:
identifying, in the indication of the speaker labels and their assignments, portions of the audio data that include silence or background noise. 
7. (Previously Presented) The method of claim 1, wherein the frame features for each of the audio data frames comprise Mel-frequency cepstral coefficients of the audio data frame.
8. (Previously Presented) The method of claim 1, further comprising:
receiving, via one or more network interfaces, the audio data as part of a speech processing request transmitted utilizing an application programming interface;
wherein generating the sequence of audio data frames, applying the frame features of the audio data frames, processing the frame features of the audio data frames, and assigning the speaker labels and the unknown labels to the audio data frames are performed in response to receiving the speech processing request; and 
wherein transmitting the indication of the speaker labels, the unknown labels, and their assignments is via one or more of the network interfaces, and is in response to the speech processing request.
9. (Previously Presented) The method of claim 1,
wherein the audio data is streaming audio data that is based on output from one or more microphones of a client device, wherein the client device includes an automated assistant interface for interfacing with an automated assistant, and wherein the streaming audio data is received in response to invocation of the automated assistant via the client device; and
wherein transmitting the indication of the speaker labels, the unknown labels, and their assignments to at least one additional component for further processing of the audio data based on the speaker labels and the unknown labels comprises transmitting the indication of the speaker labels, the unknown labels, and their assignments to an automated assistant component of the automated assistant.
10. (Original) The method of claim 9, wherein the automated assistant component of the automated assistant is an automatic speech recognition (ASR) component that processes the audio data to generate text corresponding to the audio data.
11. (Original) The method of claim 10, wherein the ASR component utilizes the speaker labels to identify a transition between speakers in the audio data and, based on the transition, alters processing of the audio data that follows the transition.
12. (Previously Presented) The method of claim 9, wherein the at least one additional component of the automated assistant includes a natural language understanding component.
13. (Previously Presented) The method of claim 9, wherein the automated assistant generates a response based on the further processing of the audio data based on the speaker labels, and causes the response to be rendered at the client device.
14. (Canceled)
15. (Canceled)
16. (Canceled)
17. (Canceled)
18. (Canceled)
19. (Canceled)
20. (Currently Amended) A method implemented by one or more processors, the method comprising:
receiving a stream of audio data that is based on output from one or more microphones of a client device, the client device including an automated assistant interface for an automated assistant, and the stream of audio data received in response to invocation of the automated assistant; 
generating a sequence of audio data frames as the stream of audio data is received; 
for each of the audio data frames, and in the sequence:
applying frame features for the audio data frame as input to a trained recurrent neural network (RNN) model;
processing the frame features for the audio data frame using the trained RNN model to generate direct output, of the trained RNN model, that includes a corresponding probability for each of a plurality of permutation invariant speaker labels, wherein the trained RNN model comprises: 
a long short-term memory (LSTM) layer, and
an affine layer as a final layer, the affine layer having an output dimension that conforms to the plurality of speaker labels; and
for each of a first plurality of the audio data frames, assigning a corresponding one of the plurality of speaker labels to the audio data frame in response to the corresponding probability, for the audio data frame, satisfying a threshold; 
for each of a second plurality of the audio data frames, assigning an unknown label to the audio data frame in response to the corresponding probabilities, generated as direct output of the trained RNN model for the audio data frame, all failing to satisfy the threshold; and
using, by the automated assistant, the assigned speaker labels and the assigned unknown labels in processing of the stream of audio data.
21. (Previously Presented) The method of claim 20, wherein using, by the automated assistant, the assigned speaker labels and the assigned unknown labels in processing of the stream of audio data comprises using the assigned speaker labels in performing automatic speech recognition.
22. (Previously Presented) The method of claim 20, wherein using, by the automated assistant, the assigned speaker labels and the assigned unknown labels in processing of the stream of audio data comprises using the assigned speaker labels and the assigned unknown labels in performing natural language understanding.
23. (Currently Amended) A client device, comprising:
one or more microphones;
an automated assistant interface for an automated assistant;
memory storing instructions;
one or more processors operable to execute the instructions to:
receive a stream of audio data that is based on output from the one or more microphones; 
generate a sequence of audio data frames as the stream of audio data is received; 
for each of the audio data frames, and in the sequence:
apply frame features for the audio data frame as input to a trained recurrent neural network (RNN) model;
process the frame features for the audio data frame using the trained RNN model to generate direct output, of the trained RNN model, that includes a corresponding probability for each of a plurality of permutation invariant speaker labels, wherein the trained RNN model comprises: 
a long short-term memory (LSTM) layer, and
an affine layer as a final layer, the affine layer having an output dimension that conforms to the plurality of speaker labels; and
for each of a first plurality of the audio data frames, assign a corresponding one of the plurality of speaker labels to the audio data frame in response to the corresponding probability, for the audio data frame, satisfying a threshold;
for each of a second plurality of the audio data frames, assign an unknown label to the audio data frame in response to the corresponding probabilities, generated as direct output of the trained RNN model for the audio data frame, all failing to satisfy the threshold; and
use, by the automated assistant, the assigned speaker labels and the assigned unknown labels in processing of the stream of audio data.
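
For illustration only, the frame-labeling steps recited in the independent claims above can be sketched as follows. This is a minimal sketch under stated assumptions and is not part of the claims or of the applicant's disclosure: the names frame_audio, assign_labels, and UNKNOWN are hypothetical, "satisfying a threshold" is assumed to mean greater than or equal to the threshold, the highest-probability label is assumed to be the one assigned, and the per-frame probabilities are supplied directly rather than produced by a trained RNN model.

```python
from typing import List, Sequence, Union

UNKNOWN = "unknown"  # hypothetical stand-in for the claimed "unknown label"


def frame_audio(samples: Sequence[float], frame_len: int) -> List[List[float]]:
    """Split samples into consecutive, non-overlapping frames.

    Assumption: a trailing partial frame is dropped.
    """
    return [
        list(samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def assign_labels(
    frame_probs: Sequence[Sequence[float]], threshold: float
) -> List[Union[int, str]]:
    """Assign each frame a speaker-label index, or UNKNOWN.

    frame_probs[t][k] is the probability of speaker label k for frame t,
    standing in for the direct output of the model. A frame receives
    UNKNOWN when all of its probabilities fail to satisfy the threshold.
    """
    labels: List[Union[int, str]] = []
    for probs in frame_probs:
        # Index of the most probable speaker label for this frame.
        best = max(range(len(probs)), key=lambda k: probs[k])
        if probs[best] >= threshold:  # assumption: "satisfying" means >=
            labels.append(best)
        else:
            labels.append(UNKNOWN)
    return labels


frames = frame_audio([1.0, 2.0, 3.0, 4.0, 5.0], frame_len=2)
labels = assign_labels([[0.9, 0.1], [0.4, 0.45], [0.2, 0.8]], threshold=0.5)
```

In this sketch the second frame is labeled unknown because neither of its probabilities reaches the assumed threshold of 0.5.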


Reasons for Allowance
4.	Claims 1, 4-13, 20-23 are allowable. The following is the examiner’s statement of reasons for allowance: The closest prior art of record is: Yu, et al. (US 20160140956; Title: Prediction-based sequence recognition), Dong Yu (US 20170337924; Title: Permutation invariant training for talker-independent multi-talker speech separation), Gerl, et al. (EP 2048656B1; Title: Speaker recognition) and Catanzaro, et al. (US 20170148433; Title: Deployed end-to-end speech recognition). Details of the references’ teachings can be found in previous Office actions.
None of the above-mentioned references, alone or in combination, teaches or renders obvious the specific combinations of limitations recited in the amended claims.
Any comments considered necessary by the applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee. Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”
Conclusion
5.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to FENG-TZER TZENG whose telephone number is (571)272-4609. The examiner can normally be reached on M-F (8:00-5:30). The fax phone number where this application or proceeding is assigned is 571-273-4609.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir (SPE), can be reached on 571-272-7799.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/FENG-TZER TZENG/		8/26/2022
Primary Examiner, Art Unit 2659