DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claims 1, 3, 5, 10, 13, 15 and 20 are amended. Claims 4 and 14 are cancelled. Claims 1-3, 5-13 and 15-24 are presented for examination. 
Allowable Subject Matter
Claims 1-3, 5-13 and 15-24 are allowed.
The following are the reasons for allowance:
The cited prior art of Mesgarani (US Pub: 20190066713), further in view of LeRoux (US Pub: 20190318725), or any prior art searched or made of record, alone or in combination, fails to teach the concept of claims 1, 10 and 20 as a whole, including obtaining a recording comprising speech from a plurality of speakers; processing the recording using a speaker neural network having speaker parameter values, wherein the speaker neural network is configured to process the recording in accordance with the speaker parameter values to generate, for each of a plurality of time steps in a time period that the recording spans, a respective plurality of per-time-step speaker representations for the time step, wherein each per-time-step speaker representation of the plurality of per-time-step speaker representations represents features of a respective identified speaker in the recording for the time step; clustering the respective pluralities of per-time-step speaker representations for the plurality of time steps to generate a plurality of clusters of per-time-step speaker representations, wherein each cluster of the plurality of clusters corresponds to a different speaker and includes multiple different per-time-step speaker representations identified for the corresponding speaker at multiple different time steps; generating a plurality of per-recording speaker representations, wherein each per-recording speaker representation is a centroid of a different one of the plurality of clusters that each corresponds to a different speaker and includes multiple different per-time-step speaker representations that have been identified for the corresponding speaker at multiple different time steps, and wherein each per-recording speaker representation represents features of a respective identified speaker in the recording; and processing the per-recording speaker representations and the recording using a separation neural network having separation parameter values and configured
to process the recording and the plurality of per-recording speaker representations that each is a centroid of a different one of the plurality of clusters that each corresponds to a different speaker and includes multiple different per-time-step speaker representations that have been identified for the corresponding speaker at multiple different time steps by the speaker neural network in accordance with the separation parameter values to generate, for each per-recording speaker representation, a respective predicted isolated audio signal that corresponds to speech of one of the speakers in the recording; wherein the separation neural network comprises a stack of neural network blocks, a first neural network block in the stack configured to receive as input the recording and the plurality of per-recording speaker representations that each is a centroid of a different one of the plurality of clusters that each corresponds to a different speaker and includes multiple different per-time-step speaker representations that have been identified for the corresponding speaker at multiple different time steps generated by the speaker neural network, wherein the plurality of per-recording speaker representations satisfy:

$$\{c^{i}\}_{i=1}^{N} = \mathrm{kmeans}\left(\{h_{t}^{i}\}_{i,t} : N\right)$$

wherein N represents a number of clusters of per-time-step speaker representations at different time steps t, $\{c^{i}\}_{i=1}^{N}$ represents a set of the respective centroids of the N clusters of per-time-step speaker representations at different time steps t, and $\{h_{t}^{i}\}_{i,t}$ represents a set of the N per-time-step speaker representations at different time steps t.
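
For illustration only, the k-means relation recited above can be sketched in a few lines of Python. This is a minimal sketch and not the applicant's implementation: the function names, the nested-list layout h[t][i], the squared Euclidean distance, and the initialization from the first N pooled vectors are all assumptions made here to keep the example deterministic and self-contained.

```python
def kmeans(points, n_clusters, n_iters=50):
    """Plain Lloyd's k-means over a list of equal-length float vectors.

    Assumption: centroids are initialized from the first n_clusters points.
    """
    centroids = [list(p) for p in points[:n_clusters]]
    assignments = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for idx, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assignments[idx] = dists.index(min(dists))
        # Update step: each centroid becomes the mean of its members.
        for k in range(n_clusters):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


def per_recording_representations(h, n_speakers):
    """h[t][i] is the i-th per-time-step speaker vector at time step t.

    All vectors are pooled over i and t, clustered into n_speakers clusters,
    and the cluster centroids are returned as the per-recording vectors.
    """
    flat = [vec for step in h for vec in step]
    centroids, _ = kmeans(flat, n_speakers)
    return centroids


# Hypothetical toy data: 3 time steps, 2 speakers, 2-dimensional vectors.
# The speaker order is deliberately inconsistent across time steps.
h = [
    [[0.0, 0.1], [5.0, 5.1]],
    [[5.1, 4.9], [0.1, 0.0]],
    [[-0.1, -0.1], [4.9, 5.0]],
]
centroids = per_recording_representations(h, 2)
```

In this toy input the two centroids land near (0, 0) and (5, 5), one per speaker, even though the speaker order differs from one time step to the next.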

Newly cited (cited by examiner) prior art of Song (US Pub: 20200043508) generally teaches the concept of estimating the number of speakers and then using k-means clustering with the estimate. The x-means algorithm was forced to split into at least 2 clusters by initializing it with 2 centroids. Note that there are usually multiple moving parts in complete diarization systems in the literature. In particular, more sophisticated clustering algorithms, overlapping test segments and calibration can be incorporated to improve the overall diarization performance. However, Song does not explicitly teach the concept of "clustering the respective pluralities of per-time-step speaker representations for the plurality of time steps to generate a plurality of clusters of per-time-step speaker representations; wherein each cluster of the plurality of clusters corresponds to a different speaker and includes multiple different per-time-step speaker representations identified for the corresponding speaker at multiple different time steps; generating a plurality of per-recording speaker representations, wherein each per-recording speaker representation is a centroid of a different one of the plurality of clusters that each corresponds to a different speaker and includes multiple different per-time-step speaker representations that have been identified for the corresponding speaker at multiple different time steps" as suggested by the specific formula recited in claim 1 in combination with the claimed concept.

Newly cited (cited by examiner) prior art of Ganguly (US Pub: 20210019556) generally teaches that, when centroids are selected at block 710 for each cluster, embodiments use the Hamming distance identified at block 712 (which can be facilitated by a Hamming component 112 (FIG. 1)) and assign data points to a nearest cluster. Through each iteration at block 714, embodiments recompute approximate signatures of new centroids using bit-encoded vectors. This process is repeated from block 710 until the data points converge and are encompassed by their corresponding cluster centroids. In general, the number of iterations ranges between 50 and 100. Upon the execution of these iterations, if a limit is not reached, the process is repeated from block 708, followed by blocks 710, 712, 714 and 716. Blocks 714 and 716 can be facilitated by the estimating component 114 and the clustering component 110 working in combination (FIG. 1). If there are no more iterations to be executed at block 718, the algorithm is completed at block 720 and the data points are clustered in their corresponding families along with the centroid of each cluster. In these embodiments, the transformation of Euclidean vectors to Hamming vectors is distance preserving, such that the Euclidean vectors are in close proximity if the Hamming vectors are also close, and vice versa. In theory, centroid computation can be performed in Hamming space; however, this does not guarantee that the resulting centroid represents the true centroid in the Euclidean space. Hence, these embodiments help obtain an accurate approximation, such that the centroid in the Hamming space is proximate to the real centroid in the Euclidean space. These embodiments also preserve privacy in that, from block 712 onward, the vectors remain in Hamming space and the process is not required to go back and read the true Euclidean vectors (FIG. 7).
However, Ganguly does not explicitly teach the concept of "clustering the respective pluralities of per-time-step speaker representations for the plurality of time steps to generate a plurality of clusters of per-time-step speaker representations; wherein each cluster of the plurality of clusters corresponds to a different speaker and includes multiple different per-time-step speaker representations identified for the corresponding speaker at multiple different time steps; generating a plurality of per-recording speaker representations, wherein each per-recording speaker representation is a centroid of a different one of the plurality of clusters that each corresponds to a different speaker and includes multiple different per-time-step speaker representations that have been identified for the corresponding speaker at multiple different time steps" as suggested by the specific formula recited in claim 1 in combination with the claimed concept.
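
For context only, the kind of Hamming-space clustering loop attributed to Ganguly above (assign each bit-encoded vector to the nearest centroid by Hamming distance, recompute centroids, repeat until convergence or an iteration limit) can be sketched as follows. This is a rough sketch under stated assumptions, not Ganguly's disclosed algorithm: the function names, the encoding of bit vectors as Python integers, and the bitwise majority vote used to recompute a centroid are all hypothetical choices made for illustration.

```python
def hamming(a, b):
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")


def majority_centroid(members, n_bits):
    """Assumed centroid update: bitwise majority vote (ties resolve to 0)."""
    centroid = 0
    for bit in range(n_bits):
        ones = sum((m >> bit) & 1 for m in members)
        if 2 * ones > len(members):
            centroid |= 1 << bit
    return centroid


def hamming_kmeans(codes, init_centroids, n_bits, max_iters=100):
    """Iterate assignment and centroid update until the centroids stop changing."""
    centroids = list(init_centroids)
    assignments = []
    for _ in range(max_iters):
        # Assignment step: nearest centroid by Hamming distance.
        assignments = [
            min(range(len(centroids)), key=lambda k: hamming(code, centroids[k]))
            for code in codes
        ]
        # Update step: recompute each centroid from its current members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [c for c, a in zip(codes, assignments) if a == k]
            new_centroids.append(
                majority_centroid(members, n_bits) if members else centroids[k]
            )
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, assignments


# Hypothetical 4-bit codes forming two obvious groups.
codes = [0b0000, 0b0001, 0b0010, 0b1111, 0b1110, 0b1101]
centroids, assignments = hamming_kmeans(codes, [0b0000, 0b1111], n_bits=4)
```

The loop operates on bit-encoded vectors only and never consults the underlying Euclidean vectors, and it contains no per-time-step speaker representations or per-recording speaker centroids of the kind recited in claim 1.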
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RICHA MISHRA whose telephone number is (571)272-5357. The examiner can normally be reached M-T 7AM - 5:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Benny Tieu, can be reached at (571)272-7490. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/RICHA MISHRA/
Primary Examiner, Art Unit 2674