DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 10/05/2021 has been entered.

Response to Amendment
Claims 1, 2, 5, 7, 10, 12, 15-17, and 20-22 are amended. Claims 4 and 14 are cancelled. Claims 23 and 24 are added. Claims 1-3, 5-13, 15-24 are presented for examination.
Response to Arguments
Applicant's arguments filed on 10/5/2021 have been reviewed. The following is the response to applicant's arguments:
Applicant argues that “Mesgarani's reconstruction mask does not read on and cannot be considered the claimed "separation neural network" because Mesgarani's mask is not a neural network.” However, Mesgarani teaches the concept of deep LSTM and neural-network-based mask reconstruction (Para 0109-0110, 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-3, 5-6, 8-13, 15-16, 18-21 and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Mesgarani (US Pub: 20190066713) and further in view of LeRoux (US Pub: 20190318725).
Regarding claim 1, Mesgarani teaches a method performed by one or more computers (computers, Fig 10, Para 0244), the method comprising: obtaining a recording comprising speech from a plurality of speakers (input mixture signal, Para 0135; combined signal from the plurality of speakers, Para 0107); processing the recording using a speaker neural network having speaker parameter values (mixture signal is an input to the LSTM as neural signal, Para 0135, 0153-0156), wherein the speaker neural network is configured to process the recording in accordance with the speaker parameter values (parameter at time steps, Para 0223-0224) to generate, for each of a plurality of time steps in a time period that the recording spans (process each time step, Para 0181, 0191), a respective plurality of per-time-step speaker representations for the time step, wherein each per-time-step speaker representation of the plurality of per-time-step speaker representations represents features of a respective identified speaker in the recording for the time step (This representation is then used to estimate a multiplicative function (mask) for each source and for each encoder output at each time step. The source waveforms are then reconstructed by transforming the masked encoder features using a linear decoder module, Para 0181, Fig 17); generating a plurality of per-recording speaker representations, wherein each per-recording speaker representation is a centroid of a different one of the plurality of clusters and represents features of a respective identified speaker in the recording (attractor point of each source is created using the centroid, Para 0081, 0089, 0223-0224); and processing the per-recording speaker representations and the recording using a separation neural network (clean speech is produced, Fig 10, 26, Fig 13-17) having separation parameter values and configured to process the recording and the plurality of per-recording speaker representations (clean target spectrogram, Para 0087; Para 0225) that each is a centroid of a different one of the plurality of clusters of the respective pluralities of per-time-step speaker representations generated by the speaker neural network in accordance with the separation parameter values to generate, for each per-recording speaker representation (Fig 10 and Fig 26, Next, a reconstruction mask is estimated (at box 1040) for each source by finding the similarity of each T-F bin in the embedding space to each of the attractor vectors A, where the similarity metric is defined in Equation 8. This particular metric uses the inner product followed by a sigmoid function which monotonically scales the masks between [0, 1]. Intuitively, if an embedding of a T-F bin is closer to one attractor, then it means that it belongs to that source, and the resulting mask for that source will produce larger values for that T-F bin. Since the source separation masks for each TF bin should add up to one, particularly in difficult conditions, the sigmoid function (of Equation 8) can be replaced with softmax function: $M_{f,t,c} = \mathrm{Softmax}\left(\sum_{k} A_{c,k} \times V_{ft,k}\right)$, Para 0138; also from Para 0165), a respective predicted isolated audio signal that corresponds to speech of one of the speakers in the recording (clean speech using the centroid, Para 0223-0027); wherein the neural network comprises a stack of neural network blocks (stack neural network, Para 0187, 0241), a first neural network block in the stack configured to receive as input the recording and the plurality of per-recording speaker representations that each is a centroid of a different one of the plurality of clusters of the respective pluralities of per-time-step speaker representations generated by the speaker neural network (input recording (1002) and the mask 1040, Fig 10, Fig 14, Fig 12; attractor points are used).
Mesgarani does not explicitly teach a separation neural network, wherein the separation neural network comprises a stack of neural network blocks, and clustering the respective pluralities of per-time-step speaker representations for the plurality of time steps to generate a plurality of clusters of per-time-step speaker representations.
However, LeRoux teaches a separation neural network (separation model, Fig 11, Fig 9, Para 0072-0082), wherein the separation neural network comprises a stack of neural network blocks (stack neural network, Para 0096), and clustering the respective pluralities of per-time-step speaker representations for the plurality of time steps to generate a plurality of clusters of per-time-step speaker representations (The speaker assignment of each T-F unit can thus be inferred from the embeddings by simple clustering algorithms, to produce masks that isolate each single speaker, Para 0009, 0077, 0080, Fig 11A-B; neural networks are stacked, Para 0096).
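
For illustration only, the softmax mask quoted from Mesgarani (Para 0138) in the claim 1 discussion above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not code from Mesgarani or LeRoux: the function names `softmax` and `estimate_masks`, the list-of-lists data layout, and the toy values are all hypothetical. Each T-F bin embedding V is compared with each attractor A by inner product, and the softmax is taken across sources so that the masks for one bin sum to one.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def estimate_masks(attractors, embeddings):
    """Return one mask value per (T-F bin, source).

    attractors: C vectors of length K (hypothetical layout, one per source).
    embeddings: N vectors of length K (one per T-F bin).
    """
    masks = []
    for v in embeddings:
        # Inner product between this bin's embedding and every attractor.
        scores = [sum(a_k * v_k for a_k, v_k in zip(a, v)) for a in attractors]
        # Softmax across sources, so the masks for this bin add up to one.
        masks.append(softmax(scores))
    return masks

# Toy values chosen for illustration only.
attractors = [[1.0, 0.0], [0.0, 1.0]]
embeddings = [[2.0, 0.1], [0.2, 1.5]]
masks = estimate_masks(attractors, embeddings)
```

In this toy case the first bin lies closer to the first attractor and therefore receives the larger mask value for the first source, which matches the intuition stated in the quoted passage.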

Regarding claim 2, Mesgarani as above in claim 1, teaches further comprising jointly training the speaker neural network and the separation neural network, comprising: computing an error between (i) predicted isolated audio signals generated by the separation neural network using predicted per-recording speaker representations of speakers in an input recording and generated from the speaker neural network (error is calculated based on masked signal and clean reference, Para 0139), and (ii) ground-truth audio signals each corresponding to isolated speech of one of the speakers in the input recording and in accordance with an objective function (ground truth binary mask, Para 0144); and updating the speaker parameter values of the speaker neural network and the separation parameter values of the separation neural network in accordance with the computed error (lowest error to update the network, Para 0134, 0144).

Regarding claim 3, Mesgarani as above in claim 2, teaches jointly training (joint learning, Para 0220-0225) the speaker neural network and the separation neural network further comprises: for each time-step of the recording, generating a plurality of per-time-step speaker representations, each per-time-step speaker representation representing features of a respective identified speaker in the recording at the time-step (updating the attractor point etc., Para 0221-0222, 0181, 0217).

Regarding claim 5, Mesgarani modified by LeRoux as above in claim 1, teaches wherein clustering the respective pluralities of per-time-step speaker representations for the plurality of time steps comprises performing k-means clustering on the respective pluralities of per-time-step speaker representations (each time frame (time frequency analogous to time step in separate domain) using k-means clustering, Para 0077, 0080, LeRoux).
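
For illustration only, the k-means step discussed for claim 5 can be sketched as below: per-time-step vectors are grouped into k clusters, and each cluster centroid then serves as one per-recording representation. This is a generic k-means sketch under stated assumptions and is not taken from LeRoux or Mesgarani; the function name `kmeans_centroids`, the deterministic initialization from the first k points, the fixed iteration count, and the toy data are all hypothetical.

```python
def kmeans_centroids(points, k, iters=20):
    """Cluster equal-length vectors and return the k centroids."""
    # Assumption: initialize from the first k points so the result is repeatable.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the nearest centroid (squared Euclidean distance).
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Toy per-time-step representations: two well-separated groups.
per_step = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
centroids = kmeans_centroids(per_step, k=2)
```

With these toy values the two centroids settle near (0.1, 0.05) and (5.1, 5.0), one per group.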
Regarding claim 6, Mesgarani as above in claim 1, teaches wherein at least the speaker neural network has been trained on training data defining first recordings and a second recording, wherein the second recording comprises segments of audio from the first recordings (mixture models, Fig 10; training the deep neural network with different speakers, Para 0089-0092, 0130-0135).


Regarding claim 8, Mesgarani as above in claim 1, teaches wherein one or both of the speaker neural network and the separation neural network is a convolutional neural network (convolutional neural network, Para 0158, 0163).
Regarding claim 9, Mesgarani as above in claim 8, teaches wherein one or both of the speaker neural network and the separation neural network is a dilated convolutional neural network (dilated convolutions, Para 0009, 0179-0180).

Regarding claim 10, arguments analogous to those for claim 1 are applicable. In addition, Mesgarani teaches obtaining a data set (datasets, Para 0196, 0148, 0170) and identifying a source (speaker could be a source or a person, Para 0180-0182).
Regarding claim 11, Mesgarani as above in claim 10, teaches wherein the data set comprises at least one of audio data, image data or video data (speech/sound datasets, Para 0218, 0277, 0170).
Regarding claim 20, arguments analogous to those for claim 10 are applicable. In addition, Mesgarani teaches a system comprising one or more computers and one or more storage devices on which are stored instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform the method of claim 10 (Para 0034).
Regarding claim 12, arguments analogous to those for claim 2 are applicable.
Regarding claim 13, arguments analogous to those for claim 3 are applicable.
Regarding claim 14, arguments analogous to those for claim 4 are applicable.
Regarding claim 15, arguments analogous to those for claim 5 are applicable.
Regarding claim 16, arguments analogous to those for claim 6 are applicable.
Regarding claim 18, arguments analogous to those for claim 8 are applicable.
Regarding claim 19, arguments analogous to those for claim 9 are applicable.


Regarding claim 21, Mesgarani as above in claim 1, teaches wherein the stack of neural network blocks comprises a stack of residual convolutional blocks (identity residual connection, Para 0189), each block of the stack of residual convolutional blocks includes one or more convolutional layers (conv block with conv layers, Para 0189-0190).

Regarding claim 23, arguments analogous to those for claim 21 are applicable.


Claims 22 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Mesgarani (US Pub: 20190066713) and further in view of LeRoux (US Pub: 20190318725) and further in view of Yi (Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation).

Regarding claim 22, Mesgarani modified by LeRoux as above in claim 21, does not explicitly teach wherein the one or more convolutional layers comprises a first convolutional layer, wherein each block of the stack of residual convolutional blocks is configured to process at least the plurality of per-recording speaker representations using feature-wise linear modulation, comprising: performing a linear transformation based on the plurality of per-recording speaker representations to generate a first vector and a second vector; performing an element-wise product operation between an output from the first convolutional layer and the first vector to generate an element-wise product; and adding the second vector with the element-wise product to generate an input for a non-linearity operation.
However, Yi teaches wherein the one or more convolutional layers comprises a first convolutional layer, wherein each block of the stack of residual convolutional blocks is configured to process at least the speaker representations using feature-wise linear modulation, comprising: performing a linear transformation based on the speaker representations to generate a first vector and a second vector (C vectors, Page 3, under C. Estimating separation mask); performing an element-wise product operation between an output from the first convolutional layer and the first vector to generate an element-wise product (mask (vector) and the mixture representation); and adding the second vector with the element-wise product to generate an input for a non-linearity operation (Page 3, Fig 1A-C; ReLU is added to ensure the non-negativity).
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, having the teachings of Mesgarani and LeRoux, to further include the concept of Yi in order to estimate the separation mask (Page 3, under C, Yi).
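
For illustration only, the sequence of operations recited in claim 22 (linear transformation to a first and a second vector, element-wise product with the convolutional output, addition of the second vector, then a non-linearity) can be sketched as below. This is a minimal sketch under stated assumptions and is not the implementation of the application, Mesgarani, LeRoux, or Yi: the names `linear` and `film_block`, the weight values, and the choice of ReLU as the non-linearity are all hypothetical.

```python
def linear(weights, bias, x):
    """Plain affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def film_block(conv_out, speaker_rep, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation followed by an assumed ReLU."""
    gamma = linear(w_gamma, b_gamma, speaker_rep)  # "first vector" (scale)
    beta = linear(w_beta, b_beta, speaker_rep)     # "second vector" (shift)
    # Element-wise product with the conv output, then add the second vector.
    modulated = [g * h + b for g, h, b in zip(gamma, conv_out, beta)]
    # Assumption: ReLU stands in for the unspecified non-linearity.
    return [max(0.0, m) for m in modulated]

# Toy values for illustration only.
conv_out = [1.0, -2.0]
speaker_rep = [0.5, 1.0]
out = film_block(
    conv_out, speaker_rep,
    w_gamma=[[1.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.0],
    w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[0.25, 0.5],
)
```

Here the scale vector equals the speaker representation and the shift vector is a constant, so the modulated features are 0.75 and -1.5 before the non-linearity.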

Regarding claim 24, arguments analogous to claim 23 are applicable. 

Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Mesgarani (US Pub: 20190066713) and further in view of LeRoux (US Pub: 20190318725) and further in view of Lu (US Pub: 20210074266).

Regarding claim 7, Mesgarani as above in claim 6, does not explicitly teach wherein the second recording further comprises a segment of audio from a first recording that has been augmented by modifying a gain for the segment according to a randomly sampled gain modifier.
However, Lu teaches wherein the second recording further comprises a segment of audio from a first recording of the first recordings that has been augmented by modifying a gain for the segment according to a randomly sampled gain modifier (The speech extraction model is trained by the following manner. The mixed audio training dataset and at least one set of predetermined gain compensation coefficients associated with at least one audiogram can be used as input to the input layer of the speech extraction model, and the compensated speech data corresponding to each mixed audio data frame in the mixed audio training dataset can be used as output to the output layer in the speech extraction model. In this way, the trained speech extraction model can have a weighting coefficient set and an offset coefficient set associated with each other, Para 0066-0068).
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, having the teachings of Mesgarani, to further include the concept of Lu in order to have an optimized dataset for training (Para 0004, 0029, Lu).
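
For illustration only, the augmentation recited in claim 7 (modifying the gain of an audio segment according to a randomly sampled gain modifier) can be sketched as below. This is a minimal sketch under stated assumptions and is not drawn from Lu, Mesgarani, or the application: the function name `augment_gain`, the uniform sampling in decibels, and the -5 dB to +5 dB range are all hypothetical.

```python
import random

def augment_gain(segment, low_db=-5.0, high_db=5.0, rng=None):
    """Scale a list of audio samples by a randomly sampled gain.

    Returns the scaled samples and the sampled gain in dB.
    """
    rng = rng or random.Random()
    gain_db = rng.uniform(low_db, high_db)  # randomly sampled gain modifier
    scale = 10 ** (gain_db / 20)            # convert dB to a linear amplitude factor
    return [s * scale for s in segment], gain_db

# A seeded generator makes the illustration repeatable.
segment = [0.1, -0.2, 0.05]
augmented, gain_db = augment_gain(segment, rng=random.Random(0))
```

Every sample in the segment is multiplied by the same factor, so the waveform shape is unchanged and only its level differs.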

Regarding claim 17, arguments analogous to those for claim 7 are applicable.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RICHA MISHRA whose telephone number is (571)272-5357. The examiner can normally be reached M-T 7AM - 5:30PM.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Benny Tieu, can be reached on (571)272-7490. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/RICHA MISHRA/Primary Examiner, Art Unit 2674