DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 4 and 16-17 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Droppo et al. (US 2018/0254040).

Claims 1 and 16,
Droppo teaches an estimation device comprising: a memory; and processing circuitry coupled to the memory and configured to: receive an input of an input audio signal that is an audio signal in which sounds from a plurality of sound sources are mixed, and an input of supplemental information, and output an estimation result of mask information that identifies a mask for extracting a sound of any one ([Figs. 1-3] [0022] [0026] [0035-0041] [0047-0048] employing permutation invariant training (PIT) for speech separation that functions for independent talkers in a multi-talker signal; a number of frames of feature vectors 202 of the mixed signal are used as the input to deep neural networks ("DNNs") to generate one frame of masks for each talker; a mask 1 206 and a mask 2 208 can be generated for each talker; the correct reference or target magnitude must be provided to the corresponding output layers for supervision; a model can have multiple output layers, at least one for each mixing source, and these output layers correspond to the same input mixture; first estimate a set of masks \hat{M}_S(t, f) using a deep learning model and reconstruct the magnitude spectra; the use of masks allows the estimation to be constrained, as masks may be invariant to input variabilities caused by energy difference; the mixed speech set can be sent together with the masks of source streams as a combined input 212; the first stage's separation results are used to inform the second-stage model to make a better decision; the input to the second stage includes the mixed speech and the separated speech streams from the first stage; the final mask is the average of that from the first stage, using only mixed speech as the input, and that from the second stage, which uses augmented features as the input; \hat{M}_S^{(1)} and \hat{M}_S^{(2)} refer to estimated masks for use in the first and second stages of the above-described separation and tracing steps; specifically, the example equation for \hat{M}_S^{(1)} refers to the generation of a first mask by applying a first LSTM to the magnitude of the mixed spectrum in question (|Y|); the example equation for \hat{M}_S^{(2)} refers to the generation of a second mask by applying a second LSTM to the magnitude of the mixed spectrum in question (|Y|) as well as the element-wise product of the first mask results and the magnitude of the mixed spectrum in question (|Y|)).
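
For illustration only, the two-stage mask estimation cited above can be sketched as follows. This is a minimal sketch under stated assumptions and is not Droppo's implementation: the names two_stage_mask, apply_mask, toy_stage1 and toy_stage2 are hypothetical, the two stage functions are fixed stand-ins for the first and second LSTMs, and only a single source is shown.

```python
def apply_mask(mask, magnitude):
    """Element-wise product of a mask and a magnitude spectrum |Y|."""
    return [m * y for m, y in zip(mask, magnitude)]


def two_stage_mask(magnitude, stage1, stage2):
    """Average the first-stage and second-stage mask estimates.

    stage1 sees only the mixture magnitude |Y|; stage2 sees |Y| together
    with the first-stage separated estimate (first mask times |Y|).
    """
    first_mask = stage1(magnitude)
    first_estimate = apply_mask(first_mask, magnitude)
    second_mask = stage2(magnitude, first_estimate)
    return [(a + b) / 2.0 for a, b in zip(first_mask, second_mask)]


# Hypothetical stand-ins for the trained models (assumption: real systems
# would use learned networks here, not constants).
def toy_stage1(magnitude):
    return [0.5 for _ in magnitude]


def toy_stage2(magnitude, first_estimate):
    return [0.25 for _ in magnitude]


mixture = [2.0, 1.0, 0.5]  # toy magnitude spectrum |Y|
final_mask = two_stage_mask(mixture, toy_stage1, toy_stage2)
separated = apply_mask(final_mask, mixture)
```

In this toy case the final mask is the mean of the two constant stage outputs, and the separated magnitude is that mask multiplied element-wise by |Y|.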

Claims 4 and 17,
Droppo teaches a learning device comprising: a memory; and processing circuitry coupled to the memory and configured to: receive an input of a training input audio signal that is an audio signal in which sounds from a plurality of sound sources are mixed, and an input of supplemental information, and output an estimation result of mask information that identifies a mask for extracting a sound of any one of the sound sources included in an entire or a part of a signal identified by the supplemental information, the signal being included in the training input audio signal, cause a neural network to iterate a process of outputting the estimation result of the mask information, update parameters of the neural network, based on a result of a comparison between information corresponding to the estimation result of the mask information obtained by the neural network, and information corresponding to correct answer mask information given in advance for the training input audio signal, and cause the neural network to output an estimation result of the mask information for a different sound source, by inputting a different piece of the supplemental information to the neural network at each iteration ([Figs. 1-3] [0022] [0026] [0035-0041] [0047-0048] employing permutation invariant training (PIT) for speech separation that functions for independent talkers in a multi-talker signal; a number of frames of feature vectors 202 of the mixed signal are used as the input to deep neural networks ("DNNs") to generate one frame of masks for each talker; a mask 1 206 and a mask 2 208 can be generated for each talker; the correct reference or target magnitude must be provided to the corresponding output layers for supervision; a model can have multiple output layers, at least one for each mixing source, and these output layers correspond to the same input mixture; first estimate a set of masks \hat{M}_S(t, f) using a deep learning model and reconstruct the magnitude spectra; the use of masks allows the estimation to be constrained, as masks may be invariant to input variabilities caused by energy difference; the mixed speech set can be sent together with the masks of source streams as a combined input 212; a first computation is made comparing the pairwise MSE scores 216 between each reference, e.g. cleaned speech 1 218 and cleaned speech 2 220, and the estimated source, e.g. clean speech 1 222 and clean speech 2 224; a determination is made of the possible assignments between the references and the estimated sources, and the total MSE is computed for each assignment, e.g. error assignment 1 226 and error assignment 2 228; the assignments can be compared and the assignment with the least MSE chosen as the assignment with a minimum error 230; the first stage's separation results are used to inform the second-stage model to make a better decision; the input to the second stage includes the mixed speech and the separated speech streams from the first stage; the final mask is the average of that from the first stage, using only mixed speech as the input, and that from the second stage, which uses augmented features as the input; \hat{M}_S^{(1)} and \hat{M}_S^{(2)} refer to estimated masks for use in the first and second stages of the above-described separation and tracing steps; specifically, the example equation for \hat{M}_S^{(1)} refers to the generation of a first mask by applying a first LSTM to the magnitude of the mixed spectrum in question (|Y|); the example equation for \hat{M}_S^{(2)} refers to the generation of a second mask by applying a second LSTM to the magnitude of the mixed spectrum in question (|Y|) as well as the element-wise product of the first mask results and the magnitude of the mixed spectrum in question (|Y|)).
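
For illustration only, the minimum-error assignment step of permutation invariant training cited above (pairwise MSE scores, a total MSE per assignment, and selection of the assignment with the least MSE) can be sketched as follows. This is a minimal sketch under stated assumptions and is not Droppo's implementation: the names mse and pit_assignment and the toy data are hypothetical, and the numbers of references and estimated sources are assumed equal.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_assignment(references, estimates):
    """Return (best_permutation, total_mse) with the least total MSE.

    best_permutation[i] is the index of the estimate assigned to
    reference i.
    """
    # Pairwise MSE scores between each reference and each estimated source.
    pairwise = [[mse(r, e) for e in estimates] for r in references]

    best_perm, best_total = None, float("inf")
    # Enumerate every possible assignment and keep the minimum-error one.
    for perm in permutations(range(len(estimates))):
        total = sum(pairwise[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total


# Toy example: the estimates come out in the opposite order to the references.
refs = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
ests = [[0.1, 0.9, 0.1], [0.9, 0.1, 0.9]]
perm, total = pit_assignment(refs, ests)
```

In this toy case the swapped assignment has the lower total error, so reference 0 is paired with estimate 1 and reference 1 with estimate 0. Exhaustive enumeration grows factorially with the number of sources and is shown here only because it is easy to follow for two talkers.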

Allowable Subject Matter
Claims 2-3 and 5-13 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
The prior art made of record is considered pertinent to applicant's disclosure. See PTO-892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHREYANS A PATEL whose telephone number is (571)270-0689. The examiner can normally be reached Monday-Friday 8am-5pm PST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

SHREYANS A. PATEL
Examiner
Art Unit 2657



/SHREYANS A PATEL/               Examiner, Art Unit 2656