DETAILED ACTION
 
Introduction
1.	This office action is in response to Applicant’s submission filed on 12/06/2019. Claims 1-20 are pending in the application and have been examined.

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Drawings
3.         The drawings filed on 12/06/2019 have been accepted and considered by the Examiner.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

4.	Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. The claim is directed to a judicial exception without significantly more. In the test for patent subject matter eligibility, claim 1 is found to satisfy Step 1 (see 2019 Revised Patent Subject Matter Eligibility Guidance), as claim 1 is related to a process, machine, manufacture, or composition of matter. When assessed under Step 2A, Prong I, claim 1 is found to be directed towards an abstract idea. The rationale for this finding is explained below:
Independent claim 1 recites, e.g., “…system for separating audio based on sound producing objects…” Under Step 2A, Prong I, this claim is directed to an abstract idea without significantly more, as the claim recites a judicial exception. Claim 1, reciting “1. A system for separating audio based on sound producing objects, the system comprising: a processor configured to: receive video data and audio data; perform object detection using the video data to identify a number of sound producing objects in the video data; predict a separation for each sound producing object detected in the video data; generate separated audio data for each sound producing object using the separation and the audio data,” can be considered an abstract idea. Specifically, if a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. In this case, the claim recites a mental process that is used to, e.g., “…perform object detection using the video data to identify a number of sound producing objects in the video data; predict a separation for each sound producing object detected in the video data; generate separated audio data for each sound producing object using the separation and the audio data.” Thus, the claim recites a mental process. Note that, in this example, the “…perform object detection …,” “…identify a number of sound producing objects …,” and “…generate separated audio data for each sound producing object using the separation and the audio data …” steps are determined to recite a mental concept.
For example, person A can be listening to and visually observing what person B and person C are saying and doing. Person A can then detect/observe that person B is showing person C a video on his phone. While the video is playing, person B (e.g., saying “her name is Mary”) and person C (e.g., saying “that’s Mary”) repeat the same name (e.g., Mary) in different phrases. Person A immediately and verbally interrupts persons B and C and says, e.g., “Excuse me gentlemen, I could not help but hear and see, but I also know the person in the video you both are naming as Mary, she is a relative of mine.” Therefore, under Step 2A, Prong I, claim 1 recites an abstract idea.
Step 2A, Prong II, determines whether the claim recites an additional element that integrates the judicial exception (abstract idea) into a practical application. In this case, independent claim 1 does not appear to recite additional elements that integrate the abstract idea into a practical application. The mere instructions to implement an abstract idea as a tool to “…perform object detection …,” “…identify a number of sound producing objects …,” and “…generate separated audio data for each sound producing object using the separation and the audio data …” are observed in claim 1 as a mental process falling within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea. This judicial exception is not integrated into a practical application. In particular, the claim recites only one additional element, “…the system comprising: a processor configured to…” The processor in said limitations is recited at a high level of generality (i.e., as a generic processor performing a generic computer function) such that it amounts to no more than mere instructions to apply the exception using a generic computer component. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using a processor configured to “receive…perform…predict…generate…” amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Under Step 2A, Prong II, this claim is directed to an abstract idea.
Claim 1 does not recite any additional elements that result in the claim amounting to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional elements disclosed amount to no more than generic computer components. Per MPEP 2106.05(d), neither the additional elements, nor combination of additional elements, are found to be "other than what is well-understood, routine, conventional activity in the field, or simply append well-understood, routine, conventional activities previously known to the industry, specified at a high level of generality, to the judicial exception." This finding is further supported by Applicant's specification (See e.g., Applicant’s Specification at paras. 54, 82 “…Blocks can be configured to perform various operations, e.g., by programming a processor or providing appropriate control circuitry, and various blocks might or might not be reconfigurable depending on how the initial configuration is obtained. Implementations of the present disclosure can be realized in a variety of apparatus including electronic devices implemented using any combination of circuitry and software…” “…processor 404 can be implemented as a general purpose processor, an application specific integrated circuit (ASIC), one or more field programmable gate arrays (FPGAs ), a group of processing components, or other suitable electronic processing components…”), in which the recited components are seen as commonly used computer components specified at such a high level of generality that there is not significantly more than the judicial exception. Additionally, there is no improvement in the functioning of the computer or technological field, and there is no transformation of subject matter into a different state. Under Step 2B of the test for patent subject matter eligibility, Claim 1 is not patent eligible.


Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.



5.	Claim(s) 1-3, 6, and 15 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Gao et al., (Gao, R., Feris, R., & Grauman, K. (2018). Learning to Separate Object Sounds by Watching Unlabeled Video. arXiv preprint arXiv:1804.01665, with Publication date: July 26, 2018), already of record and as cited in IDS filed 12/06/2019, hereinafter referred to as GAO.
With respect to Claim 1, GAO discloses:
1. A system for separating audio based on sound producing objects, the system comprising: 
a processor configured to: 
receive video data and audio data (See e.g., unlabeled video having audio and visual inputs is received according to Fig. 2, See e.g., GAO §§ 3, 3.1-3.5, 4, 4.1, 4.2, Figs. 2-4); 
perform object detection using the video data to identify a number of sound producing objects in the video data (See e.g., “…image recognition tools to infer the objects present in each video clip, and we perform non-negative matrix factorization (NMF) on each video’s audio channel to recover its set of frequency basis vectors. …” See e.g., GAO §§ 3, 3.1-3.5, 4, 4.1, 4.2, Figs. 2-4); 

(See e.g., Fig. 2 from Gao et al.)
predict a separation for each sound producing object detected in the video data (See e.g., “…For each video, we perform NMF on its audio magnitude spectrogram to get M basis vectors. An ImageNet-trained ResNet-152 network is used to make visual predictions to find the potential objects present in the video…” and how “…deep multi-instance multi-label network takes a bag of M audio basis vectors for each video as input, and gives a bag-level prediction of the objects present in the audio. The visual predictions from an ImageNet-trained CNN are used as weak “labels” to train the network with unlabeled video…” See e.g., GAO §§ 3, 3.1-3.5, 4, 4.1, 4.2, Figs. 2-4); 

(See e.g., Fig. 3, from Gao et al.)
generate separated audio data for each sound producing object using the separation and the audio data (See e.g., “…detect the objects present in the visual frames, and retrieve their learnt audio bases. The bases are collected to form a fixed basis dictionary W with which to guide NMF factorization of the test video’s audio channel. The basis vectors and the learned activation scores from NMF are finally used to separate the sound for each detected object, respectively…” See e.g., GAO §§ 3, 3.1-3.5, 4, 4.1, 4.2, Figs. 2-4).

(See e.g., Fig. 4, from Gao et al.)

With respect to Claim 2, GAO discloses:
2. The system of claim 1, wherein the processor is further configured to predict the separation to minimize a co-separation loss (See e.g., “…factorization is usually obtained by solving the following minimization problem… For each unlabeled training video, we perform NMF independently on its audio magnitude spectrogram to obtain its spectral patterns W, and throw away the activation matrix H. M audio basis vectors are therefore extracted from each video…” and by “…us[ing] … multi-label hinge loss to train the network…,” See e.g., GAO §§ 3, 3.1-3.5, 4, 4.1, 4.2, Figs. 2-4).

With respect to Claim 3, GAO discloses:
3. The system of claim 2, wherein the processor is configured to: 
convert the audio data into a magnitude spectrogram (See e.g., “…mixture signal can be transformed into a magnitude or power spectrogram…,” See e.g., GAO §§ 3, 3.1-3.5, 4, 4.1, 4.2, Figs. 2-4); 
predict a spectrogram mask for each sound producing object as the separation (See e.g., “…reconstruct the individual (compressed) audio source signals by soft masking the mixture spectrogram… perform ISTFT…to reconstruct the audio signals for each detected object. If a detected object does not make sound, then its estimated activation scores will be low…phase can be seen as a self-supervised form of NMF, where the detected visual objects reveal which bases (previously discovered from unlabeled videos) are relevant to guide audio separation…,” See e.g., GAO §§ 3, 3.1-3.5, 4, 4.1, 4.2, Figs. 2-4); 
generate a separated spectrogram for each detected object using the spectrogram mask and the magnitude spectrogram (See e.g., “…reconstruct the individual (compressed) audio source signals by soft masking the mixture spectrogram… perform ISTFT…to reconstruct the audio signals for each detected object. If a detected object does not make sound, then its estimated activation scores will be low…phase can be seen as a self-supervised form of NMF, where the detected visual objects reveal which bases (previously discovered from unlabeled videos) are relevant to guide audio separation…,” See e.g., GAO §§ 3, 3.1-3.5, 4, 4.1, 4.2, Figs. 2-4); and 
convert each of the separated spectrograms to audio data (See e.g., “…Unsupervised training pipeline. For each video, we perform NMF on its audio magnitude spectrogram to get M basis vectors…,” See e.g., GAO §§ 3, 3.1-3.5, 4, 4.1, 4.2, Figs. 2-4).
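
For illustration only, the soft-masking reconstruction quoted from GAO in the mapping of Claim 3 above can be sketched as follows. This is a minimal sketch and not GAO's implementation: it assumes a magnitude spectrogram V is already available (the STFT and ISTFT steps are omitted), uses standard multiplicative updates for a Euclidean NMF objective as a stand-in for whatever objective GAO uses, and the function names and the `groups` argument are hypothetical.

```python
import numpy as np


def estimate_activations(V, W, n_iter=200, eps=1e-9):
    """Estimate nonnegative activations H such that V ~= W @ H, with W held fixed.

    Assumption: multiplicative updates for the Euclidean NMF objective.
    """
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative update keeps H nonnegative at every step.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H


def soft_mask_separate(V, W, groups, n_iter=200, eps=1e-9):
    """Split magnitude spectrogram V into one spectrogram per object.

    groups maps a (hypothetical) object name to the column indices of W
    that hold that object's basis vectors.
    """
    H = estimate_activations(V, W, n_iter, eps)
    full = W @ H + eps  # reconstruction from all bases
    separated = {}
    for name, idx in groups.items():
        # Soft mask: this object's share of the full reconstruction.
        mask = (W[:, idx] @ H[idx, :]) / full
        separated[name] = mask * V
    return separated
```

When the groups partition the columns of W, the per-object masks sum to approximately one in every time-frequency bin, so the separated spectrograms add back up to the mixture spectrogram.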

With respect to Claim 6, GAO discloses:
6. The system of claim 1, wherein the processor is further configured to use a neural network to predict the separation and generate the separation each detected object, wherein the neural network is trained using a plurality of sets of the video data and the audio data (See e.g., “…For each video, we perform NMF on its audio magnitude spectrogram to get M basis vectors. An ImageNet-trained ResNet-152 network is used to make visual predictions to find the potential objects present in the video…” and how “…deep multi-instance multi-label network takes a bag of M audio basis vectors for each video as input, and gives a bag-level prediction of the objects present in the audio. The visual predictions from an ImageNet-trained CNN are used as weak “labels” to train the network with unlabeled video…,” See e.g., GAO §§ 3, 3.1-3.5, 4, 4.1, 4.2, Figs. 2-4).

(See e.g., Fig. 3, from Gao et al.)

With respect to Claim 15, GAO discloses:
15. A method for training a neural network to separate audio based on objects present in an associated video, the method comprising: 
receiving a plurality of training data, wherein each training data comprises one or more sets of audio data and associated video data (See e.g., “…Unsupervised training pipeline. For each video, we perform NMF on its audio magnitude spectrogram to get M basis vectors. An ImageNet-trained ResNet-152 network is used to make visual predictions to find the potential objects present in the video. Finally, we perform multi-instance multi-label learning to disentangle which extracted audio basis vectors go with which detected visible object(s)….” See e.g., GAO §§ 3, 3.1-3.5, 4, 4.1, 4.2, Figs. 2-4): 
for each training data: 
performing object detection on the video data of the one or more sets to detect one or more sound producing objects of the video data (See e.g., “…perform NMF on its audio magnitude spectrogram to get M basis vectors. An ImageNet-trained ResNet-152 network is used to make visual predictions to find the potential objects present in the video. Finally, we perform multi-instance multi-label learning to disentangle which extracted audio basis vectors go with which detected visible object(s)…,” See e.g., GAO §§ 3, 3.1-3.5, 4, 4.1, 4.2, Figs. 2-4); and 
mixing the audio data of the one or more sets to generate mixed audio data (See e.g., “…we have a set of audio bases for each visual object, discovered purely from unlabeled video and mixed single-channel audio)…,” See e.g., GAO §§ 3, 3.1-3.5, 4, 4.1, 4.2, Figs. 2-4); and 
training the neural network using the plurality of training data to predict a separation for each sound producing object that minimizes a combined loss function (See e.g., “…factorization is usually obtained by solving the following minimization problem… For each unlabeled training video, we perform NMF independently on its audio magnitude spectrogram to obtain its spectral patterns W, and throw away the activation matrix H. M audio basis vectors are therefore extracted from each video…” and by “…us[ing] … multi-label hinge loss to train the network…,” See e.g., GAO §§ 3, 3.1-3.5, 4, 4.1, 4.2, Figs. 2-4).



Allowable Subject Matter
6.	Claims 4, 5, and 16-20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
7.	Claims 7-14 would be allowable over the prior art of record for at least the following reasons. Notwithstanding the teachings of Gao et al. (GAO) discussed above, GAO is respectfully found to fail to teach or fairly suggest, either individually or in a reasonable combination, the limitations of independent Claim 7 specifically reciting “…using a neural network to predict a separation for each object detected in the one or more sets, wherein the separation minimizes a co-separation loss and a consistency loss; and generating separated audio data for each of the sound producing objects using the separations predicted by the neural network and the mixed audio data.”
Similarly, dependent Claims 8-14 further limit allowable independent Claim 7, and thus would also be found allowable over the prior art of record by virtue of their dependency.

Conclusion
8.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Arandjelovic et al. (Arandjelovic, R., & Zisserman, A. (2017). Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision (pp. 609-617)) addresses, see e.g., “…the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself – the correspondence between the visual and the audio streams, and we introduce a novel “Audio-Visual Correspondence” learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task, and, more interestingly, result in good visual and audio representations. These features set the new state-of-the-art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks. …” (See e.g., Arandjelovic et al., Abstract).
Please see PTO-892 for more details.
9.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to Edgar Guerra-Erazo, whose telephone number is (571) 270-3708. The examiner can normally be reached M-F, 7:30 a.m.-5:00 p.m. EST. If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Bhavesh Mehta, can be reached at (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is (571) 273-8300.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/EDGAR X GUERRA-ERAZO/
Primary Examiner, Art Unit 2656