Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings
The drawings were received on 4/14/2020.  These drawings are accepted.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claim(s) 1, 5, 6, 8, 10, 14, 15, 17, 19 is/are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Lakhdhar et al (US Publication No.: 20200322377).
Claim 1, Lakhdhar et al discloses 
Preamble: A method for training a Sound Effect Recommendation Network (Fig. 4), comprising: 
a) generating a positive audio embedding from a positive audio signal (Paragraph 67 discloses “the respective inputs 410,420,430 of the first, second and third feed-forward neural networks … each of the audio signals can be preprocessed by the PCEN frontend to generate a two dimensional representation ( or “image”) of the audio signal.” Paragraph 69 discloses “a set of positive audio samples” from “the plurality of audio signals”. Paragraph 76 discloses EVx+ as the embedded vector for the positive audio signal.), wherein the positive audio signal is related to a reference image (Paragraph 69 discloses “The images can be representations of the plurality of audio signals. … the images corresponding to the set of anchor audio signals can be provided to the second neural network …”); 
b) generating a negative audio embedding from a negative audio signal (Paragraph 69 discloses “a set of negative audio samples”. Paragraph 76 discloses EVx- stands for the embedded vector for the negative audio signal. Paragraph 67 discloses “the respective inputs 410,420,430 of the first, second and third feed-forward neural networks … each of the audio signals can be preprocessed by the PCEN frontend to generate a two dimensional representation ( or “image”) of the audio signal.”); 
c) using a machine learning algorithm (Fig. 4 shows the neural network that is used to train the PCEN frontend shown in Fig. 2b and that trains itself (paragraph 77).) with the positive audio embedding (Paragraph 69 discloses positive audio samples as an input to the neural network.) and the negative audio embedding (Paragraph 69 discloses negative audio samples as an input to the neural network.) as inputs to train a visual-to-audio correlation neural network (Paragraph 69 discloses that positive samples, negative samples and anchor samples are input into the neural network. Fig. 4 shows the visual-to-audio correlation neural network, which includes the machine learning algorithm that uses backpropagation to train itself (paragraph 77).) to output a smaller distance between the positive audio embedding and the reference than the negative audio embedding and the reference (Paragraph 77 discloses “After a given batch of training samples are processed, a loss function may be calculated based on the respective outputs 414, 424, 434 of the first, second, and third feed-forward neural networks 412, 422, 432. The computed loss function may be used to train the respective neural networks 412, 422, 432 of the feature extraction block and the PCEN frontend using a backpropagation algorithm with a “stochastic gradient descent” optimizer, which aims at computing the gradient of the loss function with respect to all the weights in the respective neural networks (the PCEN frontend neural networks and the feature extraction block neural network). The goal of the optimizer is to update the weights, in order to minimize the loss function. However, it is also contemplated that other types of backpropagation algorithms may be used. In the example of FIG. 4, the loss function can be used to update the connection weights in each of the first convolutional layer, the second convolutional layer, and the fully connected layer.” By minimizing the loss function, the network outputs a smaller distance between the positive audio embedding and the reference image or anchor than between the negative audio embedding and the reference image or anchor.).
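For illustration only, the relationship between minimizing a loss and obtaining a smaller anchor-to-positive distance can be sketched with a margin-based triplet loss. This is a minimal sketch under stated assumptions, not the implementation of Lakhdhar et al: the function names, the margin of 1.0, the learning rate, and the two-dimensional example vectors are all hypothetical, and the gradient steps are applied directly to the embeddings rather than to network weights.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin)


def sgd_step(anchor, positive, negative, margin=1.0, lr=0.1):
    """One gradient step on the positive and negative embeddings (anchor held fixed).

    Simplifying assumption: embeddings are updated directly; a real network
    would backpropagate the same loss into its weights.
    """
    if triplet_loss(anchor, positive, negative, margin) == 0.0:
        return positive, negative  # loss already minimized, gradient is zero
    # d/dp ||a - p||^2 = 2 (p - a): pulls the positive toward the anchor
    new_pos = [p - lr * 2.0 * (p - a) for a, p in zip(anchor, positive)]
    # d/dn (-||a - n||^2) = -2 (n - a): pushes the negative away from the anchor
    new_neg = [n + lr * 2.0 * (n - a) for a, n in zip(anchor, negative)]
    return new_pos, new_neg


# Hypothetical starting point: the positive is farther from the anchor than the negative.
anchor = [0.0, 0.0]
positive = [2.0, 0.0]
negative = [1.0, 0.0]

for _ in range(50):
    positive, negative = sgd_step(anchor, positive, negative)

print(triplet_loss(anchor, positive, negative))
print(squared_distance(anchor, positive) < squared_distance(anchor, negative))
```

In this sketch the loss can only reach zero when the positive embedding is closer to the anchor than the negative embedding, which is the ordering of distances recited in limitation c).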
Claims 5 and 14, Lakhdhar et al discloses that audio features are extracted from the positive signal and the negative signal before being used in training with the machine learning algorithm (Paragraph 76 discloses an embedded vector for each of the positive audio signal and the negative audio signal. Paragraph 77 discloses that a loss function is minimized in order to update the weights of the neural network shown in Fig. 4. Fig. 4 shows that the loss function is computed after the embedded vectors are generated as disclosed in paragraph 76.).
Claims 6 and 15, Lakhdhar et al discloses the positive audio signal (paragraph 69) includes noise (Paragraph 56 discloses that the audio signal includes foreground loudness and stationary background noise. Paragraph 69 discloses that the positive audio samples are included in the audio signals. This indicates that the positive audio samples include noise, since the audio signal includes foreground loudness and stationary background noise.).
Claims 8 and 17, Lakhdhar et al discloses
the positive signal is part of an audio/video sequence that includes the reference image signal (Paragraphs 68-69 disclose that the positive signals are included in the audio signals represented by images and that the image for each audio signal has a size. The reference image signal is read as the anchor audio samples or signals.),
wherein the positive signal includes noise signals (Paragraph 56 discloses that the audio signal includes foreground loudness and stationary background noise. Paragraph 69 discloses that the positive audio samples are included in the audio signals. This indicates that the positive audio samples include noise, since the audio signal includes foreground loudness and stationary background noise.) and
wherein the noise signals are other sounds occurring in the audio/video sequence (Paragraph 56 discloses background noise in the audio signal. Paragraphs 68-69 disclose images that represent a plurality of audio signals.) and
wherein the negative signal includes noise signals (Paragraph 56 discloses that the audio signal includes foreground loudness and stationary background noise. Paragraph 69 discloses that the negative audio samples are included in the audio signals. This indicates that the negative audio samples include noise, since the audio signal includes foreground loudness and stationary background noise.).
Claim 10, Lakhdhar et al discloses 
Preamble: A system for training a Sound Effect Recommendation Network (Fig. 4), comprising: 
a processor (paragraph 91); 
a memory coupled to the processor (paragraph 91);
non-transitory instructions embedded in the memory that when executed cause the processor to carry out the method for training a sound effect recommendation network (Fig. 4, paragraphs 89-91) comprising:
a) generating a positive audio embedding from a positive audio signal (Paragraph 67 discloses “the respective inputs 410,420,430 of the first, second and third feed-forward neural networks … each of the audio signals can be preprocessed by the PCEN frontend to generate a two dimensional representation ( or “image”) of the audio signal.” Paragraph 69 discloses “a set of positive audio samples” from “the plurality of audio signals”. Paragraph 76 discloses EVx+ as the embedded vector for the positive audio signal.), wherein the positive audio signal is related to a reference image (Paragraph 69 discloses “The images can be representations of the plurality of audio signals. … the images corresponding to the set of anchor audio signals can be provided to the second neural network …”); 
b) generating a negative audio embedding from a negative audio signal (Paragraph 69 discloses “a set of negative audio samples”. Paragraph 76 discloses EVx- stands for the embedded vector for the negative audio signal. Paragraph 67 discloses “the respective inputs 410,420,430 of the first, second and third feed-forward neural networks … each of the audio signals can be preprocessed by the PCEN frontend to generate a two dimensional representation ( or “image”) of the audio signal.”); 
c) using a machine learning algorithm (Fig. 4 shows the neural network that is used to train the PCEN frontend shown in Fig. 2b and that trains itself (paragraph 77).) with the positive audio embedding (Paragraph 69 discloses positive audio samples as an input to the neural network.) and the negative audio embedding (Paragraph 69 discloses negative audio samples as an input to the neural network.) as inputs to train a visual-to-audio correlation neural network (Paragraph 69 discloses that positive samples, negative samples and anchor samples are input into the neural network. Fig. 4 shows the visual-to-audio correlation neural network, which includes the machine learning algorithm that uses backpropagation to train itself (paragraph 77).) to output a smaller distance between the positive audio embedding and the reference than the negative audio embedding and the reference (Paragraph 77 discloses “After a given batch of training samples are processed, a loss function may be calculated based on the respective outputs 414, 424, 434 of the first, second, and third feed-forward neural networks 412, 422, 432. The computed loss function may be used to train the respective neural networks 412, 422, 432 of the feature extraction block and the PCEN frontend using a backpropagation algorithm with a “stochastic gradient descent” optimizer, which aims at computing the gradient of the loss function with respect to all the weights in the respective neural networks (the PCEN frontend neural networks and the feature extraction block neural network). The goal of the optimizer is to update the weights, in order to minimize the loss function. However, it is also contemplated that other types of backpropagation algorithms may be used. In the example of FIG. 4, the loss function can be used to update the connection weights in each of the first convolutional layer, the second convolutional layer, and the fully connected layer.” By minimizing the loss function, the network outputs a smaller distance between the positive audio embedding and the reference image or anchor than between the negative audio embedding and the reference image or anchor.).
Claim 19, Lakhdhar et al discloses 
 a) generating a positive audio embedding from a positive audio signal (Paragraph 67 discloses “the respective inputs 410,420,430 of the first, second and third feed-forward neural networks … each of the audio signals can be preprocessed by the PCEN frontend to generate a two dimensional representation ( or “image”) of the audio signal.” Paragraph 69 discloses “a set of positive audio samples” from “the plurality of audio signals”. Paragraph 76 discloses EVx+ as the embedded vector for the positive audio signal.), wherein the positive audio signal is related to a reference image (Paragraph 69 discloses “The images can be representations of the plurality of audio signals. … the images corresponding to the set of anchor audio signals can be provided to the second neural network …”); 
b) generating a negative audio embedding from a negative audio signal (Paragraph 69 discloses “a set of negative audio samples”. Paragraph 76 discloses EVx- stands for the embedded vector for the negative audio signal. Paragraph 67 discloses “the respective inputs 410,420,430 of the first, second and third feed-forward neural networks … each of the audio signals can be preprocessed by the PCEN frontend to generate a two dimensional representation ( or “image”) of the audio signal.”); 
c) using a machine learning algorithm (Fig. 4 shows the neural network that is used to train the PCEN frontend shown in Fig. 2b and that trains itself (paragraph 77).) with the positive audio embedding (Paragraph 69 discloses positive audio samples as an input to the neural network.) and the negative audio embedding (Paragraph 69 discloses negative audio samples as an input to the neural network.) as inputs to train a visual-to-audio correlation neural network (Paragraph 69 discloses that positive samples, negative samples and anchor samples are input into the neural network. Fig. 4 shows the visual-to-audio correlation neural network, which includes the machine learning algorithm that uses backpropagation to train itself (paragraph 77).) to output a smaller distance between the positive audio embedding and the reference than the negative audio embedding and the reference (Paragraph 77 discloses “After a given batch of training samples are processed, a loss function may be calculated based on the respective outputs 414, 424, 434 of the first, second, and third feed-forward neural networks 412, 422, 432. The computed loss function may be used to train the respective neural networks 412, 422, 432 of the feature extraction block and the PCEN frontend using a backpropagation algorithm with a “stochastic gradient descent” optimizer, which aims at computing the gradient of the loss function with respect to all the weights in the respective neural networks (the PCEN frontend neural networks and the feature extraction block neural network). The goal of the optimizer is to update the weights, in order to minimize the loss function. However, it is also contemplated that other types of backpropagation algorithms may be used. In the example of FIG. 4, the loss function can be used to update the connection weights in each of the first convolutional layer, the second convolutional layer, and the fully connected layer.” By minimizing the loss function, the network outputs a smaller distance between the positive audio embedding and the reference image or anchor than between the negative audio embedding and the reference image or anchor.).



Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 7,16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lakhdhar et al (US Publication No.: 20200322377) in view of Xiang et al (Publication Title: Person Re-identification based on Feature Fusion and Triplet loss function).
Claims 7 and 16, Lakhdhar et al discloses the machine learning algorithm used to train the visual-to-audio correlation neural network (Fig. 4 shows the neural network trained using backpropagation (paragraph 77). Fig. 4, label “loss”, shows the loss, and paragraph 77 discloses minimizing the loss function.), but fails to disclose that such machine learning algorithm includes a pairwise loss function.
Xiang et al discloses “a large number of metric learning algorithms have been applied … These metric learning methods … include … pairwise contrastive, verification loss, triplet loss… These overall architectures are always with two of three branches according to pairwise or triplet loss ….” (Section II. Related work, A. Feature Description and Metric learning, paragraph 2). It would have been obvious to one skilled in the art before the effective filing date of the application to simply substitute one well-known element, the triplet loss function disclosed by Lakhdhar et al in paragraph 36, with another well-known element, the pairwise loss function disclosed by Xiang et al, so as to yield the predictable result of optimizing the loss function and thereby improving the neural network.
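For illustration only, the substitution discussed above can be sketched by scoring the same anchor, positive and negative embeddings with a pairwise loss instead of a triplet loss. This is a minimal sketch assuming one common margin-based contrastive formulation; neither Lakhdhar et al nor Xiang et al is quoted here as giving this formula, and the function names, the margin of 1.0, and the example vectors are hypothetical.

```python
import math


def contrastive_loss(x1, x2, same, margin=1.0):
    """Pairwise contrastive loss on a single pair of embeddings.

    Matching pairs are penalized by their squared distance; non-matching
    pairs are penalized only while they are closer than `margin`.
    """
    d = math.dist(x1, x2)  # Euclidean distance (Python 3.8+)
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2


def pairwise_from_triplet(anchor, positive, negative, margin=1.0):
    """Score a triplet as two independent pairs rather than one joint term."""
    return (contrastive_loss(anchor, positive, same=True, margin=margin)
            + contrastive_loss(anchor, negative, same=False, margin=margin))


# Hypothetical embeddings: the positive is near the anchor, the negative is far away.
anchor = [0.0, 0.0]
positive = [0.5, 0.0]
negative = [3.0, 4.0]

print(pairwise_from_triplet(anchor, positive, negative))
```

Under this sketch the two-branch pairwise form and the three-branch triplet form consume the same embeddings, and both are reduced by moving the positive toward the anchor and the negative away from it.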

Claim(s) 9,18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lakhdhar et al (US Publication No.: 20200322377) in view of Jansen et al (US Publication No.: 20200349921).
Claims 9 and 18, Lakhdhar et al discloses the neural network shown in Fig. 4, but fails to disclose that the machine learning algorithm is a self-supervised learning algorithm and wherein the positive, negative and correlated audio are unlabeled and unannotated inputs.
Jansen et al discloses a positive audio segment, a negative audio segment, and an anchor audio segment as training data (paragraph 3), the machine learning algorithm is a self-supervised learning algorithm (Paragraph 19 discloses an alternative training manner, such as an unsupervised manner, which includes training with unlabeled data.), wherein the positive, negative and correlated audio are unlabeled and unannotated inputs (Paragraph 20 discloses training artificial neural networks in an unsupervised manner. This indicates that the training data are unlabeled.).
It would have been obvious to one skilled in the art before the effective filing date of the application to simply substitute one well-known element, the labeled training data of Lakhdhar et al’s neural network (paragraph 36), with another well-known element, the unlabeled training data for training a neural network as disclosed by Jansen et al, so as to yield the predictable result of training a neural network for optimization.

Allowable Subject Matter
Claims 2-4, 11-13, and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LINDA WONG whose telephone number is (571)272-6044. The examiner can normally be reached 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/LINDA WONG/Primary Examiner, Art Unit 2655