Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings
The drawings were received on 4/14/2020.  These drawings are accepted.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-6, 10-12, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Attorre et al (US Publication No.: 20190035431) in view of Gan et al (US Publication No.: 20200242507).
Claim 1, Attorre et al discloses 
using a machine learning algorithm with a reference visual input and audio signal (paragraph 232 discloses training a neural network using pairs of feature representations of shots, given an observation or reference visual input. The training is performed using the observation or reference visual input and MFCC audio features.), to train a multimodal clustering neural network (paragraph 232 discloses training a neural network to cluster the input as a positive or negative sample in relation to the observation.) to output representations for the visual input and audio input (Paragraph 232 discloses that the output is a decision of whether the input is a positive or negative sample. The positive or negative samples are representations of the input. Label 1506 indicates the input as a triplet of shots; paragraph 151 discloses shots as visual scenes. Paragraph 125 discloses the audio/visual data.) as well as correlation scores between the audio and visual representations (Paragraph 232 discloses inputting the triplets of shots (1506) into the trained neural network to create a score similarity matrix. The similarity scores are between the observation and the negative samples and between the observation and the positive samples, wherein the negative and positive samples are audio and visual representations.), 
wherein the trained multimodal clustering neural network is configured to learn representations in such a way that the visual representation and positive audio representation have higher correlation scores than the visual representation and a negative audio representation or an unrelated audio representation (paragraph 232 discloses “The triple loss neural network is trained to learn the weights in order to minimize the similarity between the observation and the negative samples and to maximize the similarity between the observation and the positive samples.”).
	Attorre et al fails to disclose that the training is performed using a positive audio signal input or a negative audio signal input, that the observation or reference input is visual, and that the shots include audio input.
	Gan et al discloses that the audio signal includes a positive audio signal or negative audio signal (Paragraph 65 discloses positive and negative examples of audio features corresponding to the extracted image frame.), that the reference input or observation is visual (paragraph 65 discloses “anchor (ie. extracted image feature)”.), and that the shots include audio input (Fig. 6, in which the label “speech video” indicates shots of the video and 510 denotes the speech or audio accompanying the video.). It would be obvious to one skilled in the art before the effective filing date of the application to modify Attorre et al’s audio signal to include the positive and/or negative audio samples or signals disclosed by Gan et al so as to improve the performance of the neural network.
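For illustration only, the training objective discussed above (maximizing similarity between the observation and positive samples while minimizing similarity between the observation and negative samples) can be sketched as a margin-based triplet loss over similarity scores. This sketch is not taken from either reference: the use of cosine similarity as the score, the margin value, the function names, and the two-dimensional example embeddings are all assumptions chosen for readability.

```python
import math


def cosine_similarity(u, v):
    """Similarity score between two embeddings (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on similarity scores.

    The loss is zero once the anchor/positive score exceeds the
    anchor/negative score by at least `margin` (a hypothetical value).
    """
    pos_score = cosine_similarity(anchor, positive)
    neg_score = cosine_similarity(anchor, negative)
    return max(0.0, margin - pos_score + neg_score)


# Hypothetical embeddings: a visual anchor, a related (positive) audio
# sample, and an unrelated (negative) audio sample.
visual_anchor = [1.0, 0.0]
positive_audio = [0.9, 0.1]
negative_audio = [0.0, 1.0]

# Correct ordering of scores gives zero loss.
loss_good = triplet_loss(visual_anchor, positive_audio, negative_audio)
# Swapping the positive and negative samples is penalized.
loss_bad = triplet_loss(visual_anchor, negative_audio, positive_audio)
```

Under these assumptions, a network trained to reduce this loss would tend to score the visual anchor higher against the positive audio sample than against the negative one.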
Claim 2, Attorre et al discloses applying a clustering algorithm to the reference image or video embedding (Paragraph 232 discloses a clustering algorithm or process to decide whether the input is a positive sample or a negative sample in relation to the observation.).
Claims 3 and 12, Attorre et al discloses wherein the positive audio signal and reference visual input are part of an audio/video sequence (paragraph 232 discloses the observation or reference visual input and MFCC audio features. Paragraph 126 discloses that the MFCC audio features are derived from the audio information of the digital media or A/V sequence (paragraph 228).).
Claim 10, Attorre et al discloses 
a processor (paragraph 341);
a memory coupled to the processor (paragraphs 340-341);
non-transitory instructions embedded in the memory that when executed by the processor cause the processor to carry out the method comprising (paragraph 341 discloses that the processor is configured to execute the stored instructions and that the instructions are stored in a non-transitory computer readable memory.):
using a machine learning algorithm with a reference visual input and audio signal (paragraph 232 discloses training a neural network using pairs of feature representations of shots, given an observation or reference visual input. The training is performed using the observation or reference visual input and MFCC audio features.), to train a multimodal clustering neural network (paragraph 232 discloses training a neural network to cluster the input as a positive or negative sample in relation to the observation.) to output representations for the visual input and audio input (Paragraph 232 discloses that the output is a decision of whether the input is a positive or negative sample. The positive or negative samples are representations of the input. Label 1506 indicates the input as a triplet of shots; paragraph 151 discloses shots as visual scenes. Paragraph 125 discloses the audio/visual data.) as well as correlation scores between the audio and visual representations (Paragraph 232 discloses inputting the triplets of shots (1506) into the trained neural network to create a score similarity matrix. The similarity scores are between the observation and the negative samples and between the observation and the positive samples, wherein the negative and positive samples are audio and visual representations.), 
wherein the trained multimodal clustering neural network is configured to learn representations in such a way that the visual representation and positive audio representation have higher correlation scores than the visual representation and a negative audio representation or an unrelated audio representation (paragraph 232 discloses “The triple loss neural network is trained to learn the weights in order to minimize the similarity between the observation and the negative samples and to maximize the similarity between the observation and the positive samples.”).
	Attorre et al fails to disclose that the training is performed using a positive audio signal input or a negative audio signal input, that the observation or reference input is visual, and that the shots include audio input.
	Gan et al discloses that the audio signal includes a positive audio signal or negative audio signal (Paragraph 65 discloses positive and negative examples of audio features corresponding to the extracted image frame.), that the reference input or observation is visual (paragraph 65 discloses “anchor (ie. extracted image feature)”.), and that the shots include audio input (Fig. 6, in which the label “speech video” indicates shots of the video and 510 denotes the speech or audio accompanying the video.). It would be obvious to one skilled in the art before the effective filing date of the application to modify Attorre et al’s audio signal to include the positive and/or negative audio samples or signals disclosed by Gan et al so as to improve the performance of the neural network.
Claim 11, Attorre et al discloses applying a clustering algorithm to the reference visual representation (Paragraph 232 discloses a clustering algorithm or process to decide whether the input is a positive sample or a negative sample in relation to the observation.).
Claim 18, Attorre et al discloses 
using a machine learning algorithm with a reference visual input and audio signal (paragraph 232 discloses training a neural network using pairs of feature representations of shots, given an observation or reference visual input. The training is performed using the observation or reference visual input and MFCC audio features.), to train a multimodal clustering neural network (paragraph 232 discloses training a neural network to cluster the input as a positive or negative sample in relation to the observation.) to output representations for the visual input and audio input (Paragraph 232 discloses that the output is a decision of whether the input is a positive or negative sample. The positive or negative samples are representations of the input. Label 1506 indicates the input as a triplet of shots; paragraph 151 discloses shots as visual scenes. Paragraph 125 discloses the audio/visual data.) as well as correlation scores between the audio and visual representations (Paragraph 232 discloses inputting the triplets of shots (1506) into the trained neural network to create a score similarity matrix. The similarity scores are between the observation and the negative samples and between the observation and the positive samples, wherein the negative and positive samples are audio and visual representations.), 
wherein the trained multimodal clustering neural network is configured to learn representations in such a way that the visual representation and positive audio representation have higher correlation scores than the visual representation and a negative audio representation or an unrelated audio representation (paragraph 232 discloses “The triple loss neural network is trained to learn the weights in order to minimize the similarity between the observation and the negative samples and to maximize the similarity between the observation and the positive samples.”).
	Attorre et al fails to disclose that the training is performed using a positive audio signal input or a negative audio signal input, that the observation or reference input is visual, and that the shots include audio input.
	Gan et al discloses that the audio signal includes a positive audio signal or negative audio signal (Paragraph 65 discloses positive and negative examples of audio features corresponding to the extracted image frame.), that the reference input or observation is visual (paragraph 65 discloses “anchor (ie. extracted image feature)”.), and that the shots include audio input (Fig. 6, in which the label “speech video” indicates shots of the video and 510 denotes the speech or audio accompanying the video.). It would be obvious to one skilled in the art before the effective filing date of the application to modify Attorre et al’s audio signal to include the positive and/or negative audio samples or signals disclosed by Gan et al so as to improve the performance of the neural network.

Claims 4-6, 13, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Attorre et al (US Publication No.: 20190035431) in view of Gan et al (US Publication No.: 20200242507), and further in view of Lakhdhar et al (US Publication No.: 20200322377).
Claims 4 and 13, Attorre et al and Gan et al disclose the positive audio signal and the negative audio signal (paragraph 232 of Attorre et al and paragraph 65 of Gan et al), but fail to disclose wherein the positive audio signal and the negative audio signal are mixtures of two or more audio signals.
Lakhdhar et al discloses wherein the positive audio signal and the negative audio signal are mixtures of two or more audio signals (Paragraph 69 discloses that a plurality of audio signals includes a set of positive audio samples and a set of negative audio samples. Paragraph 56 discloses that the audio signal includes foreground loudness and stationary background noise. Paragraph 69 discloses that the positive audio samples are included in the audio signals. This indicates that the positive audio samples include noise, since the audio signal includes noise; hence the audio samples are a mixture of audio signals.). It would be obvious to one skilled in the art before the effective filing date of the application to simply substitute one well-known element, the positive and negative audio signals of Attorre et al in view of Gan et al, with another well-known element, the mixed audio signals disclosed by Lakhdhar et al, so as to yield the predictable result of positive and negative audio signals.
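As a purely illustrative sketch of what "a mixture of two or more audio signals" can mean in this context, the following combines a foreground signal with stationary background noise sample by sample. It is not drawn from Lakhdhar et al; the function name `mix_signals`, the `background_gain` parameter, and the sample values are hypothetical.

```python
def mix_signals(foreground, background, background_gain=0.5):
    """Return a sample-wise mixture of two equal-length audio signals.

    `background_gain` scales the background (for example, noise)
    before it is added to the foreground; the default is an assumption.
    """
    if len(foreground) != len(background):
        raise ValueError("signals must have the same number of samples")
    return [f + background_gain * b for f, b in zip(foreground, background)]


# Hypothetical two-sample signals: a foreground sound and constant noise.
mixed = mix_signals([1.0, -1.0], [0.5, 0.5], background_gain=0.5)
```

A positive or negative audio sample produced this way contains both the foreground content and the noise, which is the sense of "mixture" relied on above.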
Claim 5, Lakhdhar et al discloses wherein the mixture of two or more audio signals may include noise signals (Paragraph 69 discloses that a plurality of audio signals includes a set of positive audio samples and a set of negative audio samples. Paragraph 56 discloses that the audio signal includes foreground loudness and stationary background noise. Paragraph 69 discloses that the positive audio samples are included in the audio signals. This indicates that the positive audio samples include noise, since the audio signal includes noise.).
Claims 6 and 14, Attorre et al discloses that the positive audio signal includes signals for sounds directly produced by one or more objects or actions in the reference input or indirectly associated with one or more objects or actions in the reference input (Paragraph 232 discloses similarity between the observation or reference input and negative samples and similarity between the observation or reference input and positive samples. The computation of similarity indicates that the observation and the positive audio signal have similar sounds, wherein the positive audio signal is from the input (paragraph 232) and the input includes audio and visual data (paragraph 125).).
Attorre et al fails to disclose that the reference input or observation is visual.
Gan et al discloses that the reference input or observation is visual (paragraph 65 discloses “anchor (ie. extracted image feature)”.). It would be obvious to one skilled in the art before the effective filing date of the application to modify Attorre et al’s audio signal to include the positive and/or negative audio samples or signals disclosed by Gan et al so as to improve the performance of the neural network.



Allowable Subject Matter
Claims 7-9 and 15-17 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LINDA WONG whose telephone number is (571)272-6044. The examiner can normally be reached 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached on (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/LINDA WONG/Primary Examiner, Art Unit 2655