DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments of 22 March 2021 have been fully considered.  Applicant argues that Krishnamurthy does not teach the newly introduced limitations.  Examiner agrees in part: Krishnamurthy does not explicitly teach one of these limitations.  However, the claimed invention would have been obvious over Krishnamurthy, as explained in the new grounds of rejection below.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4, 6, 9, 11-12, 14, 16, and 19 are rejected under 35 U.S.C. 103 as being obvious over Krishnamurthy, US 2020/0349387 A1 (hereinafter “Krishnamurthy”).

As per claims 1 and 11, Krishnamurthy teaches:
determining a plurality of first embedding vectors of a plurality of media content items of a first modality (Krishnamurthy ¶ 0070, “In any case, whether matching the audio tags to object tags, caption tags, or action tags, two numerical vectors are produced, one for the audio tag and one for the tag derived from the video.”); 
receiving a media content clip of a second modality, wherein the second modality is different than the first modality (Krishnamurthy ¶ 0068, “video 1000 such as a computer simulation without sound (audio) is used to generate visual tags 1002 based on visual understanding of, for example, identified objects 1004 in the video, identified actions 1006 in the video, and identified scene descriptions 1008 in the video.”), where video is different than audio; 
determining a second embedding vector of the media content clip of the second modality (Krishnamurthy ¶ 0070, “In any case, whether matching the audio tags to object tags, caption tags, or action tags, two numerical vectors are produced, one for the audio tag and one for the tag derived from the video.”); 
ranking the plurality of first embedding vectors based on a distance between the plurality of first embedding vectors and the second embedding vector (Krishnamurthy ¶ 0070, “The similarity of the tags is determined by computing the distance between the two vectors. Any distance measure, such as cosine similarity or Euclidean distance, can be used. The smaller the distance, the more similar the tags are. Using this approach, each visual tag is mapped to the top-k most similar audio tags.”); 
selecting one or more of the plurality of media content items of the first modality based on the ranking (Krishnamurthy ¶ 0072), where, e.g., sound effects are selected;
presenting, via a user interface, the selected one or more of the plurality of media content items of the first modality (Krishnamurthy ¶ 0072, “First, the audio tags can be used to recommend sound effects for game scenes to the game designers.”), where a recommendation is inherently communicated via an interface; and
providing an output including the media content clip of the second modality and a media content item of the first modality (Krishnamurthy ¶ 0067, “retrieve the corresponding sound effects 918 for combination with the video as shown at 920”), where combination is the output.


Krishnamurthy does not explicitly teach:
receiving, via the user interface, an elected media content item of the first modality; or
providing an output including the media content clip of the second modality and the elected media content item of the first modality.

Krishnamurthy teaches tagging sounds, tagging a video, locating sounds matching a video, and recommending the sounds relevant to the video to a game designer for the purpose of sound mixing (Krishnamurthy ¶ 0057).  Krishnamurthy, however, does not explicitly teach the sound designer choosing a recommended sound effect to mix with the video.

However, it would nevertheless have been obvious to one of ordinary skill in the art at the time of filing for a sound designer, when designing sound for a video using the invention of Krishnamurthy, to select a recommended sound effect and include it in the soundtrack of that video in order to achieve the purpose of the sound design process (Krishnamurthy ¶ 0057).
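
For illustration only, the distance-based ranking quoted from Krishnamurthy ¶ 0070 (computing a distance between two tag vectors and mapping each visual tag to the top-k most similar audio tags) can be sketched as follows. This is a minimal sketch and not code from the reference: the function names, the three-dimensional vectors, and the tag names are all hypothetical, and cosine distance is assumed as one of the distance measures the passage permits.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k_audio_tags(visual_vec, audio_vecs, k):
    """Rank audio tags by distance to one visual tag, smallest distance first,
    and return the names of the k closest tags."""
    ranked = sorted(
        audio_vecs,
        key=lambda name: cosine_distance(visual_vec, audio_vecs[name]),
    )
    return ranked[:k]


# Hypothetical embeddings, invented for this example only.
audio_vecs = {
    "footsteps": [0.9, 0.1, 0.0],
    "explosion": [0.0, 0.2, 0.9],
    "door_slam": [0.6, 0.6, 0.1],
}
visual_vec = [1.0, 0.2, 0.0]  # stands in for a visual tag such as "person walking"

recommended = top_k_audio_tags(visual_vec, audio_vecs, k=2)
```

In this sketch, the returned list corresponds to the ranked recommendations from which a designer could elect one item.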

As per claims 2 and 12, the rejection of claims 1 and 11 is incorporated, and Krishnamurthy further teaches:
wherein the first modality is an auditory modality and the second modality is a visual modality (Krishnamurthy ¶ 0067), where, for a video, sound effects are found.

As per claims 4 and 14, the rejection of claims 2 and 12 is incorporated, and Krishnamurthy further teaches:
wherein the audio modality is music (Krishnamurthy ¶ 0067), where music is nonfunctional descriptive matter describing the sound effect.

As per claims 6 and 16, the rejection of claims 1 and 11 is incorporated, and Krishnamurthy further teaches:
wherein a model is trained by constraining a stream of video with a plurality of predetermined tags for the first modality and constraining a stream of audio with a plurality of predetermined tags for the second modality (Krishnamurthy ¶ 0065, “As shown, during training, videos 900 with sound extracted, along with the noisy SFX tags 902 generated as described above and/or human-annotated, are input to a training phase module 904. With greater specificity, the corresponding audio that is extracted from the video is passed through the noisy SFX model explained above in FIG. 8 to generate the SFX tags or labels 902, which are input along with the corresponding video segment 900 to the supervised training phase model 904.”).

As per claims 9 and 19, the rejection of claims 2 and 12 is incorporated, and Krishnamurthy further teaches:
wherein the video modality is obtained from any one of a movie, a television program, a photo, a single frame of a video, or a combination thereof (Krishnamurthy ¶¶ 0068-70), where a video is a combination of single frames of video.

Claims 3, 5, 10, 13, 15, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Krishnamurthy, US 2020/0349387 A1 (hereinafter “Krishnamurthy”), in view of Leekley et al., US 2019/0267042 A1 (hereinafter “Leekley”).

As per claims 3 and 13, the rejection of claims 1 and 11 is incorporated, but Krishnamurthy does not explicitly teach:
wherein the first modality is a visual modality and the second modality is an auditory modality.

The analogous and compatible art of Leekley, however, teaches ranking videos based on their association with selected audio (Leekley ¶ 0010).

It would therefore have been obvious to modify the teachings of Krishnamurthy with those of Leekley to use the tag vector distance method of Krishnamurthy to rank videos for a given audio file content segment of Leekley in order to provide better ranking to locate videos for a given audio file segment.

As per claims 5 and 15, the rejection of claims 3 and 13 is incorporated, and Krishnamurthy further teaches:
wherein the audio modality is music (Krishnamurthy ¶ 0067), where music is nonfunctional descriptive matter describing the sound effect.

As per claims 10 and 20, the rejection of claims 3 and 13 is incorporated, and Krishnamurthy further teaches:
wherein the video modality is obtained from any one of a movie, a television program, a photo, a single frame of a video, or a combination thereof (Krishnamurthy ¶¶ 0068-70), where a video is a combination of single frames of video.

Claims 7-8 and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Krishnamurthy, US 2020/0349387 A1 (hereinafter “Krishnamurthy”), in view of Devkar et al., US 2014/0214848 A1 (hereinafter “Devkar”).

As per claims 7 and 17, the rejection of claims 6 and 16 is incorporated, but Krishnamurthy does not teach:
wherein the one or more predetermined tags are used to represent an emotion.

The analogous and compatible art of Devkar, however, teaches assigning a tag vector to audio and video content comprising emotion tags (Devkar ¶¶ 0045-0048).

It would therefore have been obvious to combine the teachings of Devkar with those of Krishnamurthy to match sound effects with visuals based on the vector distance between emotion tags in order to better match sound effects with a scene.

As per claims 8 and 18, the rejection of claims 7 and 17 is incorporated, but Krishnamurthy does not teach:
wherein the emotions are selected from a set of predetermined emotions.

The analogous and compatible art of Devkar, however, teaches assigning a tag vector to audio and video content comprising emotion tags (Devkar ¶¶ 0045-0048), where tags are selected from a predetermined group (Devkar ¶ 0036).
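
For illustration only, matching on emotion tags drawn from a predetermined set can be sketched as follows. This is a hypothetical example and is not taken from Devkar or Krishnamurthy: the emotion vocabulary, the multi-hot encoding, the choice of Euclidean distance, and all names are assumptions made for the sketch.

```python
import math

# Hypothetical predetermined emotion vocabulary (an assumption, not from Devkar).
EMOTIONS = ("joy", "fear", "sadness", "anger")


def emotion_vector(tags):
    """Encode a set of emotion tags as a multi-hot vector over EMOTIONS.
    Tags outside the predetermined set are rejected."""
    unknown = set(tags) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"tags outside the predetermined set: {sorted(unknown)}")
    return [1.0 if emotion in tags else 0.0 for emotion in EMOTIONS]


def euclidean_distance(a, b):
    """Return the Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical scene and sound-effect tags, invented for this example only.
scene = emotion_vector({"fear", "anger"})
sounds = {
    "thunder": emotion_vector({"fear"}),
    "birdsong": emotion_vector({"joy"}),
}

# The sound effect whose emotion vector lies closest to the scene's vector.
closest = min(sounds, key=lambda name: euclidean_distance(scene, sounds[name]))
```

Restricting the encoding to a fixed vocabulary is what keeps every tag vector the same length, so that a distance between any two of them is well defined.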


Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM SPIELER whose telephone number is (571)270-3883.  The examiner can normally be reached on Monday-Friday, 11-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Mariela Reyes, can be reached at 571-270-1006.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.



WILLIAM SPIELER
Primary Examiner
Art Unit 2159



/WILLIAM SPIELER/, Primary Examiner, Art Unit 2159