Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5-11, and 13-19 are rejected under 35 U.S.C. 103 as being unpatentable over Shen (US 2005/0022252) in view of Dimitrova (US 2007/0061352) and in further view of Sharma (US 2017/0032222).

As for claim 1, Shen teaches:
A method comprising:
by one or more computing devices, accessing a video-content object (Shen, Fig 1, [0054] capturing and pre-processing video content), wherein the video-content object comprises content of at least a first modality type and a second modality type (video and audio data, see below);
by one or more computing devices, determining using a first recognition module of the first modality type, a first feature vector representing the video-content object [..] (“first recognition module” can be Audio MMRP, Fig 2, [0061], and “first modality” can be “audio”);
by one or more computing devices, determining using a second recognition module of the second modality type, a second feature vector representing the video-content object, wherein [..] the first modality type is different from the second modality type, [..] (“the second recognition module” can be Text MMRP, Fig 2, [0061], and “second modality” can be “text”); and
by one or more computing devices, determining .. a context of the video-content object [based on the first feature vector and second feature vector] (Shen, Fig 1, Fig 3, [0057] Indexed Database: video clips are indexed and tagged with topic information, e.g. “dialogues”, “scenery”, “car chase”; any of the produced indexing or tagging can be considered “context”, as it is a broad term and can be interpreted as any information associated with the input).
Shen does not teach, but Dimitrova teaches:
wherein the first feature vector is an n-dimensional vector in an n-dimensional vector space (Dimitrova [0046], Table 1, a number of audio features extracted);
wherein the second feature vector is an m-dimensional vector in an m-dimensional vector space that is different from the n-dimensional vector space (Dimitrova [0046], Table 1, and [0072-74] a number of text features extracted; the number of text features and the number of audio features are not directly related, i.e. “a different vector space”);
	determining, using a machine learning network (Dimitrova [0028] ln 8-9, neural network), a context of the video-content object ([0049] a high-level information structure, including identification of actors, plot summary and semantic scene description, can be called “context”; it would also be obvious to apply Dimitrova’s method to Shen’s tagging of video clips with a thematic tag, i.e. “context”, as these are all ways of categorizing and/or producing a summary description of video) and wherein the machine learning network is trained on features represented in both the n-dimensional vector space and the m-dimensional vector space (Dimitrova [0028], [0031]; [0058] training on text and audio features) by:
	determining, [based on the first and second feature vector] the context that describes at least an aspect of the video-content object (Dimitrova [0028] training a neural network on “intrinsic data”, described in [0011] as video, audio or text signals; it would be obvious to apply Shen’s video and audio data features as input to Dimitrova’s classifying neural network, as discussed below).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to modify Shen by including features of Dimitrova, as both pertain to multimedia data analysis. The motivation to do so would have been that Shen describes classifying the video content by audio, text and video features, but does not provide much detail on how the classifiers are generated. Dimitrova teaches a common method of training the classifiers using training data.
The combination of Shen and Dimitrova does not teach, but Sharma teaches:
	.. determining, using the machine learning network receiving at least the first feature vector and the second feature vector as inputs, a first combination feature vector representing the video-content object (Sharma, Fig 2, el 218, teaches generating, by a group of neural networks, a combined feature vector from earlier-generated feature vectors 210 and 216)
	.. generate a combination feature vector based on a combination of the inputs (Fig 2, el 210, 216 and 218 as discussed above)
	determining, based on the first combination vector, the [classification/context] (Fig 2, el 220 performs classification using the combined feature vector 218)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination of Shen and Dimitrova by including features of Sharma, as all pertain to image data analysis and classification. The motivation to do so would have been that Dimitrova describes classifying video content using neural networks, but does not provide many details on how this is performed, for example what inputs are fed into the classification portion of the ML engine. Sharma teaches a commonly known method of combining feature vectors into a single combined vector and inputting it into a classifier. The advantage of doing so is that it allows the use of a simple classifier that processes only a single feature vector, instead of multiple feature vectors.
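For illustration only, the mapping discussed for claim 1 can be sketched as follows: two feature vectors of different dimensionality are combined into a single combination feature vector, which is then passed to a simple classifier. This is a minimal sketch and is not taken from Shen, Dimitrova or Sharma. Concatenation is assumed as the way of combining the vectors, a hand-weighted linear scorer stands in for a trained network, and all names and values (combine_features, classify, the weights and labels) are hypothetical.

```python
from typing import Dict, List, Sequence


def combine_features(first: Sequence[float], second: Sequence[float]) -> List[float]:
    """Concatenate an n-dimensional and an m-dimensional vector into one (n + m)-dimensional vector."""
    return list(first) + list(second)


def classify(combined: Sequence[float], weights: Dict[str, Sequence[float]]) -> str:
    """Return the label whose weight vector has the largest dot product with the combined vector."""
    scores = {}
    for label, w in weights.items():
        if len(w) != len(combined):
            raise ValueError(f"weight vector for {label!r} has the wrong length")
        scores[label] = sum(wi * xi for wi, xi in zip(w, combined))
    return max(scores, key=scores.get)


# Hypothetical inputs: a 3-dimensional audio vector and a 2-dimensional text vector.
audio_vec = [0.9, 0.1, 0.0]
text_vec = [0.2, 0.8]
combined_vec = combine_features(audio_vec, text_vec)

# Hypothetical, hand-picked weights standing in for a trained classifier.
label_weights = {
    "dialogue": [0.0, 0.2, 0.0, 0.1, 0.5],
    "car chase": [1.0, 0.0, 0.5, 0.3, 0.0],
}
context = classify(combined_vec, label_weights)
```

In this sketch the classifier only ever sees one vector, which is the simplification attributed above to using a single combined feature vector.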

As for independent claims 9 and 17, please see discussion of analogous claim 1 above.

As for claims 2, 10, and 18, the combination of Shen, Dimitrova and Sharma teaches:
the first recognition module is an audio-recognition module (Shen, Fig 2 Audio MMRP [0064]);
the first feature vector represents a predicted transcript of the video-content object, wherein the predicted transcript comprises text (Shen, [0065] speech recognition produces text).


As for claims 3, 11, and 19, the combination of Shen, Dimitrova and Sharma teaches:
the first recognition module is a video-recognition module and the second recognition module is a text-recognition module (optional limitation);
the first recognition module is a video-recognition module and the second recognition module is an audio-recognition module (optional limitation; further discussed in claim 6), or
the first recognition module is an audio-recognition module and the second recognition module is a text-recognition module (Shen, Fig 2 Text MMRP [0061]).

As for claims 5 and 13, the combination of Shen, Dimitrova and Sharma teaches:
the video-content object comprises frames and audio and is associated with text (Shen, Fig 1, Fig 2, video frames, and associated audio and text information); and 
the object in the video-content object is one of:
one or more of the frames (Shen, Fig 1, 2);
one or more portions of the audio (Shen, Fig 1, 2); or
at least some of the text (Shen, Fig 1, 2).

As for claims 6 and 14, the combination of Shen, Dimitrova and Sharma teaches:
the first recognition module is a video-recognition module (Shen, Video MMRP can be called “first”);
the first feature vector represents an intermediate output prediction (any of the outputs of video processing module, for example in Fig 4) generated using a first ML model included in the first recognition module (Shen [0077] teaches a neural network as part of the Image Recognition (IR) module, which is described to be a part of the Video MMRP module); and
the second recognition module is an audio-recognition module (Shen, Audio MMRP, can be called “second”)
at least a portion of the ML network is included in a fusion module that generates the first combination feature vector (Sharma, Fig 2, as discussed in claim 1, teaches multiple neural network modules; a group of the NNs can be called “a fusion module” and a single NN “a portion of the ML network”)

As for claims 7 and 15, the combination of Shen, Dimitrova and Sharma teaches:
by one or more computing devices, determining, using a third recognition module of a third modality type, a third feature vector representing the video-content object, (Shen, if “first” is Audio and “second” is Text, then “third” can be “Video MMRP”, Fig 2 [0061]); wherein 
the third modality type is different from the first and second modality types (Shen, Video MMRP, as discussed above) 
the third feature vector is an k-dimensional vector in an k-dimensional vector space that is different from the n-dimensional vector space and the m-dimensional vector space (Dimitrova Table 1, [0046] extracting visual features) and
wherein the machine learning network (Dimitrova [0028] neural network) further determines the first combination feature vector based at least on the third feature vector (Shen, Fig 1, Fig 3, [0057] indexing and tagging, as discussed in claim 1; it would be obvious to apply Shen’s video, audio and text features to Dimitrova’s neural network, as discussed in claim 1).

As for claims 8 and 16, the combination of Shen, Dimitrova and Sharma teaches:
extracting at least one feature from each frame of a first set of frames of the video-content object to generate a first set of feature vectors (Shen, Fig 4, Video MMRP, multiple features are analyzed: Geometry, Color, etc); and
pooling two or more of the first set of feature vectors to generate the first feature vector (Shen, Fig 1, 2, Video MMRP is used to generate the indexing and tagging, as discussed in claim 1).
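For illustration only, the limitation of claims 8 and 16 can be sketched as below, reading it as pooling per-frame feature vectors into a single vector. Element-wise mean pooling is an assumption made for this sketch; neither the claims as quoted here nor Shen specify a pooling operation, and the function name and sample values are hypothetical.

```python
from typing import List, Sequence


def mean_pool(frame_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average two or more equal-length per-frame feature vectors element by element."""
    if len(frame_vectors) < 2:
        raise ValueError("pooling requires two or more feature vectors")
    dim = len(frame_vectors[0])
    if any(len(v) != dim for v in frame_vectors):
        raise ValueError("all feature vectors must have the same length")
    count = len(frame_vectors)
    return [sum(v[i] for v in frame_vectors) / count for i in range(dim)]


# Hypothetical per-frame features (e.g. one vector per frame of a short clip).
frame_features = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
first_feature_vector = mean_pool(frame_features)
```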


B. Claims 4, 12, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Shen, Dimitrova and Sharma in further view of Naaman (Mor Naaman et al., Leveraging Context to Resolve Identity in Photo Albums, JCDL 2005).

As for claims 4, 12, and 20, the combination of Shen, Dimitrova and Sharma does not teach, but Naaman teaches:
the video-content object corresponds to a node in a social graph of a social-networking system (Naaman, ch 4.2, par 4, determines links between people appearing in a set of photos; processing photos as in Naaman is also equally applicable to processing video frames as in the present claim); 
the social graph comprises a plurality of nodes and edges connecting the nodes (Naaman, ch 4.2, par 4, people i1, i2 are “nodes” and the function KE(i1,i2) defining the link is an “edge”); and
the context of the video-content object is further determined based on social-graph information that is based at least in part on one or more nodes or edges connected to the node corresponding to the video-content object, in addition to the first combination feature vector (it would be obvious to include additional information for context generation of Shen and Dimitrova, for example including the information about people appearing in a video, to produce the indexes and tags).
It would have been obvious for one of ordinary skill in the art to modify the video indexing and tagging method of the combination of Shen, Dimitrova and Sharma to further include person identification information taught by Naaman, as all pertain to the arts of video and image content analysis. The motivation to do so would have been to increase the usability of indexing and tagging by including additional relevant information.
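For illustration only, the social-graph limitation of claims 4, 12 and 20 can be sketched as an adjacency list in which the video-content object is a node, and tags from connected nodes are added to a context already determined from the content itself. This is a minimal sketch and is not taken from Naaman or the other references; the graph, node names and tags are hypothetical.

```python
from typing import Dict, List, Set


def neighbor_tags(graph: Dict[str, List[str]],
                  node_tags: Dict[str, List[str]],
                  video_node: str) -> Set[str]:
    """Collect the tags of all nodes connected by an edge to the video node."""
    tags: Set[str] = set()
    for neighbor in graph.get(video_node, []):
        tags.update(node_tags.get(neighbor, []))
    return tags


# Hypothetical social graph: edges stored as adjacency lists.
social_graph = {
    "video_1": ["user_a", "user_b"],
    "user_a": ["video_1"],
    "user_b": ["video_1"],
}
node_tags = {"user_a": ["hiking"], "user_b": ["hiking", "travel"]}

# Context determined from the content alone (assumed), then augmented from the graph.
base_context = {"scenery"}
augmented_context = base_context | neighbor_tags(social_graph, node_tags, "video_1")
```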

Final Rejection
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 


Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARK ROZ whose telephone number is (571)270-3382.  The examiner can normally be reached on 9AM-5PM M-F.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Chan Park, can be reached at (571)272-7409. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/MARK ROZ/
Examiner, Art Unit 2669