DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
It is acknowledged that this Application claims foreign priority to CN201910632609.3 filed 14 July 2019.  Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 2 July 2020 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Objections
Claims 2, 19 and 20 are objected to because of the following informalities:  
Claim 2 recites the limitation "the number" in line 7.  There is insufficient antecedent basis for this limitation in the claim.
It appears that claims 19 and 20 should recite “The apparatus of claim 18” instead of “The method of claim 1.”  It appears this way since otherwise the claims . 
Appropriate correction is required.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 2 and 10-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by US PGPub 2018/0032845 to Polak et al. (hereafter Polak).

Referring to claim 1, Polak discloses a method comprising:
determining a plurality of feature [concepts] combinations based on respective feature sets corresponding to at least two types of modality information in a multimedia file, the multimedia file including a plurality of types of modality information, the types of modality information selected from the group consisting of a text modality, an image modality, and a voice modality [modalities can be visual, audio and textual] (see [0040]; [0074]; [0075]; [0079]; [0091]; [0100]; [0101]; [0102]; and Fig 1 – As shown at 106, the video classification module extracts one or more modalities data, for example, visual data, motion data, audio data and/or textual data. At 108, the video classification module 220 applies one or more of a plurality of classifiers over the extracted modalities data. At step 114, the module aggregates the in-modality class probabilities across all modalities.);
determining a semantically relevant feature combination using a first computational model based on the plurality of feature combinations (see Fig 5 and [0106]-[0109]); and
categorizing the multimedia file using the first computational model (Fig 1, step 120 – output classification data).
Referring to claim 2, Polak discloses the method of claim 1, the determining a semantically relevant feature combination comprising:
determining, using an attention mechanism of the first computational model, attention features corresponding to a first feature combination from a plurality of first features constituting the first feature combination (see Fig 5; [0107]; and [0111]); and
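combining the attention features corresponding to the first feature combination to obtain the semantically relevant feature combination when the number of the attention features corresponding to the first feature combination is greater than a first preset threshold (see Fig 5; [0107]; and [0111]).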
Referring to claim 10, Polak discloses the method of claim 1, the plurality of feature combinations comprising a third feature combination and determining the third feature combination comprising:
sampling a feature from each feature set in the respective feature sets corresponding to the at least two types of modality information (Fig 1, steps 108 and 110);
linearly mapping the feature selected from each feature set to obtain a plurality of features of the same dimensionality (Fig 1, steps 112 and 114); and
combining the plurality of features of the same dimensionality into the third feature combination (see Fig 5 and Fig 1, steps 112, 114 and 116).
Referring to claim 11, Polak discloses the method of claim 1, the at least two types of modality information comprising image modality information, the method further comprising:
inputting a plurality of video frames in the image modality information to an image feature extraction model, and extracting respective image features corresponding to the plurality of video frames (Fig 1 and Fig 4, steps 310A, 410A and 420A); and
generating a feature set corresponding to the image modality information with the respective image features corresponding to the plurality of video frames (Fig 1 and Fig 4, steps 310A, 410A and 420A).
Referring to claim 12, Polak discloses the method of claim 1, the at least two types of modality information comprising text modality information, the method further comprising:
inputting a plurality of text words in the text modality information to a text word feature extraction model, and extracting respective text word features corresponding to the plurality of text words (Fig 1 and Fig 4, steps 310D, 410D and 420D); and
generating a feature set corresponding to the text modality information with the respective text word features corresponding to the plurality of text words (Fig 1 and Fig 4, steps 310D, 410D and 420D).
Referring to claim 13, Polak discloses a non-transitory computer-readable storage medium for tangibly storing computer program instructions capable of being executed by a computer processor (see: [0059] and [0060]), the computer program instructions defining the steps of:
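determining a plurality of feature [concepts] combinations based on respective feature sets corresponding to at least two types of modality information in a multimedia file, the multimedia file including a plurality of types of modality information, the types of modality information selected from the group consisting of a text modality, an image modality, and a voice modality [modalities can be visual, audio and textual] (see [0040]; [0074]; [0075]; [0079]; [0091]; [0100]; [0101]; [0102]; and Fig 1 – As shown at 106, the video classification module extracts one or more modalities data, for example, visual data, motion data, audio data and/or textual data. At 108, the video classification module 220 applies one or more of a plurality of classifiers over the extracted modalities data. At step 114, the module aggregates the in-modality class probabilities across all modalities.);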
determining a semantically relevant feature combination using a first computational model based on the plurality of feature combinations (see Fig 5 and [0106]-[0109]); and
categorizing the multimedia file using the first computational model (Fig 1, step 120 – output classification data).
Referring to claim 14, Polak discloses the non-transitory computer-readable storage medium of claim 13, the determining a semantically relevant feature combination comprising:
determining, using an attention mechanism of the first computational model, attention features corresponding to a first feature combination from a plurality of first features constituting the first feature combination (see Fig 5; [0107] and [0111]); and
combining the attention features corresponding to the first feature combination to obtain the semantically relevant feature combination when the number of the attention features corresponding to the first feature combination is greater than a first preset threshold (Fig 5; [0107]; and [0111]).
Referring to claim 15, Polak discloses the non-transitory computer-readable storage medium of claim 13, the plurality of feature combinations comprising a third feature combination and determining the third feature combination comprising:
sampling a feature from each feature set in the respective feature sets corresponding to the at least two types of modality information (Fig 1, steps 108 and 110);
linearly mapping the feature selected from each feature set to obtain a plurality of features of the same dimensionality (Fig 1, steps 112 and 114); and
combining the plurality of features of the same dimensionality into the third feature combination (Fig 5 and Fig 1, steps 112, 114 and 116).
Referring to claim 16, Polak discloses the non-transitory computer-readable storage medium of claim 13, the at least two types of modality information comprising image modality information, the method further comprising:
inputting a plurality of video frames in the image modality information to an image feature extraction model, and extracting respective image features corresponding to the plurality of video frames (Fig 1 and Fig 4, steps 310A, 410A and 420A); and
generating a feature set corresponding to the image modality information with the respective image features corresponding to the plurality of video frames (Fig 1 and Fig 4, steps 310A, 410A and 420A).
Referring to claim 17, Polak discloses the non-transitory computer-readable storage medium of claim 13, the at least two types of modality information comprising text modality information, the method further comprising:
inputting a plurality of text words in the text modality information to a text word feature extraction model, and extracting respective text word features corresponding to the plurality of text words (see Fig 1 and Fig 4, steps 310D, 410D and 420D); and
generating a feature set corresponding to the text modality information with the respective text word features corresponding to the plurality of text words (see Fig 1 and Fig 4, steps 310D, 410D and 420D).
Referring to claim 18, Polak discloses an apparatus comprising: 
a processor (see Fig 1, item 204); and 
a storage medium for tangibly storing thereon program logic for execution by the processor (see [0060]), the stored program logic comprising: 
logic, executed by the processor, determining a plurality of feature [concepts] combinations based on respective feature sets corresponding to at least two types of modality information in a multimedia file, the multimedia file including a plurality of types of modality information, the types of modality information selected from the group consisting of a text modality, an image modality, and a voice modality [modalities can be visual, audio and textual] (see [0040]; [0074]; [0075]; [0079]; [0091]; [0100]; [0101]; [0102]; and Fig 1 – As shown at 106, the video classification module extracts one or more modalities data, for example, visual data, motion data, audio data and/or textual data. At 108, the video classification module 220 applies one or more of a plurality of classifiers over the extracted modalities data. At step 114, the module aggregates the in-modality class probabilities across all modalities.);
logic, executed by the processor, determining a semantically relevant feature combination using a first computational model based on the plurality of feature combinations (see Fig 5 and [0106]-[0109]); and
logic, executed by the processor, categorizing the multimedia file using the first computational model (Fig 1, step 120 – output classification data).
Referring to claim 19, Polak discloses the method of claim 1, the determining a semantically relevant feature combination comprising: 
logic, executed by the processor, for determining, using an attention mechanism of the first computational model, attention features corresponding to a first feature combination from a plurality of first features constituting the first feature combination (see Fig 5; [0107]; and [0111]); and
logic, executed by the processor, for combining the attention features corresponding to the first feature combination to obtain the semantically relevant feature combination when the number of the attention features corresponding to the first feature combination is greater than a first preset threshold (see Fig 5; [0107]; and [0111]).
Referring to claim 20, Polak discloses the method of claim 1, the plurality of feature combinations comprising a third feature combination and determining the third feature combination comprising:
logic, executed by the processor, for sampling a feature from each feature set in the respective feature sets corresponding to the at least two types of modality information (see Fig 1, steps 108 and 110);
logic, executed by the processor, for linearly mapping the feature selected from each feature set to obtain a plurality of features of the same dimensionality (see Fig 1, steps 112 and 114); and
logic, executed by the processor, for combining the plurality of features of the same dimensionality into the third feature combination (see Fig 1, steps 112, 114 and 116 and Fig 5).
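
For orientation only, the sampling, linear mapping, and combining steps recited in claims 10, 15 and 20 can be sketched as follows. This is a minimal illustration of the claim language as read by the examiner; it is not taken from Polak or from the applicant's specification. The function names, the toy feature values, and the weight matrices are all hypothetical, and the linear map is assumed to be a plain matrix-vector product with no bias term.

```python
import random


def linear_map(weights, feature):
    """Apply a linear map (matrix-vector product) to one feature vector."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]


def build_feature_combination(feature_sets, weights_by_modality, rng):
    """Sample one feature per modality, map each to a shared size, and combine."""
    combination = []
    for modality, features in feature_sets.items():
        sampled = rng.choice(features)  # sampling step
        mapped = linear_map(weights_by_modality[modality], sampled)  # mapping step
        combination.append(mapped)  # combining step
    return combination


# Hypothetical toy data: image features are 3-dimensional, text features 2-dimensional.
feature_sets = {
    "image": [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]],
    "text": [[1.0, 2.0], [3.0, 4.0]],
}
# Hypothetical weight matrices that project both modalities to 2 dimensions.
weights_by_modality = {
    "image": [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
    "text": [[1.0, 0.0], [0.0, 1.0]],
}

combination = build_feature_combination(
    feature_sets, weights_by_modality, random.Random(0)
)
```

Under these assumptions, `combination` holds one mapped vector per modality, each of the same dimensionality, which corresponds to the "third feature combination" as the claims recite it.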

Allowable Subject Matter
Claims 3-9 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
As allowable subject matter has been indicated, applicant's reply must either comply with all formal requirements or specifically traverse each requirement not complied with.  See 37 CFR 1.111(b) and MPEP § 707.07(a).

Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KIMBERLY LOVEL WILSON whose telephone number is (571)272-2750. The examiner can normally be reached 8-4:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Robert Beausoliel, can be reached on 571-272-3645. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/KIMBERLY L WILSON/Primary Examiner, Art Unit 2167