Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim(s) 2, 4, 9, 11, 16, and 18 is/are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claim 2 recites the term “text segment”, where a text segment comprises “a selection from the group consisting of: a word and a phrase”.  Claim 4 depends on claim 2 and recites the term “text element”, where a “text segment comprises a plurality of text elements” and a text element comprises “a selection from the group consisting of: a word and a phrase”.  Since both the recited “text segment” and the recited “text element” are defined as selected from “a word and a phrase”, there would appear to be no patentable distinction between the two terms.  It is unclear, therefore, what differentiates a “text segment” from a “text element” as claimed. Additionally, if a text segment is defined as “a word” or “a phrase”, it is unclear how a text segment could comprise “a plurality of text elements”.
Paragraph [0069] of the specification describes a text segment as possibly “corresponding to a sentence”, and that a sentence “may include a plurality of text elements”.  However, since claim 2 specifically defines a text segment as one of “a word and a phrase”, it is unclear whether this interpretation is what is being claimed.
For the purposes of examination, the term “text segment” will be interpreted as a plurality of words or phrases, such as a sentence. The term “text element” will be interpreted as one word or phrase found in a text segment.
Furthermore, with respect to claim 4, the claim recites “the plurality of words” in lines 3-4 of the claim.  There is insufficient antecedent basis for this limitation in the claim.

Claims 9, 11, 16, and 18 recite similar limitations and are rejected for the same reasons as claims 2 and 4.


Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1-2, 4-5, 8-9, 11-12, 15-16 and 18-19 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Fink et al. (U.S. Patent Application Pub. No. 2020/0065589, hereinafter “Fink”).
	In regard to claim 1, Fink discloses a computer-implemented method (Fig. 2, 104) comprising:
extracting, by one or more processors, an audio signal from a video clip (step 202, an audio stream is extracted from a video, paragraph [0031]); 
converting, by one or more processors, the audio signal into a text sequence (step 204, automatic speech recognition (ASR) identifies words from the audio signal, paragraph [0032]); 
selecting, by one or more processors, a first set of keywords from the text sequence, the first set of keywords corresponding to a first audio segment of the audio signal (step 206, the recognized words are filtered to determine candidates for keywords to tag, paragraph [0036]); and 
tagging, by one or more processors, a target video segment of the video clip with the first set of keywords, the target video segment corresponding to the first audio segment (step 208, the determined keywords are used to tag the video, paragraph [0042]).
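For illustration only, the selecting and tagging steps mapped above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Fink or from the application: the names RecognizedWord, VideoSegment, select_keywords, and tag_segment are hypothetical, the speech recognition output is hard-coded rather than produced by an actual ASR engine, and a simple stop-word filter stands in for whatever filtering the reference performs.

```python
from dataclasses import dataclass, field


@dataclass
class RecognizedWord:
    """One word of (hypothetical) ASR output with its audio timestamps in seconds."""
    text: str
    start: float
    end: float


@dataclass
class VideoSegment:
    """A span of the video clip and the keywords it is tagged with."""
    start: float
    end: float
    tags: list = field(default_factory=list)


# Assumed stop-word list; a real filter would be far larger.
STOP_WORDS = {"the", "a", "is", "on", "and"}


def select_keywords(words, stop_words=STOP_WORDS):
    """Keep recognized words that are not stop words, in order, without repeats."""
    keywords = []
    for word in words:
        token = word.text.lower()
        if token not in stop_words and token not in keywords:
            keywords.append(token)
    return keywords


def tag_segment(words):
    """Tag the video span covered by the words' audio with the selected keywords."""
    segment = VideoSegment(
        start=min(w.start for w in words),
        end=max(w.end for w in words),
    )
    segment.tags = select_keywords(words)
    return segment


# Stand-in for the text sequence produced from the extracted audio signal.
asr_output = [
    RecognizedWord("The", 0.0, 0.2),
    RecognizedWord("dog", 0.2, 0.6),
    RecognizedWord("is", 0.6, 0.7),
    RecognizedWord("running", 0.7, 1.2),
]
segment = tag_segment(asr_output)
```

In this sketch the tagged video segment spans the same time interval as the audio from which the keywords were recognized, which is the correspondence the claim recites between the target video segment and the first audio segment.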

In regard to claim 2, Fink discloses extracting the first set of keywords from the text sequence comprises: 
dividing, by one or more processors, the text sequence into a plurality of text segments, the plurality of text segments corresponding to a plurality of audio segments of the audio signal, and a text segment of the plurality of text segments comprising a selection from the group consisting of: a word and a phrase (the recognized words corresponding to the detected audio are separated and output as a list of candidate words, paragraph [0034]); 
selecting, by one or more processors, a text segment comprising at least one target keyword from the plurality of text segments (in step 206, the list of word/tag candidates is filtered to select keywords based on a custom dictionary with words relevant to the subject matter, paragraphs [0040-0041]); and 
determining, by one or more processors, the first set of keywords from the text segment, the first set of keywords comprising the at least one target keyword (relevant words from the custom dictionary are selected as the keywords for tagging the video, paragraphs [0040-0042]).

In regard to claim 4, Fink discloses determining the first set of keywords from the text segment comprises: 
responsive to determining that the text segment comprises a plurality of text elements, determining, by one or more processors, (i) a plurality of importance scores of the plurality of words, and (ii) a text element of the plurality of text elements comprising a selection from the group consisting of: a word and a phrase (each word from the list of word/tag candidates is filtered to determine words that are likely or unlikely to be relevant keywords, paragraph [0036]); and 
selecting, by one or more processors, the first set of keywords from the plurality of words based on the plurality of importance scores (words that are determined to be unlikely to be relevant are discarded, paragraph [0036]).
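As a point of reference for the "importance scores" limitation, one possible reading can be sketched as follows. This is a sketch under stated assumptions only: relative term frequency is used as a stand-in scoring rule, and the function names, stop-word list, and threshold are hypothetical. Neither the claims nor Fink is represented as using this particular score.

```python
from collections import Counter

# Assumed stop-word list for the example.
DEFAULT_STOP_WORDS = frozenset({"the", "a", "is", "and"})


def importance_scores(words, stop_words=DEFAULT_STOP_WORDS):
    """Score each non-stop word by its share of all non-stop words."""
    counts = Counter(w.lower() for w in words if w.lower() not in stop_words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def select_by_score(scores, threshold):
    """Keep words scoring at or above the threshold; discard the rest."""
    return sorted(word for word, score in scores.items() if score >= threshold)


# A text segment made up of a plurality of text elements (here, single words).
words = ["the", "dog", "chases", "the", "ball", "and", "the", "dog", "barks"]
scores = importance_scores(words)
keywords = select_by_score(scores, threshold=0.3)
```

Here "dog" accounts for two of the five non-stop words and is the only element that clears the assumed threshold, so the lower-scoring words are discarded in the same way that words unlikely to be relevant are discarded in the cited passage.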

In regard to claim 5, Fink discloses the first set of keywords further comprises at least one additional selection from the group consisting of: a word selected from the text segment and a phrase selected from the text segment (a plurality of keywords are selected from the list of candidate keywords, paragraph [0042]).

In regard to claim 8, Fink discloses a computer program product comprising: 
one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media (paragraph [0056]), the program instructions comprising: 
program instructions to extract an audio signal from a video clip (step 202, an audio stream is extracted from a video, paragraph [0031]); 
program instructions to convert the audio signal into a text sequence (step 204, automatic speech recognition (ASR) identifies words from the audio signal, paragraph [0032]); 
program instructions to select a first set of keywords from the text sequence, the first set of keywords corresponding to a first audio segment of the audio signal (step 206, the recognized words are filtered to determine candidates for keywords to tag, paragraph [0036]); and 
program instructions to tag a target video segment of the video clip with the first set of keywords, the target video segment corresponding to the first audio segment (step 208, the determined keywords are used to tag the video, paragraph [0042]).

In regard to claim 9, Fink discloses program instructions to extract the first set of keywords from the text sequence comprise: 
program instructions to divide the text sequence into a plurality of text segments, the plurality of text segments corresponding to a plurality of audio segments of the audio signal, and a text segment of the plurality of text segments comprising a selection from the group consisting of: a word and a phrase (the recognized words corresponding to the detected audio are separated and output as a list of candidate words, paragraph [0034]); 
program instructions to select a text segment comprising at least one target keyword from the plurality of text segments (in step 206, the list of word/tag candidates is filtered to select keywords based on a custom dictionary with words relevant to the subject matter, paragraphs [0040-0041]); and 
program instructions to determine the first set of keywords from the text segment, the first set of keywords comprising the at least one target keyword (relevant words from the custom dictionary are selected as the keywords for tagging the video, paragraphs [0040-0042]).

In regard to claim 11, Fink discloses program instructions to determine the first set of keywords from the text segment comprise: 
program instructions to, responsive to determining that the text segment comprises a plurality of text elements, determine (i) a plurality of importance scores of the plurality of words, and (ii) a text element of the plurality of text elements comprising a selection from the group consisting of: a word and a phrase (each word from the list of word/tag candidates is filtered to determine words that are likely or unlikely to be relevant keywords, paragraph [0036]); and 
program instructions to select the first set of keywords from the plurality of words based on the plurality of importance scores (words that are determined to be unlikely to be relevant are discarded, paragraph [0036]).

In regard to claim 12, Fink discloses the first set of keywords further comprises at least one additional selection from the group consisting of: a word selected from the text segment and a phrase selected from the text segment (a plurality of keywords are selected from the list of candidate keywords, paragraph [0042]).

In regard to claim 15, Fink discloses a computer system (Fig. 4, 500) comprising:
one or more computer processors (504, paragraph [0047]), one or more computer readable storage media (storage devices, paragraph [0048]), and program instructions collectively stored on the one or more computer readable storage media for execution by at least one of the one or more computer processors (paragraph [0049]), the program instructions comprising:
program instructions to extract an audio signal from a video clip (step 202, an audio stream is extracted from a video, paragraph [0031]); 
program instructions to convert the audio signal into a text sequence (step 204, automatic speech recognition (ASR) identifies words from the audio signal, paragraph [0032]); 
program instructions to select a first set of keywords from the text sequence, the first set of keywords corresponding to a first audio segment of the audio signal (step 206, the recognized words are filtered to determine candidates for keywords to tag, paragraph [0036]); and 
program instructions to tag a target video segment of the video clip with the first set of keywords, the target video segment corresponding to the first audio segment (step 208, the determined keywords are used to tag the video, paragraph [0042]).

In regard to claim 16, Fink discloses program instructions to extract the first set of keywords from the text sequence comprise: 
program instructions to divide the text sequence into a plurality of text segments, the plurality of text segments corresponding to a plurality of audio segments of the audio signal, and a text segment of the plurality of text segments comprising a selection from the group consisting of: a word and a phrase (the recognized words corresponding to the detected audio are separated and output as a list of candidate words, paragraph [0034]); 
program instructions to select a text segment comprising at least one target keyword from the plurality of text segments (in step 206, the list of word/tag candidates is filtered to select keywords based on a custom dictionary with words relevant to the subject matter, paragraphs [0040-0041]); and 
program instructions to determine the first set of keywords from the text segment, the first set of keywords comprising the at least one target keyword (relevant words from the custom dictionary are selected as the keywords for tagging the video, paragraphs [0040-0042]).

In regard to claim 18, Fink discloses program instructions to determine the first set of keywords from the text segment comprise: 
program instructions to, responsive to determining that the text segment comprises a plurality of text elements, determine (i) a plurality of importance scores of the plurality of words, and (ii) a text element of the plurality of text elements comprising a selection from the group consisting of: a word and a phrase (each word from the list of word/tag candidates is filtered to determine words that are likely or unlikely to be relevant keywords, paragraph [0036]); and 
program instructions to select the first set of keywords from the plurality of words based on the plurality of importance scores (words that are determined to be unlikely to be relevant are discarded, paragraph [0036]).

In regard to claim 19, Fink discloses the first set of keywords further comprises at least one additional selection from the group consisting of: a word selected from the text segment and a phrase selected from the text segment (a plurality of keywords are selected from the list of candidate keywords, paragraph [0042]).



Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 3, 10, and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Fink, in view of Nagrani et al. (Speech2Action: Cross-modal Supervision for Action Recognition, hereinafter “Nagrani”).
In regard to claim 3, Fink does not disclose the target video segment and the first set of keywords are selected for training an action recognition model.
Nagrani discloses a method of training an action recognition model using keywords extracted from audio (see Abstract), wherein:
the target video segment and the first set of keywords are selected for training an action recognition model (labels for video clips are generated by recognizing speech in the corresponding audio, section 4.2); and 
the at least one target keyword comprises a selection from the group consisting of: a word indicating a target action to be recognized by the action recognition model, a phrase indicating the target action to be recognized by the action recognition model, a word indicating a target context related to the target action, and a phrase indicating a target context related to the target action (words and phrases indicative of the action occurring in the video clip are extracted to train the action recognition model, section 4.2).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to select the target video segment and the first set of keywords for training an action recognition model, because it would reduce the effort required to label the action recognition model training data manually, and would further boost the performance of an action recognition model over fully supervised models, as taught by Nagrani (section 1).

In regard to claims 10 and 17, Fink does not disclose the target video segment and the first set of keywords are selected for training an action recognition model.
Nagrani discloses a method of training an action recognition model using keywords extracted from audio (see Abstract), wherein:
the target video segment and the first set of keywords are selected for training an action recognition model (labels for video clips are generated by recognizing speech in the corresponding audio, section 4.2); and 
the at least one target keyword comprises a selection from the group consisting of: a word indicating a target action to be recognized by the action recognition model, a phrase indicating the target action to be recognized by the action recognition model, a word indicating a target context related to the target action, and a phrase indicating a target context related to the target action (words and phrases indicative of the action occurring in the video clip are extracted to train the action recognition model, section 4.2).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to select the target video segment and the first set of keywords for training an action recognition model, because it would reduce the effort required to label the action recognition model training data manually, and would further boost the performance of an action recognition model over fully supervised models, as taught by Nagrani (section 1).



Claim(s) 6-7, 13-14 and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Fink, in view of Zhang et al. (U.S. Patent Application Pub. No. 2018/0249193, hereinafter “Zhang”).
In regard to claim 6, Fink does not disclose merging a second set of keywords with a first set of keywords.
Zhang discloses a method for associating text with video data, wherein tagging a target video segment comprises:
selecting, by one or more processors, a second set of keywords from the text sequence, the second set of keywords corresponding to a second audio segment of the audio signal, the second audio segment adjacent to the first audio segment (text data is generated from speech data associated with a video, paragraphs [0022]-[0025]; the text is used to assign a series of semantic tags to the video, paragraphs [0052-0053]); 
determining, by one or more processors, that the second set of keywords comprises at least one keyword matching with the first set of keywords (the semantic tags are compared to determine whether they may be merged into new semantic tags, paragraphs [0060-0064]); 
responsive to determining that the second set of keywords comprises at least one keyword matching with the first set of keywords, merging, by one or more processors, the second set of keywords with the first set of keywords, creating merged set of keywords (the semantic tags for two consecutive activity scenes are merged, paragraph [0065]); and 
tagging, by one or more processors, the target video segment with the merged set of keywords (the two consecutive scenes are tagged with the merged semantic tag, paragraph [0065]).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to merge adjacent segments responsive to determining a keyword matching between the segments, because it would group adjacent segments in a manner that semantically maximized the tags assigned to the segment, as suggested by Zhang (paragraphs [0064-0065]).
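The merging limitation discussed above (adjacent segments whose keyword sets share at least one keyword are combined and tagged with the merged set) can be sketched as follows. This is an illustrative sketch only: the function name merge_adjacent and the (start, end, keywords) tuple layout are assumptions made for the example, and the code is not represented as the implementation of Zhang or of the application.

```python
def merge_adjacent(segments):
    """Merge time-ordered, adjacent segments whose keyword sets overlap.

    Each segment is a (start, end, keywords) tuple, where keywords is a set.
    A segment is folded into the previous one when the two sets share at
    least one keyword; the merged segment carries the union of both sets.
    """
    merged = []
    for start, end, keywords in segments:
        if merged and merged[-1][2] & keywords:
            prev_start, _, prev_keywords = merged[-1]
            merged[-1] = (prev_start, end, prev_keywords | keywords)
        else:
            merged.append((start, end, set(keywords)))
    return merged


# Three adjacent segments; the first two share the keyword "dog".
segments = [
    (0.0, 4.0, {"dog", "park"}),
    (4.0, 9.0, {"dog", "ball"}),
    (9.0, 12.0, {"car"}),
]
result = merge_adjacent(segments)
```

In this example the first two segments are merged into one segment spanning 0.0 to 9.0 seconds and tagged with the union of their keywords, while the third segment, which shares no keyword with its neighbor, is left unchanged.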

In regard to claim 7, although Fink discloses context is used to relate keywords such as “dog” and “poodle” (paragraph [0038]), Fink does not expressly disclose performing semantic analysis on the text sequence using a semantic analysis model.
Zhang discloses extracting, by one or more processors, the first set of keywords by performing semantic analysis on the text sequence using a semantic analysis model (semantic tagging of a text sequence, paragraphs [0026-0035]).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to extract the first set of keywords by performing semantic analysis on the text sequence using a semantic analysis model, because semantic analysis allows the tags to summarize the content of the text in a meaningful way, as suggested by Zhang (paragraphs [0052-0053]).

In regard to claims 13 and 20, Fink does not disclose merging a second set of keywords with a first set of keywords.
Zhang discloses a method for associating text with video data, wherein tagging a target video segment comprises:
program instructions to select a second set of keywords from the text sequence, the second set of keywords corresponding to a second audio segment of the audio signal, the second audio segment adjacent to the first audio segment (text data is generated from speech data associated with a video, paragraphs [0022]-[0025]; the text is used to assign a series of semantic tags to the video, paragraphs [0052-0053]); 
program instructions to determine that the second set of keywords comprises at least one keyword matching with the first set of keywords (the semantic tags are compared to determine whether they may be merged into new semantic tags, paragraphs [0060-0064]); 
program instructions to, responsive to determining that the second set of keywords comprises at least one keyword matching with the first set of keywords, merge the second set of keywords with the first set of keywords, creating merged set of keywords (the semantic tags for two consecutive activity scenes are merged, paragraph [0065]); and 
program instructions to tag the target video segment with the merged set of keywords (the two consecutive scenes are tagged with the merged semantic tag, paragraph [0065]).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to merge adjacent segments responsive to determining a keyword matching between the segments, because it would group adjacent segments in a manner that semantically maximized the tags assigned to the segment, as suggested by Zhang (paragraphs [0064-0065]).

In regard to claim 14, although Fink discloses context is used to relate keywords such as “dog” and “poodle” (paragraph [0038]), Fink does not expressly disclose performing semantic analysis on the text sequence using a semantic analysis model.
Zhang discloses program instructions to extract the first set of keywords by performing semantic analysis on the text sequence using a semantic analysis model (semantic tagging of a text sequence, paragraphs [0026-0035]).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to extract the first set of keywords by performing semantic analysis on the text sequence using a semantic analysis model, because semantic analysis allows the tags to summarize the content of the text in a meaningful way, as suggested by Zhang (paragraphs [0052-0053]).


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Ignat et al., Gupta et al., Asano et al., and Miech et al. disclose additional methods for automatically labelling videos for use in training action recognition models. Saigian et al., Bender et al., Chaudhuri et al., Delaney et al., Cooper et al., Lim et al., and Chang et al. disclose additional methods for tagging video segments based on content recognized from corresponding audio.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRIAN LOUIS ALBERTALLI whose telephone number is (571)272-7616. The examiner can normally be reached Mon-Thurs 9AM-3PM (Part time).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





BLA 10/19/22
/BRIAN L ALBERTALLI/               Primary Examiner, Art Unit 2656