Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on December 11, 2020 has been entered. 


Claims 1-20 are pending.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 4-6, 8, 11, 12, 15, 16 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Boquet et al. (Pub. No.: US 2019/0377823) in view of Curtis et al. (Pub. No.: US 2010/0088726) and Bou et al. (Pub. No.: US 2019/0258671).
Regarding claim 1, Boquet discloses a non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause a computer system to: generate, utilizing a neural network, feature vectors for images (Fig. 1, elements 20 and 30, paras. [0028]-[0033], Figs. 2A-2C.); identify a set of tagged feature vectors corresponding to a set of media content items (Figs. 1 and 3, element 40, para. [0040]; “The matching module 50 compares the vector to a pre-existing set of reference data 40. The reference data points (numeric vectors each having n dimensions) are based on reference documents with known features;” and [0054]; “At step 310, the features of a subject document are extracted by a feature extraction module, resulting in a numeric vector representation of the subject document. That numeric vector representation and the grouped reference data 40 is passed to the matching module. At step 320, the matching module determines the matching grouping for the subject document, and at step 330, the subject document is associated with that matching grouping.”); select one or more tagged feature vectors from a set of tagged feature vectors based on distances between the feature vectors and the one or more tagged feature vectors from the set of tagged feature vectors (Fig. 1, elements 40 and 50, paras. [0040]-[0046]).  
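Boquet does not disclose wherein the instructions cause the computer system to extract a set of frames from a video; and thus does not disclose generating feature vectors for the set of frames, nor generating a set of tags to associate with the video by: selecting tags from the one or more tagged feature vectors. However, in analogous art, Curtis discloses that when generating recommended tags, the system will use “tags used by or recommended to other users for video items 34 in the video repository 24 that have audio and/or video content similar to that of the video item 34 (para. [0037]),” which teaches that tags may be selected from tagged related videos, which when combined with the teaching of Boquet, can be seen as tagged feature vectors. Therefore, it would have been obvious to one of ordinary skill in the art at the time of the invention to modify Boquet to allow for extracting a set of frames from a video, generating feature vectors for the set of frames, and generating a set of tags to associate with the video by selecting tags from the one or more tagged feature vectors. This would have produced predictable and desirable results, in that it would allow for the improvements of Boquet to be used in a wider variety of situations, such as with video, as well as allowing for tags of related videos to be used to associate with a given video.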
It could be argued that Boquet and Curtis do not explicitly disclose generating a set of tags to associate with the video by aggregating the tags selected from the one or more tagged feature vectors. However, in analogous art, Bou discloses a dataset with “several training videos, each of which is labeled with one or more tags. However, the dataset does not contain information about where each tag occurs in the sequence. Our task is to classify whether an unknown test video contains each one of these tags. We use a weakly-supervised approach where a neural network predicts the tags of each frame independently and an aggregation layer computes the tags for the whole video based on the individual tags of each frame (para. [0028]).” Therefore, it would have been obvious to one of ordinary skill in the art at the time of the invention to modify Boquet and Curtis to allow for generating a set of tags to associate with the video by aggregating the tags selected from the one or more tagged feature vectors. This would have produced predictable and desirable results, in that it would allow for a video to have a plurality of tags related to the entire video, rather than only different tags related to different portions of the video, which could increase the effectiveness of the tags in terms of relaying information to interested parties.
Regarding claim 2, the combination of Boquet, Curtis and Bou discloses the non-transitory computer-readable medium of claim 1, and further discloses further comprising instructions that, when executed by the at least one processor, cause the computer system to: generate the feature vectors for the set of frames by: generating, utilizing the neural network, a set of initial feature vectors, wherein the set of initial feature vectors comprise a feature vector for each frame from the set of frames; and generating an aggregated feature vector based on the set of initial feature vectors; and select the one or more tagged feature vectors from the set of tagged feature vectors based on distances between the aggregated feature vector and the one or more tagged feature vectors from the set of tagged feature vectors (Boquet, Figs 1 and 3, paras. [0028]-[0046]; Bou, para. [0028]. This claim is rejected on the same grounds as claim 1.).
Regarding claim 4, the combination of Boquet, Curtis and Bou discloses the non-transitory computer-readable medium of claim 1, and further discloses further comprising instructions that, when executed by the at least one processor, cause the computer system to: select the one or more tagged feature vectors from the set of tagged feature vectors by: determining distance values between the feature vectors and the one or more tagged feature vectors from the set of tagged feature vectors; and selecting the one or more tagged feature vectors that correspond to distance values that meet a threshold distance value (Boquet, para. [0008]-[0016], language of claim 4; Bou, para. [0028]. This claim is rejected on the same grounds as claim 1.).
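For illustration only, the threshold-based selection recited in claim 4 can be sketched as follows. This is a minimal sketch and is not drawn from Boquet, Bou, or the application: the function names, the use of Euclidean distance, and the reading of "meet a threshold" as "less than or equal to" are all assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_within_threshold(query, tagged_vectors, threshold):
    """Return (tags, distance) pairs for tagged vectors within the threshold.

    tagged_vectors is a list of (vector, tags) pairs; "meeting" the threshold
    is assumed here to mean distance <= threshold.
    """
    selected = []
    for vector, tags in tagged_vectors:
        d = euclidean(query, vector)
        if d <= threshold:
            selected.append((tags, d))
    return selected


# Hypothetical reference data: two tagged feature vectors.
reference = [
    ([0.0, 0.0], ["running"]),
    ([3.0, 4.0], ["swimming"]),
]
matches = select_within_threshold([0.0, 1.0], reference, threshold=2.0)
print(matches)
```

Only the first reference vector lies within the threshold distance of the query, so only its tags are selected.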
Regarding claim 5, the combination of Boquet, Curtis and Bou discloses the non-transitory computer-readable medium of claim 1, and further discloses further comprising instructions that, when executed by the at least one processor, cause the computer system to identify the set of tagged feature vectors from one or more videos associated with actions (Boquet, Figs. 1 and 3, paras. [0040] and [0054]; Bou, paras. [0022]-[0024]. This claim is rejected on the same grounds as claim 1.).
Regarding claim 6, the combination of Boquet, Curtis and Bou discloses the non-transitory computer-readable medium of claim 1, and further discloses further comprising instructions that, when executed by the at least one processor, cause the computer system to generate the set of tags to associate with the video by aggregating action based tags corresponding to the one or more tagged feature vectors (Boquet, Figs. 1 and 3, paras. [0040] and [0054]; Bou, paras. [0022]-[0024]. This claim is rejected on the same grounds as claim 1.).
Regarding claim 8, the combination of Boquet, Curtis and Bou discloses the non-transitory computer-readable medium of claim 1, and further discloses further comprising instructions that, when executed by the at least one processor, cause the computer system to associate the set of tags with a temporal segment of the video comprising the set of frames (Bou, paras. [0025] and [0057]. This claim is rejected on the same grounds as claim 1.).
Regarding claim 11, Boquet discloses a system comprising: memory comprising a neural network and a set of tagged feature vectors corresponding to a set of media content items (Figs. 1 and 3, element 40, para. [0040]; “The matching module 50 compares the vector to a pre-existing set of reference data 40. The reference data points (numeric vectors each having n dimensions) are based on reference documents with known features;” and [0054]; “At step 310, the features of a subject document are extracted by a feature extraction module, resulting in a numeric vector representation of the subject document. That numeric vector representation and the grouped reference data 40 is passed to the matching module. At step 320, the matching module determines the matching grouping for the subject document, and at step 330, the subject document is associated with that matching grouping.”); at least one processor (para. [0058]); and at least one non-transitory computer-readable storage medium storing instructions that, when executed by the at least one processor, cause the system to: generate, utilizing the neural network, feature vectors for images (Fig. 1, elements 20 and 30, paras. [0028]-[0033], Figs. 2A-2C.); generate a set of tags to associate with the images by: determining distance values between the feature vectors for the images and one or more tagged feature vectors from the set of tagged feature vectors (paras. [0008]-[0016], language of claim 4); selecting tags from the one or more tagged feature vectors from the set of tagged feature vectors based on the determined distance values (Fig. 1, elements 40 and 50, paras. [0040]-[0046]).
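Boquet does not disclose wherein the instructions cause the computer system to extract a set of frames from a video; and thus does not disclose generating feature vectors for the set of frames, nor generating a set of tags to associate with the video by determining distance values between the feature vectors for the set of frames and one or more tagged feature vectors from the set of tagged feature vectors. However, in analogous art, Curtis discloses that when generating recommended tags, the system will use “tags used by or recommended to other users for video items 34 in the video repository 24 that have audio and/or video content similar to that of the video item 34 (para. [0037]),” which teaches that tags may be selected from tagged related videos, which when combined with the teaching of Boquet, can be seen as tagged feature vectors. Therefore, it would have been obvious to one of ordinary skill in the art at the time of the invention to modify Boquet to allow for extracting a set of frames from a video, generating feature vectors for the set of frames, and generating a set of tags to associate with the video by determining distance values between the feature vectors for the set of frames and one or more tagged feature vectors from the set of tagged feature vectors. This would have produced predictable and desirable results, in that it would allow for the improvements of Boquet to be used in a wider variety of situations, such as with video, as well as allowing for tags of related videos to be used to associate with a given video.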
It could be argued that Boquet and Curtis do not explicitly disclose generating a set of tags to associate with the video by aggregating the tags selected from the one or more tagged feature vectors. However, in analogous art, Bou discloses a dataset with “several training videos, each of which is labeled with one or more tags. However, the dataset does not contain information about where each tag occurs in the sequence. Our task is to classify whether an unknown test video contains each one of these tags. We use a weakly-supervised approach where a neural network predicts the tags of each frame independently and an aggregation layer computes the tags for the whole video based on the individual tags of each frame (para. [0028]).” Therefore, it would have been obvious to one of ordinary skill in the art at the time of the invention to modify Boquet and Curtis to allow for generating a set of tags to associate with the video by aggregating the tags selected from the one or more tagged feature vectors. This would have produced predictable and desirable results, in that it would allow for a video to have a plurality of tags related to the entire video, rather than only different tags related to different portions of the video, which could increase the effectiveness of the tags in terms of relaying information to interested parties.
Regarding claim 12, the combination of Boquet, Curtis and Bou discloses the system of claim 11, and further discloses further comprising instructions that, when executed by the at least one processor, cause the system to: generate the feature vectors for the set of frames by: generating, utilizing the neural network, a set of initial feature vectors, wherein the set of initial feature vectors comprise a feature vector for each frame from the set of frames; and generating an aggregated feature vector based on the set of initial feature vectors; and generate the set of tags associated with the video by determining the distance values between the aggregated feature vector and the one or more tagged feature vectors from the set of tagged feature vectors (Boquet, Figs 1 and 3, paras. [0028]-[0046]; Bou, para. [0028]. This claim is rejected on the same grounds as claim 11.).
Regarding claim 15, the combination of Boquet, Curtis and Bou discloses the system of claim 11, and further discloses further comprising instructions that, when executed by the at least one processor, cause the system to generate the set of tags to associate the set of tags with a temporal segment of the video comprising the set of frames (Bou, paras. [0025] and [0057]. This claim is rejected on the same grounds as claim 1.).
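Regarding claim 16, Boquet discloses a computer-implemented method for automatic tagging of videos, the computer-implemented method comprising: generating, utilizing a neural network, feature vectors for images (Fig. 1, elements 20 and 30, paras. [0028]-[0033], Figs. 2A-2C.); determining one or more tagged feature vectors similar to a feature vector (paras. [0008]-[0016], language of claim 4), wherein the one or more tagged feature vectors are associated with an identified set of tagged feature vectors corresponding to a set of media content items (Figs. 1 and 3, element 40, para. [0040]; “The matching module 50 compares the vector to a pre-existing set of reference data 40. The reference data points (numeric vectors each having n dimensions) are based on reference documents with known features;” and [0054]; “At step 310, the features of a subject document are extracted by a feature extraction module, resulting in a numeric vector representation of the subject document. That numeric vector representation and the grouped reference data 40 is passed to the matching module. At step 320, the matching module determines the matching grouping for the subject document, and at step 330, the subject document is associated with that matching grouping.”); and generating a set of tags to associate with the images from the one or more tagged feature vectors (Fig. 1, elements 40 and 50, paras. [0040]-[0046]).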
Boquet does not disclose extracting a set of frames from a video; and thus does not disclose generating feature vectors for the set of frames, nor generating a set of tags to associate with the video by: selecting tags from the one or more tagged feature vectors. However, in analogous art, Curtis discloses that when generating recommended tags, the system will use “tags used by or recommended to other users for video items 34 in the video repository 24 that have audio and/or video content similar to that of the video item 34 (para. [0037]),” which teaches that tags may be selected from tagged related videos, which when combined with the teaching of Boquet, can be seen as tagged feature vectors. Therefore, it would have been obvious to one of ordinary skill in the art at the time of the invention to modify Boquet to allow for extracting a set of frames from a video, generating feature vectors for the set of frames, and generating a set of tags to associate with the video by selecting tags from the one or more tagged feature vectors. This would have produced predictable and desirable results, in that it would allow for the improvements of Boquet to be used in a wider variety of situations, such as with video, as well as allowing for tags of related videos to be used to associate with a given video.
It could be argued that Boquet and Curtis do not explicitly disclose performing a step for generating an aggregated feature vector from the feature vectors, determining one or more tagged feature vectors similar to the aggregated feature vector, nor generating a set of tags to associate with the video by aggregating the tags selected from the one or more tagged feature vectors. However, in analogous art, Bou discloses a dataset with “several training videos, each of which is labeled with one or more tags. However, the dataset does not contain information about where each tag occurs in the sequence. Our task is to classify whether an unknown test video contains each one of these tags. We use a weakly-supervised approach where a neural network predicts the tags of each frame independently and an aggregation layer computes the tags for the whole video based on the individual tags of each frame (para. [0028]).” Therefore, it would have been obvious to one of ordinary skill in the art at the time of the invention to modify Boquet and Curtis to allow for performing a step for generating an aggregated feature vector from the feature vectors, determining one or more tagged feature vectors similar to the aggregated feature vector, and generating a set of tags to associate with the video by aggregating the tags selected from the one or more tagged feature vectors. This would have produced predictable and desirable results, in that it would allow for a video to have a plurality of tags related to the entire video, rather than only different tags related to different portions of the video, which could increase the effectiveness of the tags in terms of relaying information to interested parties.
Regarding claim 18, the combination of Boquet, Curtis and Bou discloses the computer-implemented method of claim 17, and further discloses wherein generating the set of tags to associate with the video from the one or more tagged feature vectors comprises aggregating tags corresponding to the one or more tagged feature vectors (Boquet, Figs 1 and 3, paras. [0028]-[0046]; Bou, para. [0028]. This claim is rejected on the same grounds as claim 11.).
Regarding claim 19, the combination of Boquet, Curtis and Bou discloses the computer-implemented method of claim 16, and further discloses further comprising generating the set of tags to associate with the video from action based tags corresponding to the one or more tagged feature vectors (Boquet, Figs. 1 and 3, paras. [0040] and [0054]; Bou, paras. [0022]-[0024]. This claim is rejected on the same grounds as claim 16.).
Regarding claim 20, the combination of Boquet, Curtis and Bou discloses the computer-implemented method of claim 16, and further discloses further comprising identifying the set of tagged feature vectors from one or more videos associated with actions (Boquet, Figs. 1 and 3, paras. [0040] and [0054]; Bou, paras. [0022]-[0024]. This claim is rejected on the same grounds as claim 16.).


Claims 3 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Boquet et al. (Pub. No.: US 2019/0377823) in view of Curtis et al. (Pub. No.: US 2010/0088726) and Bou et al. (Pub. No.: US 2019/0258671), and further in view of Dal Mutto et al. (Pub. No.: US 2019/0108396).
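Regarding claim 3, the combination of Boquet, Curtis and Bou discloses the non-transitory computer-readable medium of claim 2, but does not explicitly disclose wherein generating the aggregated feature vector comprises combining the set of initial feature vectors utilizing max pooling. However, in analogous art, Dal Mutto discloses the concept of max pooling (paras. [0133], [0141], [0142] and [0246]). Therefore, it would have been obvious to one of ordinary skill in the art at the time of the invention to modify Boquet, Curtis and Bou to allow for generating the aggregated feature vector to comprise combining the set of initial feature vectors utilizing max pooling. This would have produced predictable and desirable results, in that it would allow for a well-known technique for combining feature vectors to be used.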
Regarding claim 17, the combination of Boquet, Curtis and Bou discloses the computer-implemented method of claim 16, but does not explicitly disclose wherein performing the step for generating the aggregated feature vector from the feature vectors comprises combining the feature vectors by utilizing max pooling. However, in analogous art, Dal Mutto discloses the concept of max pooling (paras. [0133], [0141], [0142] and [0246]). Therefore, it would have been obvious to one of ordinary skill in the art at the time of the invention to modify Boquet, Curtis and Bou to allow for generating the aggregated feature vector to comprise combining the set of initial feature vectors utilizing max pooling. This would have produced predictable and desirable results, in that it would allow for a well-known technique for combining feature vectors to be used.
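For illustration only, combining per-frame feature vectors by max pooling can be sketched as taking the element-wise maximum across frames. This is a minimal sketch under assumed names (max_pool, frames) and is not taken from Dal Mutto or from the application.

```python
def max_pool(frame_vectors):
    """Combine equal-length per-frame vectors by element-wise maximum."""
    if not frame_vectors:
        raise ValueError("at least one feature vector is required")
    # zip(*...) groups the i-th component of every frame vector together.
    return [max(column) for column in zip(*frame_vectors)]


# Hypothetical feature vectors for three frames of a video.
frames = [
    [0.1, 0.9, 0.3],
    [0.7, 0.2, 0.3],
    [0.4, 0.5, 0.8],
]
aggregated = max_pool(frames)
print(aggregated)
```

Each component of the aggregated vector is the largest value that any frame produced for that component.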


Claims 7 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Boquet et al. (Pub. No.: US 2019/0377823) in view of Curtis et al. (Pub. No.: US 2010/0088726) and Bou et al. (Pub. No.: US 2019/0258671), and further in view of Verdejo et al. (Pub. No.: US 2018/0082122).
Regarding claim 7, the combination of Boquet, Curtis and Bou discloses the non-transitory computer-readable medium of claim 1, but does not explicitly disclose further comprising instructions that, when executed by the at least one processor, cause the computer system to identify the set of tagged feature vectors by: identifying a media content item comprising text representing one or more verbs; generating, utilizing the neural network, a tagged feature vector for the media content item; assigning tags to the tagged feature vector by assigning the one or more verbs to the tagged feature vector; and associating the tagged feature vector with the set of tagged feature vectors. However, in analogous art, Verdejo discloses that “analytics system 205 may associate tags with words included in the first data (e.g., based on tag association rules). In some implementations, the tag association rules may specify a manner in which the tags are to be associated with words, or based on characteristics of the words. For example, a tag association rule may specify that a singular noun tag (“/NN”) is to be associated with words that are singular nouns (e.g., based on a language database or a context analysis). In some implementations, a tag may include a part-of-speech (POS) tag, such as NN (noun, singular or mass), NNS (noun, plural), NNP (proper noun, singular), NNPS (proper noun, plural), VB (verb, base form), VBD (verb, past tense), VBG (verb, gerund or present participle), and/or the like (para. [0066]).” Therefore, it would have been obvious to one of ordinary skill in the art at the time of the invention to modify Boquet, Curtis and Bou to allow for the computer system to identify the set of tagged feature vectors by identifying a media content item comprising text representing one or more verbs, generating, utilizing the neural network, a tagged feature vector for the media content item, assigning tags to the tagged feature vector by assigning the one or more verbs to the tagged feature vector, and associating the tagged feature vector with the set of tagged feature vectors. This would have produced predictable and desirable results, in that it would allow for more desired words and/or concepts to be found with greater specificity.
Regarding claim 10, the combination of Boquet, Curtis, Bou and Verdejo discloses the non-transitory computer-readable medium of claim 7, and further discloses further comprising instructions that, when executed by the at least one processor, cause the computer system to identify the media content item comprising text representing one or more verbs by identifying one or more gerunds within text associated with the media content item (Verdejo, para. [0066]. This claim is rejected on the same grounds as claim 7.).


Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Boquet et al. (Pub. No.: US 2019/0377823) in view of Curtis et al. (Pub. No.: US 2010/0088726) and Bou et al. (Pub. No.: US 2019/0258671), and further in view of Li et al. (Pub. No.: US 2017/0047096).
Regarding claim 9, the combination of Boquet, Curtis and Bou discloses the non-transitory computer-readable medium of claim 8, but does not explicitly disclose further comprising instructions that, when executed by the at least one processor, cause the computer system to: provide a graphical user interface displaying the video; provide a timeline for the video in the graphical user interface; and place a tag indicator associated with a tag of the set of tags on the timeline at a position corresponding to the temporal segment of the video. However, in analogous art, Li discloses a GUI with tag indicators relating to video segments on a timeline (Figs. 5 and 6, paras. [0036]-[0044]). Therefore, it would have been obvious to one of ordinary skill in the art at the time of the invention to modify Boquet, Curtis and Bou to allow for providing a graphical user interface displaying the video, providing a timeline for the video in the graphical user interface, and placing a tag indicator associated with a tag of the set of tags on the timeline at a position corresponding to the temporal segment of the video.


Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over Boquet et al. (Pub. No.: US 2019/0377823) in view of Curtis et al. (Pub. No.: US 2010/0088726) and Bou et al. (Pub. No.: US 2019/0258671), and further in view of Frischholz et al. (Pub. No.: US 2018/0144184).
Regarding claim 13, the combination of Boquet, Curtis and Bou discloses the system of claim 12, but does not explicitly disclose wherein generating the aggregated feature vector comprises combining the set of initial feature vectors utilizing averaging. However, in analogous art, Frischholz discloses “The first feature vector and the second feature vector can be combined for creating a second template feature vector, for example by summing the individual vector components or averaging in step 44 (para. [0042]).” Therefore, it would have been obvious to one of ordinary skill in the art at the time of the invention to modify Boquet, Curtis and Bou to allow for generating the aggregated feature vector by combining the set of initial feature vectors utilizing averaging. This would have produced predictable and desirable results, in that it would allow for a well-known method of combining vectors to be used.
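For illustration only, combining feature vectors by averaging can be sketched as a component-wise mean. This is a minimal sketch under assumed names (average_vectors, frames) and is not taken from Frischholz or from the application.

```python
def average_vectors(vectors):
    """Combine equal-length vectors by averaging each component."""
    if not vectors:
        raise ValueError("at least one feature vector is required")
    count = len(vectors)
    # Sum each component across all vectors, then divide by the vector count.
    return [sum(column) / count for column in zip(*vectors)]


# Hypothetical feature vectors for two frames of a video.
frames = [
    [1.0, 2.0],
    [3.0, 6.0],
]
aggregated = average_vectors(frames)
print(aggregated)
```

Summing the individual vector components, the alternative named in the quoted passage, would be the same computation without the division.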


Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Boquet et al. (Pub. No.: US 2019/0377823) in view of Curtis et al. (Pub. No.: US 2010/0088726) and Bou et al. (Pub. No.: US 2019/0258671), and further in view of Lee et al. (Pub. No.: US 2011/0205359).
Regarding claim 14, the combination of Boquet, Curtis and Bou discloses the system of claim 11, but does not explicitly disclose further comprising instructions that, when executed by the at least one processor, cause the system to select the one or more tags associated with the one or more tagged feature vectors from the set of tagged feature vectors based on the determined distance values by utilizing a k-nearest neighbor algorithm. However, in analogous art, Lee discloses that a k-nearest neighbor search can determine distance values between vectors (para. [0083]). Therefore, it would have been obvious to one of ordinary skill in the art at the time of the invention to modify Boquet, Curtis and Bou to allow for the system to select the one or more tags associated with the one or more tagged feature vectors from the set of tagged feature vectors based on the determined distance values by utilizing a k-nearest neighbor algorithm. This would have produced predictable and desirable results, in that it would allow for a well-known method of determining distance values to be used.
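For illustration only, selecting tags with a k-nearest neighbor search can be sketched as ranking the tagged feature vectors by distance to a query vector and collecting the tags of the k closest. This is a minimal sketch and is not taken from Lee or from the application: the function name knn_tags, the Euclidean metric, and the ordering of tags by how often they occur among the neighbors are assumptions.

```python
import math
from collections import Counter


def knn_tags(query, tagged_vectors, k):
    """Return tags of the k nearest tagged vectors, most frequent first.

    tagged_vectors is a list of (vector, tags) pairs; Euclidean distance
    (math.dist, Python 3.8+) is an assumed choice of metric.
    """
    ranked = sorted(tagged_vectors, key=lambda item: math.dist(query, item[0]))
    counts = Counter(tag for _, tags in ranked[:k] for tag in tags)
    return [tag for tag, _ in counts.most_common()]


# Hypothetical reference data: three tagged feature vectors.
reference = [
    ([0.0, 0.0], ["running"]),
    ([1.0, 0.0], ["running", "outdoors"]),
    ([10.0, 10.0], ["cooking"]),
]
tags = knn_tags([0.2, 0.0], reference, k=2)
print(tags)
```

With k set to 2, the distant third vector is excluded, so its tag does not appear in the result.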


Response to Arguments
Applicant’s arguments filed on December 11, 2020 with respect to all claims have been considered but are moot based on the new ground of rejection in view of Curtis.


Conclusion
Claims 1-20 are rejected.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Joshua D Taylor whose telephone number is (571)270-3755.  The examiner can normally be reached on Monday - Friday 8 am - 6 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Nasser Goodarzi, can be reached on 571-272-4195.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Joshua D Taylor/
Primary Examiner, Art Unit 2426
February 11, 2021