Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Response to Arguments
Applicant’s arguments with respect to claim(s) 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 1-7, 13-17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Campbell (US 2017/0351922) in view of Chen et al. (US 2016/0014482). 
Regarding claims 1, 14 and 19, Campbell teaches a computer-implemented method comprising: 
	extracting visual and audio content of a target video (at least paragraphs 36-47 teach that the system, which meets the content extraction module, analyzes the video and audio content. In particular, paragraphs 39 and 41 teach analysis of the audio captured in the video); 
parsing the target video into segments based on the extracted visual and audio content (at least paragraphs 36-47 teach that events of interest are identified. Each of the identified events of interest meets the claimed multiple parsed segments); 
generating, by a multimodal fragment generation model, multimodal fragments corresponding to the segments of the target video, wherein the multimodal fragments comprise visual components and textual components extracted from the segments (paragraphs 64-69 and 72-73 teaches a video summary created based on the template selected as a preference by the user. The video summary includes visual and textual components (spoken words in the audio – see paragraphs 39 and 41) that are linked to other video fragments in the original video sequence. Based on the user preference, the fragments of the full video are grouped into different arrangements based on the template selected by the user); and
determining a nonlinear ordering of the multimodal fragments by comparing the user preference vector with sets of frame embeddings, from the frame embeddings, corresponding to the multimodal fragments (Campbell partially teaches this limitation: paragraphs 64-69 and 72-73 teach a video summary created based on the template selected as a preference by the user. Based on the user preference template, the fragments of the full video are grouped into different sequences. The playback of the video is therefore considered nonlinear, since the playback is shorter than the linear, full-length original video uploaded to the server. However, while the nonlinear ordering of the multimodal fragments is based on a user preference, Campbell is not explicit that the user preference is a user preference vector per se, and therefore the comparison is not explicitly between the user preference vector and the sets of frame embeddings that correspond to the multimodal fragments); and 
causing, at the computing device, playback of the segments in accordance with the nonlinear ordering of the multimodal fragments (paragraphs 24 and 53 teaches wherein the video summary is presented to the user on a user device for playback).
Therefore, Campbell fails to explicitly teach generating, utilizing an embedding model, a user preference vector, based on one or more videos previously viewed by a user of a computing device, and frame embeddings from frames of the target video; and comparing the user preference vector with sets of frame embeddings, from the frame embeddings, corresponding to the multimodal fragments.
In an analogous art, Chen teaches, in the similar field of video summarization (Figs. 3, 5B, 13-14, 17, and paragraphs 78-79, 90, 128-130, 136 and 143-153), that a personalized playlist is generated (paragraphs 78-79, 90) based on a user’s viewing history (especially paragraph 136) and the corresponding weights when compared to incoming video segments using a vector (paragraph 147).
It is appreciated that the teachings of Chen can be incorporated into Campbell’s system such that the prior viewing history, in the form of vectors, is compared with incoming video to determine which segments or videos are of particular relevance. It would have been obvious to one of ordinary skill in the art before the effective filing date of the current application to incorporate the teachings of Chen into the system of Campbell because said incorporation allows for the benefit of optimizing the personalization of next generation media consumption (see paragraphs 5-6 of Chen).
As to claim 14, the limitations are met by the analysis of claim 1 above. Additionally, Campbell teaches in paragraph 26 a system in which instructions are stored on a computer readable medium and, when executed, perform the steps.
As to claim 19, the analysis of claim 1 above addresses all of the limitations except for “post an uploaded video;” and “receive an indication to play back the uploaded video in accordance with a preference of a target user”.
Campbell additionally teaches these aspects: at least paragraph 23 teaches that the video is uploaded to a video server, and paragraphs 24 and 53 teach that the client can interact with the video server to edit, create and customize a video summary, which includes the ability to play back a summary.
Regarding claim 2, Campbell teaches the claimed wherein generating the multimodal fragments comprises generating, utilizing a pretrained embedding model, a textual component of a multimodal fragment of the multimodal fragments based on comparing, in an embedding space, embedded sentences from a segment of the segments a textual component (paragraphs 64-69 and 72-73 teaches a video summary created based on the template selected as a preference by the user. The video summary includes visual and textual components (spoken words in the audio – see paragraphs 39 and 41) that are linked to other video fragments in the original video sequence. Based on the user preference, the fragments of the full video are grouped into different arrangements based on the template selected by the user. Furthermore, the linking can be based on textual components, in the sense that the spoken words (paragraphs 39 and 41) are matched to the desired output of the video summary. The system performs the functionalities as discussed above and therefore meets the claimed embedding model).
Regarding claim 3, Campbell teaches the claimed wherein generating the multimodal fragments comprises generating a visual component of a multimodal fragment of the multimodal fragments by comparing the user preference vector with frame embeddings from frames of a segment of the segments corresponding to the multimodal fragment (paragraphs 64-69 and 72-73 teaches a video summary created based on the template selected as a preference by the user. The video summary includes visual and textual components (spoken words in the audio – see paragraphs 39 and 41) that are linked to other video fragments in the original video sequence. Based on the user preference, the fragments of the full video are grouped into different arrangements based on the template selected by the user. The linking of the videos are thereby based on the visual components taking place within the video. The system performs the functionalities as discussed above and therefore meets the claimed embedding model).
Regarding claims 4 and 15, Chen teaches the claimed wherein determining the nonlinear ordering of the multimodal fragments comprises: ranking the multimodal fragments by comparing the user preference vector with the sets of frame embedding corresponding to the multimodal fragments; and selecting and ordering the multimodal fragments based on the ranking (Figs. 3, 5B, 13-14, 17, and paragraphs 78-79, 90, 128-130, 136 and 143-153, wherein a personalized playlist is generated (paragraphs 78-79, 90) based on a user’s viewing history (especially paragraph 136) and their corresponding weights when compared to incoming video segments using a vector (paragraph 147). The weighting of the videos when compared to the incoming videos determines a ranking (see at least paragraphs 103, 115, 126 and 195)). The prior motivation as discussed above in claim 1 is incorporated herein.
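For illustration only, the ranking limitation discussed for claims 4 and 15 can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Campbell, Chen, or the application: the use of cosine similarity, the averaging of per-frame scores, and the names cosine and rank_fragments are all hypothetical, and the embeddings are toy two-dimensional values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_fragments(preference, fragments):
    """Rank fragment ids by the mean similarity of their frame embeddings
    to the user preference vector, highest first (hypothetical scoring rule)."""
    scores = {}
    for fragment_id, frame_embeddings in fragments.items():
        total = sum(cosine(preference, emb) for emb in frame_embeddings)
        scores[fragment_id] = total / len(frame_embeddings)
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: segments listed in timeline order, each with its set of frame embeddings.
preference = [1.0, 0.0]
fragments = {
    "seg0": [[0.0, 1.0], [0.1, 0.9]],
    "seg1": [[1.0, 0.0], [0.9, 0.1]],
    "seg2": [[0.7, 0.7]],
}
order = rank_fragments(preference, fragments)
print(order)
```

With this toy data the resulting order departs from the timeline order of the segments, which is the sense in which such an ordering would be nonlinear.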
Regarding claim 6, Chen teaches the claimed wherein generating the multimodal fragments further comprise: determining importance scores for frames of a segment, each importance score comprising one or more of a user preference score, a video context relevance score, or a sentence similarity score; and selecting a frame from the frames of the segment as a visual component for a multimodal fragment corresponding to the segment by comparing the importance scores of the frames (Figs. 3, 5B, 13-14, 17, and paragraphs 78-79, 90, 128-130, 136 and 143-153, wherein a personalized playlist is generated (paragraphs 78-79, 90) based on a user’s viewing history (especially paragraph 136) and their corresponding weights when compared to incoming video segments using a vector (paragraph 147). Additionally, see at least paragraphs 103, 115, 126 and 195 wherein scores for frames are used to select a particular frame). The prior motivation as discussed above in claim 1 is incorporated herein.
Regarding claim 7, Chen teaches the claimed wherein determining the nonlinear ordering of the multimodal fragments further comprises: determining information factors for frames of a segment, each information factor comprising one or more of a user preference factor, a video context relevance factor, or an information diversity factor, and ordering the multimodal fragments based on the information factors of the frames of the corresponding segments (Figs. 3, 5B, 13-14, 17, and paragraphs 78-79, 90, 128-130, 136 and 143-153, wherein a personalized playlist is generated (paragraphs 78-79, 90) based on a user’s viewing history (especially paragraph 136) and their corresponding weights when compared to incoming video segments using a vector (paragraph 147). Additionally, see at least paragraphs 103, 115, 126 and 195 wherein scores for frames are ordered based on at least the alternatively stated video context relevance). The prior motivation as discussed above in claim 1 is incorporated herein.
Regarding claim 13, Chen teaches the claimed wherein ordering the multimodal fragments comprises: determining, utilizing a pretrained processing model, context similarity scores based on a comparison of frame embeddings from frames of the target video with a video context of the target video; and ordering the multimodal fragments based on relevance to context of the video based on the context similarity scores of the frames of the target video corresponding to the multimodal fragments (Figs. 3, 5B, 13-14, 17, and paragraphs 78-79, 90, 128-130, 136 and 143-153, wherein a personalized playlist is generated (paragraphs 78-79, 90) based on a user’s viewing history (especially paragraph 136) and their corresponding weights when compared to incoming video segments using a vector (paragraph 147). The weighting of the videos when compared to the incoming videos determines a ranking (see at least paragraphs 103, 115, 126 and 195)). The prior motivation as discussed above in claim 1 is incorporated herein.
Regarding claim 16, Campbell teaches the claimed further comprising instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising causing, at a computing device, the segmented playback of the segments of the target video in accordance with the ordering for the multimodal fragments (paragraphs 24 and 53 teaches wherein the video summary is presented to the user on a user device for playback).
Regarding claim 17, Campbell teaches the claimed wherein ordering the multimodal fragments further comprises arranging the multimodal fragments according to a nonlinear playback ordering of the segments relative to a timeline of the target video (paragraphs 24 and 53 teach that the video summary is arranged in a nonlinear manner relative to the timeline, since the playback is not linear like that of the original video).
Regarding claims 5 and 20, Chen teaches the claimed wherein generating the user preference vector further comprises: generating, utilizing the embedding model, frame embeddings for the one or more videos previously viewed by the user (see paragraphs 136 and 147); and determining, utilizing a frame-level sentiment classifier, a distribution of sentiments indicated by the frame embeddings of the one or more videos (paragraphs 78-79, 90, 128-130, 135-136 and 143-153; especially paragraphs 135 and 162 teach that the videos for suggestion are based on portions of the video that are deemed to be of greater interest/sentiment). The prior motivation as discussed above in claim 1 is incorporated herein.

Claims 8-9 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Campbell (US 2017/0351922) in view of Chen et al. (US 2016/0014482) and further in view of Cameron et al. (US 2018/0032305).
Regarding claim 8, Campbell and Chen teach the claimed invention as discussed in claim 1 above; however, they fail to teach, but Cameron teaches, the claimed wherein determining the nonlinear ordering of the multimodal fragments further comprises: determining, utilizing a pretrained embedding model, sentence coherence scores for sentences of the segments corresponding to the multimodal fragments; and ordering the multimodal fragments based on the sentence coherence scores (paragraphs 162-163 and 247-248 teach that algorithms and sub-algorithms are used to extract audio transcripts of multimedia content, to identify sentences, and to group the sentences based on the structure/coherence).
Campbell’s system utilizes audio transcripts (spoken words) to identify events of interest and Cameron allows for the introduction of identifying sentences and grouping them based on their structure and identifying their start and stop positions (paragraph 193). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Cameron into Campbell such that Campbell, using its existing ability to identify events of interest, is also able to identify segments of the video based on the audio transcripts showing a group of sentences, because such an incorporation allows for the benefit of improving the system by giving text regions a particular importance over other regions (Cameron: paragraph 5).
Regarding claim 9, Campbell and Chen teach the claimed invention as discussed in claim 1 above; however, they fail to teach, but Cameron teaches, the claimed wherein parsing the target video into segments further comprises determining groups of sentences of an audio transcript of the target video based on cosine similarities between embedded sentences from the sentences; and partitioning the target video based on the groups of sentences (paragraphs 162-163 and 247-248 teach that algorithms and sub-algorithms are used to extract audio transcripts of multimedia content, to identify sentences, and to group the sentences based on the structure).
Campbell’s system utilizes audio transcripts (spoken words) to identify events of interest and their corresponding video segments and Cameron allows for the introduction of identifying sentences and grouping them based on their structure and identifying their start and stop positions (paragraph 193). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Cameron into Campbell such that Campbell, using its existing ability to identify events of interest, is also able to identify segments of the video based on the audio transcripts showing a group of sentences, because such an incorporation allows for the benefit of improving the system by giving text regions a particular importance over other regions (Cameron: paragraph 5).
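For illustration only, the claim 9 language reciting groups of sentences determined from cosine similarities between embedded sentences can be sketched as follows. This is a minimal sketch under stated assumptions: the rule of comparing each sentence only with the one before it, the 0.8 threshold, and the names cosine and group_sentences are hypothetical and are not drawn from Cameron or from the application; the embeddings are toy values rather than the output of any pretrained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_sentences(embeddings, threshold=0.8):
    """Split a transcript into groups of consecutive sentence indices.
    A new group starts whenever a sentence's similarity to the previous
    sentence falls below the (hypothetical) threshold."""
    if not embeddings:
        return []
    groups = [[0]]
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

# Toy sentence embeddings in transcript order.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = group_sentences(toy)
print(groups)
```

Each resulting group of sentence indices would then map to a span of the transcript, and the start and stop times of that span could serve as segment boundaries.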
Regarding claim 11, Campbell and Chen teach the claimed invention as discussed in claim 1 above; however, they fail to teach, but Cameron teaches, the claimed wherein parsing the target video into the segments further comprises extracting audio transcripts from the target video; generating embedded sentences by utilizing a pretrained embedding model to encode sentences of the audio transcripts into an embedding space; and parsing the audio transcripts into groups of sentences based on semantic similarities between the embedded sentences (paragraphs 247-248 teach that algorithms and sub-algorithms are used to extract audio transcripts of multimedia content, to identify sentences, and to group the sentences based on the structure).
Campbell’s system utilizes audio transcripts (spoken words) to identify events of interest and Cameron allows for the introduction of identifying sentences and grouping them based on their structure and identifying their start and stop positions (paragraph 193). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Cameron into Campbell such that Campbell, using its existing ability to identify events of interest, is also able to identify segments of the video based on the audio transcripts showing a group of sentences, because such an incorporation allows for the benefit of improving the system by giving text regions a particular importance over other regions (Cameron: paragraph 5).

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Campbell (US 2017/0351922) in view of Chen et al. (US 2016/0014482) and further in view of Younessian (US 2020/0159759).
Regarding claim 10, Campbell and Chen teach the claimed invention as discussed in claim 1 above; however, they fail to teach, but Younessian teaches, the claimed wherein extracting the video and audio content from the target video comprises: 
marking a first frame of the target video based on a difference between color histograms of the first frame and a second frame of the target video, wherein the first frame is adjacent to the second frame; identifying a set of related frames relative to the first frame and the second frame; and selecting a median frame from the set of related frames as a keyframe, wherein the visual component of a given multimodal fragment corresponds to the keyframe for a segment, from the segments of the target video, corresponding to the given multimodal fragment (paragraphs 44, 61-62, 72-73 and 85-86 teach that, to select a keyframe for a given scene, the system selects a median frame of the scene that the keyframe is supposed to represent. Typically, keyframes utilize a first frame (meeting the claimed “first frame”) that meets an evaluation based on the color histogram (see paragraphs 62, 73 and 86). Paragraphs 44, 61, 72 and 85 teach more specifically that the keyframe is selected based on all of the frames that make up the scene itself and that the selected keyframe is a median frame for the particular scene).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Younessian into the combined system of Campbell and Chen such that Campbell also utilizes color histograms to help select a first frame and to select the median frame of a scene as a keyframe because such an incorporation allows for the benefit of improving the user friendliness of the system by allowing a user to identify a label for the segment of the content being viewed (Younessian: paragraph 2).
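For illustration only, the histogram-difference and median-frame steps recited in claim 10 can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce Younessian's method: each frame is modeled as a flat list of 8-bit intensities so that a single-channel histogram stands in for a full color histogram, and the bin count, the 0.5 threshold, and the names histogram, hist_diff and keyframes are hypothetical.

```python
def histogram(frame, bins=4, max_value=256):
    """Normalized intensity histogram of a frame given as a flat list of 0-255 values."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return [c / len(frame) for c in counts]

def hist_diff(h1, h2):
    """L1 distance between two normalized histograms (ranges from 0.0 to 2.0)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def keyframes(frames, threshold=0.5):
    """Mark a frame as a scene start when its histogram differs from the
    adjacent previous frame by more than the threshold, then return the
    index of the median frame of each scene (lower median for even counts)."""
    boundaries = [0]
    for i in range(1, len(frames)):
        if hist_diff(histogram(frames[i - 1]), histogram(frames[i])) > threshold:
            boundaries.append(i)
    boundaries.append(len(frames))
    return [(start + end - 1) // 2 for start, end in zip(boundaries, boundaries[1:])]

# Toy video: three dark frames followed by two bright frames.
dark = [10, 20, 30, 40]
bright = [200, 210, 220, 230]
frames = [dark, dark, dark, bright, bright]
selected = keyframes(frames)
print(selected)
```

In this toy input the histogram difference exceeds the threshold only at the fourth frame, so two scenes are found and one median frame is selected from each.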

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Campbell (US 2017/0351922) in view of Chen et al. (US 2016/0014482) and further in view of Olstad et al. (US 8,296,797).
Regarding claim 12, Campbell and Chen teach the claimed wherein generating multimodal fragments comprises: selecting a representative frame based on a semantic importance score of each frame of a segment of the segments relative to a video context of the target video and a comparison of a set of the sets of frame embeddings corresponding to the segment with the user preference vector, wherein the visual component includes the representative frame (Chen: Figs. 3, 5B, 13-14, 17, and paragraphs 78-79, 90, 128-130, 136 and 143-153, wherein a personalized playlist is generated (paragraphs 78-79, 90) based on a user’s viewing history (especially paragraph 136) and their corresponding weights when compared to incoming video segments using a vector (paragraph 147)).
However, Campbell and Chen fail to teach, but Olstad teaches, the claimed selecting a representative text from a group of sentences extracted from an audio transcript of the video, wherein the textual component includes the representative text (see claim 22, which teaches that, based on a search query, it is determined whether the query terms are present inside “audio tracks of the matching videos” as well, and the selected thumbnail for the search result reflects the search query terms).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Olstad into Campbell and Chen because such an incorporation would allow for the benefit of an improved and more efficient video searching system (Olstad: col. 1, lines 35-61).

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Campbell (US 2017/0351922) in view of Chen et al. (US 2016/0014482) and further in view of Batchu et al. (US 2017/0169128).
Regarding claim 18, Campbell and Chen teach the claimed invention as discussed in claims 1 and 14 above; however, they fail to teach, but Batchu teaches, the claimed further comprising instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising generating a table of contents for multimodal fragments, wherein the table of contents links each multimodal frame of the multimodal fragments to a respective segment of the segments of the target video (paragraph 99).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Batchu into the proposed combined system of Campbell and Chen such that during the playback of the video summary the user is able to see a table of contents because such an incorporation allows for the benefit of improving the user friendliness of the system by allowing the user to be aware of the highlights in the associated video (Batchu: paragraphs 3 and 5-6).

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to GELEK W TOPGYAL whose telephone number is (571)272-8891. The examiner can normally be reached M-F (9:30-6 PST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, William Vaughn, can be reached at 571-272-3922. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/GELEK W TOPGYAL/           Primary Examiner, Art Unit 2481