Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 11/9/2022 has been entered.

Response to Arguments
Applicant’s arguments with respect to claim(s) 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 1-7, 13-17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Campbell (US 2017/0351922) in view of Chao et al. (US 202/0195983). 
Regarding claims 1, 14 and 19, Campbell teaches a computer-implemented method comprising: 
	extracting visual and audio content of a target video (at least paragraphs 36-47 teach that the system, which meets the content extraction module, analyzes the video and audio content. In particular, paragraphs 39 and 41 teach analysis of the audio captured in the video); 
parsing the target video into segments based on the extracted visual and audio content (at least paragraphs 36-47 teach that events of interest are identified. Each of the identified events of interest meets the claimed parsed segments); 
generating, by a multimodal fragment generation model, multimodal fragments corresponding to the segments of the target video, wherein the multimodal fragments comprise visual components and textual components extracted from the segments (paragraphs 64-69 and 72-73 teach a video summary created based on the template selected as a preference by the user. The video summary includes visual and textual components (spoken words in the audio – see paragraphs 39 and 41) that are linked to other video fragments in the original video sequence. Based on the user preference, the fragments of the full video are grouped into different arrangements based on the template selected by the user);
determining a nonlinear ordering of the multimodal fragments by comparing the user preference vector with sets of frame sentiment embeddings for frames of the target video, corresponding to the multimodal fragments (Campbell partially teaches this limitation: paragraphs 64-69 and 72-73 teach a video summary created based on the template selected as a preference by the user. Based on the user preference template, the fragments of the full video are grouped into different sequences. The playback of the video is therefore considered nonlinear, since the playback is shorter than the linear, original full-length video uploaded to the server. However, while the nonlinear ordering of the multimodal fragments is based on a user preference, Campbell does not explicitly teach that the user preference is a user preference vector per se, and therefore the comparison is not explicitly between the user preference vector and the sets of frame embeddings that correspond to the multimodal fragments); and 
causing, at the computing device, playback of the segments in accordance with the nonlinear ordering of the multimodal fragments (paragraphs 24 and 53 teach that the video summary is presented to the user on a user device for playback).
Therefore, Campbell fails to explicitly teach “generating, utilizing a trained machine learning model, frame sentiment embeddings for frames of the target video and a sentiment distribution for one or more videos previously viewed by a user of a computing device” and “generating a user preference vector based on the sentiment distribution for the one or more videos previously viewed by a user of a computing device, and frame embeddings from frames of the target video; and comparing the user preference vector with sets of frame embeddings, from the frame embeddings, corresponding to the multimodal fragments.”
In an analogous art, Chao teaches, in a similar video summarization context (Figs. 1-5 and paragraph 46), that past viewing history and video segments in the database are processed by the system to generate semantic metadata and semantic vectors (502 and 506) by analyzing segments. At least paragraphs 73, 85 and 127 teach the use of learning models to assist in analyzing the segments and the segments in the user history (see para. 46, which teaches using the system for analyzing user viewing history segments as well). Thereafter, once the viewing history is processed, the user’s preference vector is generated as semantic vectors 504 (see paragraphs 135-137 and Fig. 5), which are compared to a list of segments in the segments datastore that are in consideration for personalization. Based on the vector similarity between them, personalized recommendations are made; the segment that is recommended therefore matches the vectors of the past viewing habits. Fig. 5 teaches semantic vectors from user history (501) and semantic vectors from video segments from a segments datastore, compares them for similarity in 507, and generates the personalized segments that best relate to the viewer’s past viewing habits.
It is appreciated that the teachings of Chao can be incorporated into Campbell’s system such that the prior viewing history in the form of vectors (generated using a trained learning model) is compared against incoming video to determine which segments or videos are of particular relevance. It would have been obvious to one of ordinary skill in the art before the effective filing date of the current application to incorporate the teachings of Chao into the system of Campbell because said incorporation allows for the benefit of efficiently retrieving segments and computing their similarity to other segments (see paragraphs 6-11 of Chao).
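For illustration only, the kind of vector comparison discussed above can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Chao or Campbell: the helper names (mean_vector, cosine_similarity), the averaging of history vectors into a single preference vector, the choice of cosine similarity, and all vector values are hypothetical.

```python
import math

def mean_vector(vectors):
    """Average equal-length vectors component-wise (assumed way to pool history)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical semantic vectors for segments the user watched previously.
history_vectors = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]

# Hypothetical semantic vectors for candidate segments under consideration.
candidate_vectors = {
    "seg_a": [0.9, 0.1, 0.0],
    "seg_b": [0.0, 0.0, 1.0],
}

# Pool the viewing history into one preference vector, then score each candidate.
preference_vector = mean_vector(history_vectors)
similarities = {
    name: cosine_similarity(preference_vector, vector)
    for name, vector in candidate_vectors.items()
}
```

In this toy data, the candidate pointing in the same direction as the pooled history scores near 1, while the orthogonal candidate scores 0.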
As to claim 14, the limitations are met by the discussion of claim 1 above; additionally, Campbell teaches a system wherein instructions are stored on a computer readable medium and, when executed, perform the steps (paragraph 26).
As to claim 19, the discussion of claim 1 above addresses all of the limitations except for the limitations to “post an uploaded video;” and “receive an indication to play back the uploaded video in accordance with a preference of a target user”.
Campbell additionally teaches these aspects in at least paragraph 23, wherein the video is uploaded to a video server, and paragraphs 24 and 53, which teach that the client can interact with the video server to edit, create and customize a video summary, which includes the ability to play back a summary.
Regarding claim 2, Campbell teaches the claimed wherein generating the multimodal fragments comprises generating, utilizing a pretrained embedding model, a textual component of a multimodal fragment of the multimodal fragments based on comparing, in an embedding space, embedded sentences from a segment of the segments a textual component (paragraphs 64-69 and 72-73 teaches a video summary created based on the template selected as a preference by the user. The video summary includes visual and textual components (spoken words in the audio – see paragraphs 39 and 41) that are linked to other video fragments in the original video sequence. Based on the user preference, the fragments of the full video are grouped into different arrangements based on the template selected by the user. Furthermore, the linking can be based on textual components, in the sense that the spoken words (paragraphs 39 and 41) are matched to the desired output of the video summary. The system performs the functionalities as discussed above and therefore meets the claimed embedding model).
Regarding claim 3, Campbell teaches the claimed wherein generating the multimodal fragments comprises generating a visual component of a multimodal fragment of the multimodal fragments by comparing the user preference vector with a distribution of frame sentiment embeddings from frames of a segment of the segments corresponding to the multimodal fragment (paragraphs 64-69 and 72-73 teach a video summary created based on the template selected as a preference by the user. The video summary includes visual and textual components (spoken words in the audio – see paragraphs 39 and 41) that are linked to other video fragments in the original video sequence. Based on the user preference, the fragments of the full video are grouped into different arrangements based on the template selected by the user. The linking of the videos is thereby based on the visual components taking place within the video. The system performs the functionalities as discussed above and therefore meets the claimed embedding model).
Regarding claims 4 and 15, Chao teaches the claimed wherein determining the nonlinear ordering of the multimodal fragments comprises: determining a ranking of the multimodal fragments by comparing the user preference vector with the sets of frame sentiment embedding corresponding to the multimodal fragments; and selecting and ordering the multimodal fragments based on the ranking (Fig. 5, step 508 and paragraphs 137-138 teaches the ranking/ordering and selecting based on the ranking). The prior motivation as discussed above in claim 1 is incorporated herein.
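For illustration only, ranking fragments by a similarity score and deriving a playback order from that ranking can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Chao: the function name rank_fragments, the top_k parameter, and the score values are hypothetical.

```python
def rank_fragments(scores, top_k=None):
    """Return fragment indices sorted by descending score.

    Ties keep their original timeline order because sorted() is stable,
    including when reverse=True is used.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order if top_k is None else order[:top_k]

# Hypothetical similarity scores, indexed by the fragment's timeline position.
scores = [0.2, 0.9, 0.5, 0.9]

playback_order = rank_fragments(scores)      # full ranking
top_two = rank_fragments(scores, top_k=2)    # select only the best two
```

The resulting order differs from the timeline order (0, 1, 2, 3), which is the sense in which such a ranking yields a nonlinear ordering.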
Regarding claim 6, Chao teaches the claimed wherein generating the multimodal fragments further comprise: determining importance scores for frames of a segment, each importance score comprising one or more of a user preference score based on comparing the user preference vector with frame sentiment embeddings for the frames of the segment (Fig. 5, step 508 and paragraphs 137-138 teach selecting the similarity with the highest ranked match), a video context relevance score based on comparing contextual information of the target video with the frames of the segment (paragraph 80 teaches image similarity), or a sentence similarity score based on comparing sentences corresponding to the frames of the segment (paragraphs 22-23); and selecting a frame from the frames of the segment as a visual component for a multimodal fragment corresponding to the segment by comparing the importance scores of the frames (Once the viewing history is processed, the user’s preference vector is generated as semantic vectors 504 (see paragraphs 135-137 and Fig. 5), which are compared to a list of segments in the segments datastore that are in consideration for personalization. Based on the vector similarity between them, a personalized recommendation is made; the segment that is recommended therefore matches the vectors of the past viewing habits. Fig. 5 teaches semantic vectors from user history (501) and semantic vectors from video segments from a segments datastore, compares them for similarity in 507, and generates the personalized segments that best relate to the viewer’s past viewing habits). The prior motivation as discussed above in claim 1 is incorporated herein.
Regarding claim 7, Chao teaches the claimed wherein determining the nonlinear ordering of the multimodal fragments further comprises: determining information factors for frames of a segment, each information factor comprising one or more of a user preference factor based on similarities between the user preference vector and frame sentiment embeddings for frames of the segment (Fig. 5, step 508 and paragraphs 137-138 teach selecting the similarity with the highest ranked match), a video context relevance factor based on similarities between contextual information of the target video and the frames of the segment (paragraph 80 teaches image similarity), or an information diversity factor based on differences between frames of the segment and a diversity distribution of segments of the target video (paragraph 80 teaches image similarity; similarly, those that are not similar would be based on differences between the images being compared); and ordering the multimodal fragments based on the information factors of the frames of the corresponding segments (Fig. 5, steps 507-508 and paragraphs 137-138 teach selecting a plurality of segments 506 for possible matches with the user viewing history, ranked according to their semantic vectors. The output of 508 is a list of segments that are ranked by their relevance. Once the viewing history is processed, the user’s preference vector is generated as semantic vectors 504 (see paragraphs 135-137 and Fig. 5), which are compared to a list of segments in the segments datastore that are in consideration for personalization. Based on the vector similarity between them, a personalized recommendation is made; the segment that is recommended therefore matches the vectors of the past viewing habits. Fig. 5 teaches semantic vectors from user history (501) and semantic vectors from video segments from a segments datastore, compares them for similarity in 507, and generates the personalized segments that best relate to the viewer’s past viewing habits. The best matching segments are therefore ranked in the specific order). The prior motivation as discussed above in claim 1 is incorporated herein.
Regarding claim 13, Chao teaches the claimed wherein ordering the multimodal fragments comprises: determining, utilizing a pretrained processing model, context similarity scores based on a comparison of frame embeddings from frames of the target video with a video context of the target video (Fig. 5, steps 507-508 and paragraphs 137-138 teach selecting a plurality of segments 506 for possible matches with the user viewing history); and ordering the multimodal fragments based on relevance to context of the video based on the context similarity scores of the frames of the target video corresponding to the multimodal fragments (Once the viewing history is processed, the user’s preference vector is generated as semantic vectors 504 (see paragraphs 135-137 and Fig. 5), which are compared to a list of segments in the segments datastore that are in consideration for personalization. Based on the vector similarity between them, a personalized recommendation is made; the segment that is recommended therefore matches the vectors of the past viewing habits. Fig. 5, steps 507-508 and paragraphs 137-138 teach selecting a plurality of segments 506 for possible matches with the user viewing history, ranked according to their semantic vectors. The output of 508 is a list of segments that are ranked by their relevance. Fig. 5 teaches semantic vectors from user history (501) and semantic vectors from video segments from a segments datastore, compares them for similarity in 507, and generates the personalized segments that best relate to the viewer’s past viewing habits. The best matching segments are therefore ranked in the specific order). The prior motivation as discussed above in claim 1 is incorporated herein.
Regarding claim 16, Campbell teaches the claimed further comprising instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising causing, at a computing device, the segmented playback of the segments of the target video in accordance with the ordering for the multimodal fragments (paragraphs 24 and 53 teach that the video summary is presented to the user on a user device for playback).
Regarding claim 17, Campbell teaches the claimed wherein ordering the multimodal fragments further comprises arranging the multimodal fragments according to a nonlinear playback ordering of the segments relative to a timeline of the target video (paragraphs 24 and 53 teach that the video summary is tailored in a nonlinear manner relative to the timeline, since the playback is not linear like the original video).
Regarding claims 5 and 20, Chao teaches the claimed wherein generating the user preference vector further comprises: generating, utilizing trained machine learning model, frame sentiment embeddings for the one or more videos previously viewed by the user (Figs. 1-5, and paragraph 46 teaches wherein past viewing history and video segments in the database are processed by the system to generate semantic metadata and semantic vectors (502 and 506) by analyzing segments. At least paragraphs 73, 85 and 127 teaches the use of using learning models to assist in analyzing the segments and the segments in the user history (see para. 46 which teaches using the system for analyzing user viewing history segments as well).  Thereafter, once the viewing history is processed, the user’s preference vector is generated as semantic vectors 504 (See paragraph 135-137 and Fig. 5)). The prior motivation as discussed above in claim 1 is incorporated herein.
Claims 8-9 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over Campbell (US 2017/0351922) in view of Chao et al. (US 202/0195983) and further in view of Cameron et al. (US 2018/0032305).
Regarding claim 8, Campbell and Chao teach the claimed subject matter as discussed in claim 1 above; however, they fail to teach, but Cameron teaches, the claimed wherein determining the nonlinear ordering of the multimodal fragments further comprises: determining, utilizing a pretrained embedding model, sentence coherence scores for sentences of the segments corresponding to the multimodal fragments; and ordering the multimodal fragments based on the sentence coherence scores (paragraphs 162-163 and 247-248 teach that algorithms and sub-algorithms are used to extract audio transcripts of multimedia content, to identify sentences, and to group the sentences based on the structure/coherence).
Campbell’s system utilizes audio transcripts (spoken words) to identify events of interest, and Cameron allows for the introduction of identifying sentences, grouping them based on their structure, and identifying their start and stop positions (paragraph 193). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the current application to incorporate the teachings of Cameron into Campbell such that Campbell, using its existing ability to identify events of interest, is also able to identify segments of the video based on the audio transcripts showing a group of sentences, because such an incorporation allows for the benefit of improving the system by giving text regions a particular importance over other regions (Cameron: paragraph 5).
Regarding claim 9, Campbell and Chao teach the claimed subject matter as discussed in claim 1 above; however, they fail to teach, but Cameron teaches, the claimed wherein parsing the target video into segments further comprises determining groups of sentences of an audio transcript of the target video based on cosine similarities between embedded sentences from the groups of sentences; and partitioning the target video based on the groups of sentences (paragraphs 162-163 and 247-248 teach that algorithms and sub-algorithms are used to extract audio transcripts of multimedia content, to identify sentences, and to group the sentences based on the structure).
Campbell’s system utilizes audio transcripts (spoken words) to identify events of interest and their corresponding video segments, and Cameron allows for the introduction of identifying sentences, grouping them based on their structure, and identifying their start and stop positions (paragraph 193). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the current application to incorporate the teachings of Cameron into Campbell such that Campbell, using its existing ability to identify events of interest, is also able to identify segments of the video based on the audio transcripts showing a group of sentences, because such an incorporation allows for the benefit of improving the system by giving text regions a particular importance over other regions (Cameron: paragraph 5).
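For illustration only, the limitation recited in claim 9 (grouping transcript sentences based on cosine similarities between embedded sentences) can be sketched as follows. This is a minimal sketch under stated assumptions and does not represent the method of Cameron or of the applicant: the function names, the rule of comparing only consecutive sentences, the 0.5 threshold, and the two-dimensional embedding values are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def group_sentences(embeddings, threshold=0.5):
    """Group consecutive sentence indices.

    A new group starts whenever the similarity between a sentence and the
    sentence immediately before it falls below the threshold.
    """
    if not embeddings:
        return []
    groups = [[0]]
    for i in range(1, len(embeddings)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

# Hypothetical sentence embeddings in transcript order.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = group_sentences(embeddings)
```

Each resulting group of sentence indices could then be mapped to the start and stop times of its sentences in order to partition the video.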
Regarding claim 11, Campbell and Chao teach the claimed subject matter as discussed in claim 1 above; however, they fail to teach, but Cameron teaches, the claimed wherein parsing the target video into the segments further comprises extracting audio transcripts from the target video; generating embedded sentences by utilizing a pretrained embedding model to encode sentences of the audio transcripts into an embedding space; and parsing the audio transcripts into groups of sentences based on semantic similarities between the embedded sentences (paragraphs 247-248 teach that algorithms and sub-algorithms are used to extract audio transcripts of multimedia content, to identify sentences, and to group the sentences based on the structure).
Campbell’s system utilizes audio transcripts (spoken words) to identify events of interest, and Cameron allows for the introduction of identifying sentences, grouping them based on their structure, and identifying their start and stop positions (paragraph 193). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the current application to incorporate the teachings of Cameron into Campbell such that Campbell, using its existing ability to identify events of interest, is also able to identify segments of the video based on the audio transcripts showing a group of sentences, because such an incorporation allows for the benefit of improving the system by giving text regions a particular importance over other regions (Cameron: paragraph 5).

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Campbell (US 2017/0351922) in view of Chao et al. (US 202/0195983) and further in view of Younessian (US 2020/0159759).
Regarding claim 10, Campbell and Chao teach the claimed subject matter as discussed in claim 1 above; however, they fail to teach, but Younessian teaches, the claimed wherein parsing the target video into segments comprises: 
identifying, based on differences between color histograms of consecutive frames of the target video, a set of related frames corresponding to a video shot; and selecting a median frame from the set of related frames as a keyframe for the video shot (paragraphs 44, 61-62, 72-73 and 85-86 teach that, for the system to select a keyframe for a given scene, the system selects a median frame of the scene that the keyframe is supposed to represent. Typically, keyframes utilize a first frame (meeting the claimed “first frame”) that meets an evaluation based on the color histogram (see paragraphs 62, 73 and 86). Paragraphs 44, 61, 72 and 85 teach more specifically that the keyframe that is selected is based on all of the frames that make up the scene itself and that the selected keyframe is a median frame for the particular scene).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the current application to incorporate the teachings of Younessian into the combined system of Campbell and Chao such that Campbell also utilizes color histograms to help select a first frame and to select the median frame of a scene as a keyframe, because such an incorporation allows for the benefit of improving the user friendliness of the system by allowing a user to identify a label for the segment of the content being viewed (Younessian: paragraph 2).
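For illustration only, the limitation recited in claim 10 (identifying a shot from differences between color histograms of consecutive frames and selecting a median frame as the keyframe) can be sketched as follows. This is a minimal sketch under stated assumptions and does not represent the method of Younessian or of the applicant: the function names, the use of a sum of absolute bin differences, the threshold of 5, the reading of "median frame" as the temporally middle frame of the shot, and the histogram values are all hypothetical.

```python
def histogram_difference(h1, h2):
    """Sum of absolute bin differences (L1 distance) between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def split_into_shots(histograms, threshold):
    """Group consecutive frame indices into shots.

    A new shot starts when the histogram difference between a frame and the
    frame immediately before it exceeds the threshold.
    """
    if not histograms:
        return []
    shots = [[0]]
    for i in range(1, len(histograms)):
        if histogram_difference(histograms[i - 1], histograms[i]) > threshold:
            shots.append([i])
        else:
            shots[-1].append(i)
    return shots

def median_keyframe(shot):
    """Pick the temporally middle frame index of a shot (assumed meaning of 'median')."""
    return shot[len(shot) // 2]

# Hypothetical 3-bin color histograms for five consecutive frames.
histograms = [[10, 0, 0], [9, 1, 0], [8, 2, 0], [0, 0, 10], [0, 1, 9]]

shots = split_into_shots(histograms, threshold=5)
keyframes = [median_keyframe(shot) for shot in shots]
```

In this toy data the large histogram change between the third and fourth frames marks the shot boundary.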

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Campbell (US 2017/0351922) in view of Chao et al. (US 202/0195983) and further in view of Shetty et al. (US 2016/0070962) and Olstad et al. (US 8,296,797).
Regarding claim 12, Campbell and Chao teach the claimed subject matter as discussed above; however, they fail to teach, but Shetty teaches, the claimed wherein generating multimodal fragments comprises: selecting a representative frame based on a semantic importance score of each frame of a segment of the segments relative to a video context of the target video and a comparison of a set of the sets of frame sentiment embeddings corresponding to the segment with the user preference vector, wherein a visual component for a multimodal fragment corresponding to the segment includes the representative frame (paragraphs 7, 41-42, 45, 48, 50, 52 and 56 teach that semantic scores are used to decide on a representative frame for a particular video clip/preview).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the current application to incorporate the teachings of Shetty into the proposed combination of Campbell and Chao because said incorporation allows for the benefit of improving the user experience by matching an object of interest in a matching search (Shetty: paragraphs 6-12).
However, Campbell, Chao and Shetty fail to teach, but Olstad teaches, the claimed selecting a representative text from a group of sentences extracted from an audio transcript of the video, wherein the textual component includes the representative text (see claim 22, which teaches that, based on a search query, it is determined whether the query terms are present inside “audio tracks of the matching videos” as well, and the selected thumbnail for the search result reflects the search query terms).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the current application to incorporate the teachings of Olstad into Campbell, Chao and Shetty because such an incorporation would allow for the benefit of an improved and more efficient video searching system (Olstad: col. 1, lines 35-61).

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Campbell (US 2017/0351922) in view of Chao et al. (US 202/0195983) and further in view of Batchu et al. (US 2017/0169128).
Regarding claim 18, Campbell and Chao teach the claimed subject matter as discussed in claims 1 and 14 above; however, they fail to teach, but Batchu teaches, the claimed further comprising instructions that, when executed by the at least one processor, cause the at least one processor to perform operations comprising generating a table of contents for multimodal fragments, wherein the table of contents links each multimodal frame of the multimodal fragments to a respective segment of the segments of the target video (paragraph 99).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the current application to incorporate the teachings of Batchu into the proposed combined system of Campbell and Chao such that, during the playback of the video summary, the user is able to see a table of contents, because such an incorporation allows for the benefit of improving the user friendliness of the system by allowing the user to be aware of the highlights in the associated video (Batchu: paragraphs 3 and 5-6).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to GELEK W TOPGYAL whose telephone number is (571)272-8891. The examiner can normally be reached M-F (9:30-6 PST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, William Vaughn, can be reached on 571-272-3922. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/GELEK W TOPGYAL/           Primary Examiner, Art Unit 2481