DETAILED ACTIONS
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
The amendment filed August 15th, 2022 has been entered. Claims 1-5 and 7-19 and new claims 21-22 remain pending in the application. Applicant’s amendments to the claims have overcome each and every objection previously set forth in the Non-Final Office Action mailed April 14th, 2022.

Response to Arguments
Applicant’s arguments with respect to claim 1 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.


Claims 1-2, 5, 7-11, 13, 15-16, and 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al. (US 20160014482 A1), hereinafter referred to as Chen, in view of Lin et al. (US 7760956 B2), hereinafter referred to as Lin, and in further view of Lu et al. ("A Bag-of-Importance Model With Locality-Constrained Coding Based Feature Learning for Video Summarization"), hereinafter referred to as Lu.

Regarding claim 1, Chen discloses a method for summarizing multimedia content (Title), comprising: 
receiving multimedia content (para. 0023, “obtaining a set of annotated video segments using a video summarization system”), wherein the multimedia content (para. 0023, “obtaining a set of annotated video segments using a video summarization system”) comprises a plurality of frames (para. 0023, “obtaining a set of annotated video segments using a video summarization system”, video has one or more frames) and each of the plurality of frames (para. 0023, “obtaining a set of annotated video segments using a video summarization system”, video has one or more frames) comprises one or more audio elements, one or more visual elements, and metadata (para. 0004, “Although the term “video content” references video information, the term is typically utilized to encompass a combination of video, audio, and text data. In many instances, video content can also include and/or reference sources of metadata”); 
extracting the one or more audio elements, the one or more visual elements, and the metadata from each of the plurality of frames of the multimedia content (para. 0023, “extracting a set of video clips from the set of annotated video segments based upon clipping cues using the video summarization system, where a video clip in the set of video clips includes at least one key feature, an audio channel, and metadata describing the length of the video clip”); 
retrieving a transcript of the multimedia content based on the one or more audio elements and the one or more visual elements (para. 0010, “a video clip in the set of video clips further includes an audio channel and the at least one key feature of each video clip includes a text transcript of the audio channel”); 
determining a plurality of keywords from the transcript (Fig. 27, step 2702, identify keywords using text and visual analysis of the video segments); 
mapping the plurality of keywords across each of the plurality of frames to generate a keyword mapping (Fig. 27, step 2704, generate inverted index mapping keywords to video segments); 
computing, for each of the plurality of frames, a plurality of sub-scores (para. 0175, “any of a variety of processes can be utilized to identify and score individual video clips extracted from a video segment for the purpose of combining video clips”) based on the keyword mapping (Fig. 27, step 2710, “score relevance of identified video segments to query”), the one or more audio elements (para. 0093, “a confidence score is associated with the timestamp assigned to a textual cue and the confidence score can also be considered in the determination of a segmentation boundary”, the textual cue is from the textual data, which is derived from the audio channel), the one or more visual elements (para. 0090, “factors including (but not limited to) the L1 distance, and the number of adjacent frames in which the anchor face are detected are utilized to generate a confidence score that can be used by a multi-modal segmentation process in combination with information concerning other cues to determine the likelihood of a transition indicative of a segmentation boundary”, a face is a visual element), and the metadata (para. 0093, “a confidence score is associated with the timestamp assigned to a textual cue and the confidence score can also be considered in the determination of a segmentation boundary”, the textual cue is from the textual data, which is metadata), wherein the plurality of sub-scores comprises a layout change score (para. 0182, “Scoring data be generated (2474) for each video clip based upon the extracted key features. Importance of a video clip can be determined based upon key features. In some embodiments, motion data, such as optical flow, motion vectors, or pixel differences between frames of a video clip can indicate importance”; in the Specification, the layout change score is defined as “image-based layout analysis may compute pixel-based similarity scores between sequential frames of the video and determine large differences in similarities scores between frames as representing a layout change” on page 51), a chapter score (para. 0195, “video segments are scored based upon a variety of factors including the number of related stories”, all the video segments are scored, and the segments correspond to chapters) and a topic score (para. 0181, “Scoring metrics can be any value assigned to a video clip that can represent the relative importance and/or relevance of a video clip as compared to other video clips with respect to a specific topic and/or subject”); 
generating an importance score for each of the plurality of frames (para. 0177, “the importance of video clips is scored in order to generate a relevant video summary sequence”);
generating a ranking of the plurality of frames based on the importance scores (para. 0181, “ordering of video segments can be achieved by generating scoring data”, para. 0177, “the importance of video clips is scored in order to generate a relevant video summary sequence”); 
determining one or more top-ranked frames from the ranking that satisfy an importance threshold (para. 0187, “score thresholds can be determined (2476) and can used to filter out video clips. Video clips that are scored below the threshold value can be dropped from the video summary sequence”, so only the ones that meet the threshold are kept); 
merging the one or more top-ranked frames into one or more moments based on a sequential similarity analysis of the determined one or more top-ranked frames (para. 0186, “video clips can be grouped by similarity. In a variety of embodiments, shots, text, and/or audio within video clips can be used to measure similarity. In a variety of embodiments, an integer linear programming optimization can be used to determine similar video clips. In several embodiments, similar video clips can be determined using techniques including (but not limited to) by applying thresholds to similarity measurements and/or using decision trees to determine similarity based upon similarity measurements”, “A reference video clip can be the video clip with the highest score in a grouping of similar video clips”), wherein the merging (para. 0186, “video clips can be grouped by similarity”) comprises aggregating (para. 0186, “video clips can be grouped by similarity”) one or more of the one or more audio elements, the one or more visual elements, and the metadata of each of the one or more top-ranked frames (para. 0023, “where a video clip in the set of video clips includes at least one key feature, an audio channel, and metadata describing the length of the video clip”); and 
aggregating the one or more moments into a final summarization of the multimedia content (para. 0174, “The extracted portions of the video segments can then be combined (2410) and encoded to create a video segment that is a summary of all of the related video segments”).
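
For illustration only, the ranking, thresholding, and merging limitations mapped above can be sketched in a few lines of Python. This is a minimal sketch and does not represent the implementation of Chen or of the Applicant. The function names, the example scores, the threshold value, and the use of temporal adjacency (`max_gap`) as a stand-in for the sequential similarity analysis are all assumptions made for the example.

```python
from typing import List, Tuple


def top_ranked(scores: List[float], threshold: float) -> List[int]:
    """Rank frame indices by score (best first) and keep those meeting the threshold."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if scores[i] >= threshold]


def merge_into_moments(frames: List[int], max_gap: int = 1) -> List[Tuple[int, int]]:
    """Group kept frames into (start, end) moments.

    Assumption: two kept frames belong to the same moment when their indices
    are at most max_gap apart. A real similarity analysis would compare content.
    """
    moments: List[Tuple[int, int]] = []
    for i in sorted(frames):
        if moments and i - moments[-1][1] <= max_gap:
            moments[-1] = (moments[-1][0], i)  # extend the current moment
        else:
            moments.append((i, i))  # start a new moment
    return moments


# Hypothetical per-frame importance scores.
scores = [0.2, 0.9, 0.8, 0.1, 0.7, 0.75, 0.3]
kept = top_ranked(scores, threshold=0.7)
moments = merge_into_moments(kept)
```

With these hypothetical inputs, frames 1, 2, 4, and 5 meet the threshold and are merged into two moments, which would then be concatenated to form the summary.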

Chen does not explicitly disclose wherein the layout change score measures a delta in visual arrangement of the one or more visual elements between adjacent frames. 
	However, Lin teaches wherein the layout change score measures a delta in visual arrangement of the one or more visual elements between adjacent frames (Lin teaches summarizing video streams in Col. 6, lines 13-15, “exemplary embodiments automatically extract representative key frames that summarize the video stream or clip with low frame redundancy”, Summary, “automatically extracting multiple frames from a video stream, based on frame content; enhancing resolution of images contained within each of the extracted frames, using information from neighboring frames”, Col. 13 lines 1-4, “the key frame selector 30 determines an importance score for each of the candidate key frames 18. The importance score of a candidate key frame is based on a set of characteristics of the candidate key frame.”, Col. 13 lines 22-35, “Another characteristic used to determine an importance score for a candidate key frame is based on moving objects in the candidate key frame. The key frame selector 30 credits a candidate key frame with M importance points if the candidate key frame includes a moving object having a size that is within a predetermined size range. The number M is determined by the position of the moving object in the candidate key frame in relation to the middle of the frame. The number M equals 3 if the moving object is in a predefined middle area range of the candidate key frame. The number M equals 2 if the moving object is in a predefined second-level area range of the candidate key frame. The number M equals 1 if the moving object is in a predefined third-level area range of the candidate key frame.”, Col. 17, lines 3-7, “the movements are represented by velocity vectors (dx/dt, dy/dt) that describe how quickly a pixel (or a group of pixels) is moving across an image, and the direction of pixel movement.”, the moving object corresponds to the visual elements and the movement corresponds to the visual arrangement or position of the moving object between the adjacent frames; since the moving object is changing position in each frame, there is a delta or change in visual arrangement, and Lin teaches that the score is based on the moving object in the frames). 
Chen and Lin are both considered to be analogous to the claimed invention because they are in the same field of video summarization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method as taught by Chen to incorporate the teachings of Lin wherein the layout change score measures a delta in visual arrangement of the one or more visual elements between adjacent frames. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to develop a visually pleasing layout (Lin, Col. 47, lines 24-25).
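
For illustration only, a score that measures a delta between adjacent frames can be sketched as a normalized mean absolute pixel difference. This is a minimal sketch under stated assumptions: frames are small grayscale grids with values from 0 to 255, and the function name `layout_change_scores` is hypothetical. It is not the method of Chen, Lin, or the Applicant, and it does not track moving objects as Lin does.

```python
from typing import List

Frame = List[List[int]]  # assumed: grayscale pixel rows, values 0..255


def layout_change_scores(frames: List[Frame]) -> List[float]:
    """Return one score in [0, 1] per adjacent frame pair.

    0.0 means the two frames are pixel-identical; 1.0 means every pixel
    changed by the full intensity range.
    """
    scores = []
    for prev, curr in zip(frames, frames[1:]):
        diffs = [
            abs(a - b)
            for row_prev, row_curr in zip(prev, curr)
            for a, b in zip(row_prev, row_curr)
        ]
        scores.append(sum(diffs) / (255 * len(diffs)))
    return scores


# Hypothetical 2x2 frames: no change, then the top row turns white.
frames = [
    [[0, 0], [0, 0]],
    [[0, 0], [0, 0]],
    [[255, 255], [0, 0]],
]
deltas = layout_change_scores(frames)
```

A large value for a frame pair would indicate a substantial change in the arrangement of visual content between those two frames.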

Chen does not explicitly disclose wherein generating the importance score comprises weighting each of the plurality of sub-scores according to predetermined weight values and aggregating the weighted sub-scores.
	However, Lu teaches wherein generating the importance score (Section I, page 1498, “BoI model provides a mechanism to exploit both inter-frame and intra-frame properties by quantifying the importance of the individual features representing the whole video”) comprises weighting each of the plurality of sub-scores (Section I, page 1498, “A video can be viewed as a collection of representativeness weighted features instead of equally important ones.”) according to predetermined weight values (Section III.B, page 1502, “predefined weight set for the spatial salience map”) and aggregating the weighted sub-scores (Abstract, “a video is characterized with a bag of local features weighted with individual importance scores and the frames with more important local features are more representative, where the representativeness of each frame is the aggregation of the weighted importance of the local features contained in the frame”).
	Chen and Lu are both considered to be analogous to the claimed invention because they are in the same field of video summarization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method as taught by Chen to incorporate the teachings of Lu wherein generating the importance score comprises weighting each of the plurality of sub-scores according to predetermined weight values and aggregating the weighted sub-scores. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been that Lu’s proposed video summarization approach is able to exploit both the inter-frame and intra-frame properties of feature representations and identify keyframes capturing both the dominant content and discriminative details within a video (Lu, Abstract).
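
For illustration only, weighting sub-scores according to predetermined weight values and aggregating the weighted sub-scores can be sketched as a weighted sum. This is a minimal sketch: the weight values, the sub-score names, and the function name `importance_score` are assumptions chosen for the example and are not taken from Chen, Lu, or the application.

```python
from typing import Dict

# Assumed predetermined weights, one per sub-score (hypothetical values).
WEIGHTS: Dict[str, float] = {"layout_change": 0.5, "chapter": 0.3, "topic": 0.2}


def importance_score(sub_scores: Dict[str, float],
                     weights: Dict[str, float] = WEIGHTS) -> float:
    """Aggregate the sub-scores of one frame into a single importance score."""
    return sum(weights[name] * sub_scores[name] for name in weights)


# Hypothetical sub-scores for a single frame.
frame_sub_scores = {"layout_change": 1.0, "chapter": 0.0, "topic": 0.5}
score = importance_score(frame_sub_scores)  # 0.5*1.0 + 0.3*0.0 + 0.2*0.5
```

Because the weights are fixed before any frame is scored, the same sub-scores always yield the same importance score.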

Regarding claim 2, the combination of Chen in view of Lin and in further view of Lu discloses the method for summarizing multimedia content of claim 1 (Chen, Title), wherein the plurality of sub-scores (Chen, para. 0023, “generating scoring data using a video summarization system, wherein the scoring data includes at least one scoring metric for each video clip in the set of video clips”) further comprises a spoken text change score (Chen, para. 0023, “generating scoring data using a video summarization system, wherein the scoring data includes at least one scoring metric for each video clip in the set of video clips, where the at least one scoring metric describes the at least one key feature of each video clip utilized to determine the relative importance of each video clip within the set of video clips wherein the at least one scoring metric includes at least one audio metric, at least one visual metric, and at least one textual metric”, one of the scoring metrics is a textual metric, which corresponds to the spoken text change score).

Regarding claim 3, the combination of Chen in view of Lin and in further view of Lu discloses the method for summarizing multimedia content of claim 2 (Chen, Title), wherein the plurality of sub-scores further comprises a speaker score (Chen, para. 0196, “search result scores can be personalized based upon similar factors to those discussed above with respect to the generation of personalized video playlists. In this way, the most relevant search result for a specific user can be informed by factors including (but not limited to) a user's preferences with respect to content source, anchor people, and/or actors”, actors and anchor people are speakers) and/or a visual text change score (Chen, para. 0127, “extracting relevant keywords from video segments for use in the annotation of video segments in accordance with embodiments of the invention are illustrated in FIGS. 13A-13D. FIG. 13A is a frame of video containing visual representations of text”, keywords are also scored in Fig. 27).

Regarding claim 5, the combination of Chen in view of Lin and in further view of Lu discloses the method for summarizing multimedia content of claim 1 (Chen, Title), wherein: 
the multimedia content (Chen, para. 0023, “obtaining a set of annotated video segments using a video summarization system”) is received from a user (Chen, para. 0071, “the user interface provides the user with the ability to select video segments”); and 
the method (Chen, Title) further comprises transmitting the final summarization to the user (Chen, para. 0006, “providing the generated video summary sequence in response to a request for a video summary sequence using the video summarization system”).

Regarding claim 7, the combination of Chen in view of Lin and in further view of Lu discloses the method (Chen, Title) for summarizing multimedia content of claim 1, wherein the plurality of keywords are determined using a frequency-based keyword extraction method (Chen, para. 0183, “text keyword frequency can be an indicator of clip importance”, “Words with high tf-idf scores can be determined to be important keywords. Video clips containing important keywords can be determined to be relatively important compared with video clips that do not contain keywords. In many embodiments, multi-modal processes can be used to score video clips.”).
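
For illustration only, a frequency-based keyword extraction of the tf-idf kind quoted above can be sketched as follows. This is a minimal sketch: the whitespace tokenization, the unsmoothed logarithmic idf, the alphabetical tie-break, and the function name `tfidf_keywords` are assumptions, and the transcripts are invented for the example. It is not Chen's implementation.

```python
import math
from collections import Counter
from typing import List


def tfidf_keywords(docs: List[str], top_k: int = 3) -> List[List[str]]:
    """Return the top_k highest tf-idf terms for each document (e.g. clip transcript)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    keywords = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        keywords.append(sorted(scores, key=lambda t: (-scores[t], t))[:top_k])
    return keywords


# Hypothetical clip transcripts.
transcripts = [
    "the budget vote the budget",
    "the storm warning",
    "the election vote",
]
top_terms = tfidf_keywords(transcripts, top_k=1)
```

A term that appears in every transcript, such as "the" here, receives an idf of zero and therefore never outranks a term that is frequent in only one transcript.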

Regarding claim 8, the combination of Chen in view of Lin and in further view of Lu discloses the method for summarizing multimedia content of claim 4 (Chen, Title), wherein the sequential similarity analysis (Chen, para. 0186, “video clips can be grouped by similarity. In a variety of embodiments, shots, text, and/or audio within video clips can be used to measure similarity. In a variety of embodiments, an integer linear programming optimization can be used to determine similar video clips. In several embodiments, similar video clips can be determined using techniques including (but not limited to) by applying thresholds to similarity measurements and/or using decision trees to determine similarity based upon similarity measurements”, “A reference video clip can be the video clip with the highest score in a grouping of similar video clips”), wherein the merging (Chen, para. 0186, “video clips can be grouped by similarity”) comprises computing one or more Word Mover's Distance values (Applicant defines Word Mover’s Distance in the Specification as “a WMD may refer to a measure of the distance between two distributions over a region D. In various embodiments, the system (using WMD, or the like) quantifies the number of edits and/or amount of changes required to transfer one data point (e.g., in a multi-dimensional representation) into another data point. Thus, WMD, or the like may serve as a contextual distance measure that is used by the system for similarity and/or dis-similarity measurements between frames or other segments of original content”; Chen describes in para. 0194, “an inverted index mapping keywords to video segments. When a search query is received (2706), keywords can be extracted from text, an image, and/or a video sequence provided as part of the search query and the keywords used to identify (2708) relevant videos from the inverted index. As noted above, a search can also be performed for one or more image portions within the frames of the indexed video segments. The relevancy of the identified video segments can be scored (2710) and search results including a listing of one or more video segments can be returned. In several embodiments, the process of annotating the video segments includes identifying additional sources of relevant data and links to the additional sources of relevant data and/or excerpts of relevant data can be returned with the search results”; Chen uses the search query to find similarities between video segments) from the keyword mapping (Chen, para. 0194, “an inverted index mapping keywords to video segments”).
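
For illustration only, an inverted index mapping keywords to video segments, with a simple relevance score for a query, can be sketched as follows. This is a minimal sketch and not Chen's implementation: the segment identifiers, the keyword lists, the function names, and the scoring rule (count of matching query keywords) are assumptions. The sketch does not compute a Word Mover's Distance.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_inverted_index(segment_keywords: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each keyword to the set of segment ids in which it occurs."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for segment_id, keywords in segment_keywords.items():
        for keyword in keywords:
            index[keyword.lower()].add(segment_id)
    return index


def search(index: Dict[str, Set[str]], query_keywords: Iterable[str]) -> List[str]:
    """Return segment ids ordered by how many query keywords they match."""
    hits: Dict[str, int] = {}
    for keyword in query_keywords:
        for segment_id in index.get(keyword.lower(), ()):
            hits[segment_id] = hits.get(segment_id, 0) + 1
    # Most matches first; ties broken by segment id for determinism.
    return sorted(hits, key=lambda s: (-hits[s], s))


# Hypothetical annotated segments.
segments = {
    "seg1": ["election", "vote"],
    "seg2": ["storm"],
    "seg3": ["vote"],
}
index = build_inverted_index(segments)
results = search(index, ["vote", "election"])
```

Here the query matches two keywords in "seg1" and one in "seg3", so "seg1" is listed first and "seg2" is not returned.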

Regarding claim 9, Chen discloses a system for summarizing multimedia content (Fig. 24A), comprising: 
at least one server (Fig. 2) configured for receiving multimedia content (para. 0023, “obtaining a set of annotated video segments using a video summarization system”), wherein the multimedia content (para. 0023, “obtaining a set of annotated video segments using a video summarization system”) comprises a plurality of frames (para. 0023, “obtaining a set of annotated video segments using a video summarization system”, video has one or more frames) and each of the plurality of frames (para. 0023, “obtaining a set of annotated video segments using a video summarization system”, video has one or more frames) comprises one or more audio elements, one or more visual elements, and metadata (para. 0004, “Although the term “video content” references video information, the term is typically utilized to encompass a combination of video, audio, and text data. In many instances, video content can also include and/or reference sources of metadata”); 
at least one processor (Fig. 24A, processor 2491) configured for: 
extracting the one or more audio elements, the one or more visual elements, and the metadata from each of the plurality of frames of the multimedia content (para. 0023, “extracting a set of video clips from the set of annotated video segments based upon clipping cues using the video summarization system, where a video clip in the set of video clips includes at least one key feature, an audio channel, and metadata describing the length of the video clip”); 
retrieving a transcript of the multimedia content based on the one or more audio elements and the one or more visual elements (para. 0010, “a video clip in the set of video clips further includes an audio channel and the at least one key feature of each video clip includes a text transcript of the audio channel”); 
determining a plurality of keywords from the transcript (Fig. 27, step 2702, identify keywords using text and visual analysis of the video segments); 
mapping the plurality of keywords across each of the plurality of frames to generate a keyword mapping (Fig. 27, step 2704, generate inverted index mapping keywords to video segments); 
computing, for each of the plurality of frames, a plurality of sub-scores (para. 0175, “any of a variety of processes can be utilized to identify and score individual video clips extracted from a video segment for the purpose of combining video clips”) based on the keyword mapping (Fig. 27, step 2710, “score relevance of identified video segments to query”), the one or more audio elements (para. 0093, “a confidence score is associated with the timestamp assigned to a textual cue and the confidence score can also be considered in the determination of a segmentation boundary”, the textual cue is from the textual data, which is derived from the audio channel), the one or more visual elements (para. 0090, “factors including (but not limited to) the L1 distance, and the number of adjacent frames in which the anchor face are detected are utilized to generate a confidence score that can be used by a multi-modal segmentation process in combination with information concerning other cues to determine the likelihood of a transition indicative of a segmentation boundary”, a face is a visual element), and the metadata (para. 0093, “a confidence score is associated with the timestamp assigned to a textual cue and the confidence score can also be considered in the determination of a segmentation boundary”, the textual cue is from the textual data, which is metadata), wherein the plurality of sub-scores comprises a layout change score (para. 0182, “Scoring data be generated (2474) for each video clip based upon the extracted key features. Importance of a video clip can be determined based upon key features. In some embodiments, motion data, such as optical flow, motion vectors, or pixel differences between frames of a video clip can indicate importance”; in the Specification, the layout change score is defined as “image-based layout analysis may compute pixel-based similarity scores between sequential frames of the video and determine large differences in similarities scores between frames as representing a layout change” on page 51), a chapter score (para. 0195, “video segments are scored based upon a variety of factors including the number of related stories”, all the video segments are scored, and the segments correspond to chapters) and a topic score (para. 0181, “Scoring metrics can be any value assigned to a video clip that can represent the relative importance and/or relevance of a video clip as compared to other video clips with respect to a specific topic and/or subject”); 
generating an importance score for each of the plurality of frames (para. 0177, “the importance of video clips is scored in order to generate a relevant video summary sequence”);
generating a ranking of the plurality of frames based on the importance scores (para. 0181, “ordering of video segments can be achieved by generating scoring data”, para. 0177, “the importance of video clips is scored in order to generate a relevant video summary sequence”); 
determining one or more top-ranked frames from the ranking that satisfy an importance threshold (para. 0187, “score thresholds can be determined (2476) and can used to filter out video clips. Video clips that are scored below the threshold value can be dropped from the video summary sequence”, so only the ones that meet the threshold are kept); 
merging the one or more top-ranked frames into one or more moments based on a sequential similarity analysis of the determined one or more top-ranked frames (para. 0186, “video clips can be grouped by similarity. In a variety of embodiments, shots, text, and/or audio within video clips can be used to measure similarity. In a variety of embodiments, an integer linear programming optimization can be used to determine similar video clips. In several embodiments, similar video clips can be determined using techniques including (but not limited to) by applying thresholds to similarity measurements and/or using decision trees to determine similarity based upon similarity measurements”, “A reference video clip can be the video clip with the highest score in a grouping of similar video clips”), wherein the merging (para. 0186, “video clips can be grouped by similarity”) comprises aggregating (para. 0186, “video clips can be grouped by similarity”) one or more of the one or more audio elements, the one or more visual elements, and the metadata of each of the one or more top-ranked frames (para. 0023, “where a video clip in the set of video clips includes at least one key feature, an audio channel, and metadata describing the length of the video clip”); and 
aggregating the one or more moments into a final summarization of the multimedia content (para. 0174, “The extracted portions of the video segments can then be combined (2410) and encoded to create a video segment that is a summary of all of the related video segments”).

Chen does not explicitly disclose wherein the layout change score measures a delta in visual arrangement of the one or more visual elements between adjacent frames. 
	However, Lin teaches wherein the layout change score measures a delta in visual arrangement of the one or more visual elements between adjacent frames (Lin teaches summarizing video streams in Col. 6, lines 13-15, “exemplary embodiments automatically extract representative key frames that summarize the video stream or clip with low frame redundancy”, Summary, “automatically extracting multiple frames from a video stream, based on frame content; enhancing resolution of images contained within each of the extracted frames, using information from neighboring frames”, Col. 13 lines 1-4, “the key frame selector 30 determines an importance score for each of the candidate key frames 18. The importance score of a candidate key frame is based on a set of characteristics of the candidate key frame.”, Col. 13 lines 22-35, “Another characteristic used to determine an importance score for a candidate key frame is based on moving objects in the candidate key frame. The key frame selector 30 credits a candidate key frame with M importance points if the candidate key frame includes a moving object having a size that is within a predetermined size range. The number M is determined by the position of the moving object in the candidate key frame in relation to the middle of the frame. The number M equals 3 if the moving object is in a predefined middle area range of the candidate key frame. The number M equals 2 if the moving object is in a predefined second-level area range of the candidate key frame. The number M equals 1 if the moving object is in a predefined third-level area range of the candidate key frame.”, Col. 17, lines 3-7, “the movements are represented by velocity vectors (dx/dt, dy/dt) that describe how quickly a pixel (or a group of pixels) is moving across an image, and the direction of pixel movement.”, the moving object corresponds to the visual elements and the movement corresponds to the visual arrangement or position of the moving object between the adjacent frames; since the moving object is changing position in each frame, there is a delta or change in visual arrangement, and Lin teaches that the score is based on the moving object in the frames). 
Chen and Lin are both considered to be analogous to the claimed invention because they are in the same field of video summarization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system as taught by Chen to incorporate the teachings of Lin wherein the layout change score measures a delta in visual arrangement of the one or more visual elements between adjacent frames. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to develop a visually pleasing layout (Lin, Col. 47, lines 24-25).


Chen does not explicitly disclose wherein generating the importance score comprises weighting each of the plurality of sub-scores according to predetermined weight values and aggregating the weighted sub-scores.
	However, Lu teaches wherein generating the importance score (Section I, page 1498, “BoI model provides a mechanism to exploit both inter-frame and intra-frame properties by quantifying the importance of the individual features representing the whole video”) comprises weighting each of the plurality of sub-scores (Section I, page 1498, “A video can be viewed as a collection of representativeness weighted features instead of equally important ones.”) according to predetermined weight values (Section III.B, page 1502, “predefined weight set for the spatial salience map”) and aggregating the weighted sub-scores (Abstract, “a video is characterized with a bag of local features weighted with individual importance scores and the frames with more important local features are more representative, where the representativeness of each frame is the aggregation of the weighted importance of the local features contained in the frame”).
	Chen and Lu are both considered to be analogous to the claimed invention because they are in the same field of video summarization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system as taught by Chen to incorporate the teachings of Lu wherein generating the importance score comprises weighting each of the plurality of sub-scores according to predetermined weight values and aggregating the weighted sub-scores. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been that Lu’s proposed video summarization approach is able to exploit both the inter-frame and intra-frame properties of feature representations and identify keyframes capturing both the dominant content and discriminative details within a video (Lu, Abstract).

Regarding claim 10, the combination of Chen in view of Lin and in further view of Lu discloses the system for summarizing multimedia content of claim 9 (Chen, Fig. 24A), wherein the plurality of sub-scores further comprises a spoken text change score (Chen, para. 0023, “generating scoring data using a video summarization system, wherein the scoring data includes at least one scoring metric for each video clip in the set of video clips, where the at least one scoring metric describes the at least one key feature of each video clip utilized to determine the relative importance of each video clip within the set of video clips wherein the at least one scoring metric includes at least one audio metric, at least one visual metric, and at least one textual metric”, one of the scoring metrics is a textual metric, which corresponds to the spoken text change score).

Regarding claim 11, the combination of Chen in view of Lin and in further view of Lu discloses the system for summarizing multimedia content of claim 10 (Chen, Fig. 24A), wherein the plurality of sub-scores further comprises a speaker score (Chen, para. 0196, “search result scores can be personalized based upon similar factors to those discussed above with respect to the generation of personalized video playlists. In this way, the most relevant search result for a specific user can be informed by factors including (but not limited to) a user's preferences with respect to content source, anchor people, and/or actors”, actors and anchor people are the speakers) and a visual text change score (Chen, para. 0127, “extracting relevant keywords from video segments for use in the annotation of video segments in accordance with embodiments of the invention are illustrated in FIGS. 13A-13D. FIG. 13A is a frame of video containing visual representations of text”, keywords are also scored in Fig. 27).

Regarding claim 13, the combination of Chen in view of Lin and in further view of Lu discloses the system for summarizing multimedia content of claim 9 (Chen, Fig. 24A), wherein: 
the multimedia content (Chen, para. 0023, “obtaining a set of annotated video segments using a video summarization system”) is received from a user (Chen, para. 0071, “the user interface provides the user with the ability to select video segments”); and 
the method (Chen, Title) further comprises transmitting the final summarization to the user (Chen, para. 0006, “providing the generated video summary sequence in response to a request for a video summary sequence using the video summarization system”).

Regarding claim 15, the combination of Chen in view of Lin and in further view of Lu discloses the system for summarizing multimedia content of claim 9 (Chen, Fig. 24A), wherein the plurality of keywords are determined using a frequency-based keyword extraction method (Chen, para. 0183, “text keyword frequency can be an indicator of clip importance”, “Words with high tf-idf scores can be determined to be important keywords. Video clips containing important keywords can be determined to be relatively important compared with video clips that do not contain keywords. In many embodiments, multi-modal processes can be used to score video clips.”).
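For illustration only, a frequency-based keyword extraction of the tf-idf kind mentioned in Chen, para. 0183 can be sketched as follows. This is a minimal sketch under stated assumptions, not Chen's implementation: whitespace tokenization, the unsmoothed logarithmic idf, and the function name `tf_idf_keywords` are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf_keywords(documents, top_k=2):
    """Return the top_k highest tf-idf terms for each document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # tf = relative term frequency; idf = log(N / df), assumed unsmoothed here.
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_k])
    return results
```

Terms that appear in every transcript receive a score of zero under this idf and therefore rank last, which is the behavior that makes frequent but uninformative words poor keywords.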

Regarding claim 16, the combination of Chen in view of Lin and in further view of Lu discloses the system for summarizing multimedia content of claim 15 (Chen, Fig. 24A), wherein the sequential similarity analysis (Chen, para. 0186, “video clips can be grouped by similarity. In a variety of embodiments, shots, text, and/or audio within video clips can be used to measure similarity. In a variety of embodiments, an integer linear programming optimization can be used to determine similar video clips. In several embodiments, similar video clips can be determined using techniques including (but not limited to) by applying thresholds to similarity measurements and/or using decision trees to determine similarity based upon similarity measurements”, “A reference video clip can be the video clip with the highest score in a grouping of similar video clips”), wherein the merging (Chen, para. 0186, “video clips can be grouped by similarity”) comprises computing one or more Word Mover's Distance values (Applicant defines Word Mover’s Distance as “a WMD may refer to a measure of the distance between two distributions over a region D. In various embodiments, the system (using WMD, or the like) quantifies the number of edits and/or amount of changes required to transfer one data point (e.g., in a multi-dimensional representation) into another data point. Thus, WMD, or the like may serve as a contextual distance measure that is used by the system for similarity and/or dis-similarity measurements between frames or other segments of original content” in the Specification; Chen describes in para. 0194, “an inverted index mapping keywords to video segments. When a search query is received (2706), keywords can be extracted from text, an image, and/or a video sequence provided as part of the search query and the keywords used to identify (2708) relevant videos from the inverted index. As noted above, a search can also be performed for one or more image portions within the frames of the indexed video segments. The relevancy of the identified video segments can be scored (2710) and search results including a listing of one or more video segments can be returned. In several embodiments, the process of annotating the video segments includes identifying additional sources of relevant data and links to the additional sources of relevant data and/or excerpts of relevant data can be returned with the search results”; Chen uses the search query to find similarities between video segments) from the keyword mapping (Chen, para. 0194, “an inverted index mapping keywords to video segments”).

Regarding claim 21, the combination of Chen in view of Lin and in further view of Lu discloses the method of claim 1 (Chen, Title) further comprising generating a multi-dimensional representation of each of the plurality of frames (Lin, Col. 16 lines 59-63, “motion estimation module 14 computes movements of individual pixels or groups of pixels from a given base image to a neighboring base image based on a non-parametric optical flow model (or dense motion model)”, Col 17 lines 7-10, “optical flow model represents a projection of three-dimensional object motion onto the image sensor's two-dimensional image plane”, the three-dimensional projection corresponds to the multi-dimensional representation), wherein: 
the multi-dimensional representation comprises the one or more visual elements and a frame-corresponding portion of the transcript (Lin, Col. 10 lines 65-67, “object motion analyzer examines the trajectories of moving objects in the video stream 12 by comparing small-grid color layouts in the video frames”, Col. 16 lines 59-63, “motion estimation module 14 computes movements of individual pixels or groups of pixels from a given base image to a neighboring base image based on a non-parametric optical flow model (or dense motion model)”, Col. 17 lines 7-10, “optical flow model represents a projection of three-dimensional object motion onto the image sensor's two-dimensional image plane”, the three-dimensional projection corresponds to the multi-dimensional representation, the moving object corresponds to the visual elements; Chen discloses a transcript corresponding to each frame in para. 0010); and 
the layout change score is further based upon a comparison between the respective multi-dimensional representations of adjacent frames (Lin teaches summarizing video streams in Col. 6, lines 13-15, “exemplary embodiments automatically extract representative key frames that summarize the video stream or clip with low frame redundancy”, Summary, “automatically extracting multiple frames from a video stream, based on frame content; enhancing resolution of images contained within each of the extracted frames, using information from neighboring frames”, Col. 13 lines 1-4, “the key frame selector 30 determines an importance score for each of the candidate key frames 18. The importance score of a candidate key frame is based on a set of characteristics of the candidate key frame.”, Col. 13 lines 22-35, “Another characteristic used to determine an importance score for a candidate key frame is based on moving objects in the candidate key frame. The key frame selector 30 credits a candidate key frame with M importance points if the candidate key frame includes a moving object having a size that is within a predetermined size range. The number M is determined by the position of the moving object in the candidate key frame in relation to the middle of the frame. The number M equals 3 if the moving object is in a predefined middle area range of the candidate key frame. The number M equals 2 if the moving object is in a predefined second-level area range of the candidate key frame. The number M equals 1 if the moving object is in a predefined third-level area range of the candidate key frame.”, Col. 17, lines 3-11, “the movements are represented by velocity vectors (dx/dt, dy/dt) that describe how quickly a pixel (or a group of pixels) is moving across an image, and the direction of pixel movement. The optical flow model represents a projection of three-dimensional object motion onto the image sensor's two-dimensional image plane. Any one of a wide variety of optical flow computation methods can be used by the motion estimation module 14 to compute motion vectors.”, the moving object corresponds to the visual elements and the movement corresponds to the visual arrangement or position of the moving object between the adjacent frames, since the moving object is changing position in each frame, there is a delta or change in visual arrangement, Lin teaches that the score is based on the moving object in the frames, the optical flow is the three-dimensional representation of the moving object).
Chen and Lin are both considered to be analogous to the claimed invention because they are in the same field of video summarization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system as taught by Chen to incorporate the teachings of Lin of generating a multi-dimensional representation of each of the plurality of frames. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to develop a visually pleasing layout (Lin, Col. 47, lines 24-25).

Regarding claim 22, the combination of Chen in view of Lin and in further view of Lu discloses the method of claim 1 (Chen, Title), wherein computing the delta in visual arrangement of the one or more visual elements (Lin teaches summarizing video streams in Col. 6, lines 13-15, “exemplary embodiments automatically extract representative key frames that summarize the video stream or clip with low frame redundancy”, Summary, “automatically extracting multiple frames from a video stream, based on frame content; enhancing resolution of images contained within each of the extracted frames, using information from neighboring frames”, Col. 13 lines 1-4, “the key frame selector 30 determines an importance score for each of the candidate key frames 18. The importance score of a candidate key frame is based on a set of characteristics of the candidate key frame.”, Col. 13 lines 22-35, “Another characteristic used to determine an importance score for a candidate key frame is based on moving objects in the candidate key frame. The key frame selector 30 credits a candidate key frame with M importance points if the candidate key frame includes a moving object having a size that is within a predetermined size range. The number M is determined by the position of the moving object in the candidate key frame in relation to the middle of the frame. The number M equals 3 if the moving object is in a predefined middle area range of the candidate key frame. The number M equals 2 if the moving object is in a predefined second-level area range of the candidate key frame. The number M equals 1 if the moving object is in a predefined third-level area range of the candidate key frame.”, Col. 17, lines 3-11, “the movements are represented by velocity vectors (dx/dt, dy/dt) that describe how quickly a pixel (or a group of pixels) is moving across an image, and the direction of pixel movement. The optical flow model represents a projection of three-dimensional object motion onto the image sensor's two-dimensional image plane. Any one of a wide variety of optical flow computation methods can be used by the motion estimation module 14 to compute motion vectors.”, the moving object corresponds to the visual elements and the movement corresponds to the visual arrangement or position of the moving object between the adjacent frames, since the moving object is changing position in each frame, there is a delta or change in visual arrangement, Lin teaches that the score is based on the moving object in the frames, the optical flow is the three-dimensional representation of the moving object) comprises determining, between the adjacent frames (Lin, Summary, “automatically extracting multiple frames from a video stream, based on frame content; enhancing resolution of images contained within each of the extracted frames, using information from neighboring frames”), a spatial change of respective pixels corresponding to each of the one or more visual elements (Lin, Col. 13 lines 22-35, “Another characteristic used to determine an importance score for a candidate key frame is based on moving objects in the candidate key frame. The key frame selector 30 credits a candidate key frame with M importance points if the candidate key frame includes a moving object having a size that is within a predetermined size range. The number M is determined by the position of the moving object in the candidate key frame in relation to the middle of the frame. The number M equals 3 if the moving object is in a predefined middle area range of the candidate key frame. The number M equals 2 if the moving object is in a predefined second-level area range of the candidate key frame. The number M equals 1 if the moving object is in a predefined third-level area range of the candidate key frame.”, Col. 16 lines 59-63, “motion estimation module 14 computes movements of individual pixels or groups of pixels from a given base image to a neighboring base image based on a non-parametric optical flow model (or dense motion model)”).


Claims 4 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Lin and in further view of Lu in further view of Fajtl et al. ("Summarizing Videos with Attention"), hereinafter referred to as Fajtl.

Regarding claim 4, the combination of Chen in view of Lin and in further view of Lu discloses the method for summarizing multimedia content of claim 1 (Chen, Title), wherein the predetermined weight values (Lu, Section III.B, page 1502, “predefined weight set for the spatial salience map”) are machine learned (Lu, Section I, page 1499, “We address the video summarization problem from the feature learning perspective. We utilize the unsupervised feature learning approach (by employing the locality-constrained linear coding method) to subsequently feed in the summarization pipelines with the learned features, and obtain superior performance in comparison with handcrafted features having been used in traditional methods.”, Section II, page 1499, “From the feature learning perspective, we investigate the importance of the individual features and model their importance to the representativeness of the video content”).

The combination of Chen in view of Lin and in further view of Lu does not explicitly disclose wherein the predetermined weight values are cross-validation optimized weight values.
However, Fajtl teaches wherein the predetermined weight values (Section 2.1, “attention is based on an idea that the neural network can learn how important various samples in a sequence, or image regions, are with respect to the desired output state. These importance values are defined as attention weights and are commonly estimated simultaneously with other model parameters trained for a specific objective”) are cross-validation optimized weight values (Section 4.2, “use a 5-fold cross validation for both, canonical and augmented settings”).
Fajtl is considered to be analogous to the claimed invention because it is in the same field of video summarization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method as taught by the combination of Chen in view of Lin and in further view of Lu to incorporate the teachings of Fajtl wherein the predetermined weight values are cross-validation optimized weight values. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to see a clear correlation between the ground truth and machine summary, confirming the quality of the method (Fajtl, Section 5.1).

Regarding claim 12, the combination of Chen in view of Lin and in further view of Lu discloses the system for summarizing multimedia content of claim 9 (Chen, Fig. 24A), wherein the predetermined weight values (Lu, Section III.B, page 1502, “predefined weight set for the spatial salience map”) are machine learned (Lu, Section I, page 1499, “We address the video summarization problem from the feature learning perspective. We utilize the unsupervised feature learning approach (by employing the locality-constrained linear coding method) to subsequently feed in the summarization pipelines with the learned features, and obtain superior performance in comparison with handcrafted features having been used in traditional methods.”, Section II, page 1499, “From the feature learning perspective, we investigate the importance of the individual features and model their importance to the representativeness of the video content”).

The combination of Chen in view of Lin and in further view of Lu does not explicitly disclose wherein the predetermined weight values are cross-validation optimized weight values.
However, Fajtl teaches wherein the predetermined weight values (Section 2.1, “attention is based on an idea that the neural network can learn how important various samples in a sequence, or image regions, are with respect to the desired output state. These importance values are defined as attention weights and are commonly estimated simultaneously with other model parameters trained for a specific objective”) are cross-validation optimized weight values (Section 4.2, “use a 5-fold cross validation for both, canonical and augmented settings”).
Fajtl is considered to be analogous to the claimed invention because it is in the same field of video summarization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system as taught by the combination of Chen in view of Lin and in further view of Lu to incorporate the teachings of Fajtl wherein the predetermined weight values are cross-validation optimized weight values. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to see a clear correlation between the ground truth and machine summary, confirming the quality of the method (Fajtl, Section 5.1).

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Chen in view of Lin and in further view of Lu in further view of Mani (US 20190258660 A1), hereinafter referred to as Mani (previously cited by applicant in IDS).

Regarding claim 14, the combination of Chen in view of Lin and in further view of Lu discloses the system for summarizing multimedia content of claim 9 (Chen, Fig. 24A).

The combination of Chen in view of Lin and in further view of Lu does not explicitly disclose wherein the transcript is retrieved from an external service.
	However, Mani teaches wherein the transcript is retrieved from an external service (para. 0044, “The ASR module 304 receives the audio input either directly or via the extraction module 302 and generates a text transcript of the received audio file”, para. 0046, “It may be appreciated that while the extraction module 302 and the ASR module 304 are shown as part of the sequence generating module 202, it may be appreciated that this is not necessary and that one or more of these modules may be remote from the multimedia summarization system 100 and accessible via a network. For example, the ASR module 304 can comprise open source tools such as, Sphinx, HTC and the like”).
Mani is considered to be analogous to the claimed invention because it is in the same field of video summarization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the system as taught by the combination of Chen in view of Lin and in further view of Lu to incorporate the teachings of Mani wherein the transcript is retrieved from an external service. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been because the external service can produce a text transcript that is in standard form that can be used by media players (Mani, para. 0044).

Claims 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Fajtl in view of Lin and in further view of Chen.

Regarding claim 17, Fajtl discloses a process for training a machine learning model (Abstract, “we propose a simple, self-attention based network for video summarization which performs the entire sequence to sequence transformation in a single feed forward pass and single backward pass during training”) for summarizing multimedia content (Title), comprising: 
generating at least one machine learning model (Fig. 2), wherein the generating comprises initializing a plurality of weight values (Fig. 1, “self-attention network generates weights for all input features”), each weight value (Fig. 1, “self-attention network generates weights for all input features”) associated with one of a plurality of analysis modalities (Section I, page 2, “features such as semantic and pixel intensities”); 
retrieving a training dataset (Section 4.1 and 4.2, TvSum and SumMe datasets) comprising multimedia content comprising a plurality of frames, wherein each of the plurality of frames comprises one or more visual elements (Section I, “Personal videos, video lectures, video diaries, video messages”, the videos or multimedia content example that was given by Fajtl, video consists of several frames with visual elements), a first final summarization of the multimedia content (Section 5, the results of VASNet, which is the proposed video summarization method, are compared with the TvSum and SumMe datasets), and a plurality of sub-scores for each of the plurality of frames (Section 4.2, “trained using frame-level scores”); 
training the at least one machine learning model to output a final summarization of the multimedia content (Abstract, “we propose a simple, self-attention based network for video summarization which performs the entire sequence to sequence transformation in a single feed forward pass and single backward pass during training”), wherein the training comprises: 
executing the at least one machine learning model (Fig. 2) to generate an importance score for each of the plurality of frames (Section 3, “Architecture proposed in this work replaces entirely the LSTM encoder-decoder network with the soft, self-attention and a two layer, fully connected network for regression of the frame importance score”, Section I, “Personal videos, video lectures, video diaries, video messages”, the videos or multimedia content example that was given by Fajtl) wherein generating the importance score (Section 3, “Architecture proposed in this work replaces entirely the LSTM encoder-decoder network with the soft, self-attention and a two layer, fully connected network for regression of the frame importance score”, Section I, “Personal videos, video lectures, video diaries, video messages”, the videos or multimedia content example that was given by Fajtl) comprises aggregating, for each of the plurality of frames (Section I, “Personal videos, video lectures, video diaries, video messages”, the videos or multimedia content example that was given by Fajtl), the plurality of sub-scores according to the plurality of weight values (Section 2.1, “importance values are defined as attention weights and are commonly estimated simultaneously with other model parameters trained for a specific objective”, there is an attention weight for each input feature, which means there is a plurality of sub-scores corresponding to the input features);
computing an error metric by comparing the second final summarization to the first final summarization (Section 4.3, F-score is calculated which can test the error); 
determining that the error metric (Section 4.3, F-score is calculated which can test the error) does not satisfy an error threshold (Fig. 4); and 
adjusting one or more of the plurality of weight values (Section 3, weights learned during training) towards reducing the error metric (Section 3.2, “We use 50% dropout and L2 = 10^-5 regularization. Training is done over 200 epochs. Model with the highest validation F-score is then selected”).

Fajtl does not explicitly disclose wherein the plurality of sub-scores comprises a layout change score that measures a delta in arrangement of the one or more visual elements of adjacent frames. 
	However, Lin teaches wherein the plurality of sub-scores comprises a layout change score that measures a delta in arrangement of the one or more visual elements of adjacent frames (Lin teaches summarizing video streams in Col. 6, lines 13-15, “exemplary embodiments automatically extract representative key frames that summarize the video stream or clip with low frame redundancy”, Summary, “automatically extracting multiple frames from a video stream, based on frame content; enhancing resolution of images contained within each of the extracted frames, using information from neighboring frames”, Col. 13 lines 1-4, “the key frame selector 30 determines an importance score for each of the candidate key frames 18. The importance score of a candidate key frame is based on a set of characteristics of the candidate key frame.”, Col. 13 lines 22-35, “Another characteristic used to determine an importance score for a candidate key frame is based on moving objects in the candidate key frame. The key frame selector 30 credits a candidate key frame with M importance points if the candidate key frame includes a moving object having a size that is within a predetermined size range. The number M is determined by the position of the moving object in the candidate key frame in relation to the middle of the frame. The number M equals 3 if the moving object is in a predefined middle area range of the candidate key frame. The number M equals 2 if the moving object is in a predefined second-level area range of the candidate key frame. The number M equals 1 if the moving object is in a predefined third-level area range of the candidate key frame.”, Col. 17, lines 3-7, “the movements are represented by velocity vectors (dx/dt, dy/dt) that describe how quickly a pixel (or a group of pixels) is moving across an image, and the direction of pixel movement.”, the moving object corresponds to the visual elements and the movement corresponds to the visual arrangement or position of the moving object between the adjacent frames, since the moving object is changing position in each frame, there is a delta or change in visual arrangement, Lin teaches that the score is based on the moving object in the frames). 
Fajtl and Lin are both considered to be analogous to the claimed invention because they are in the same field of video summarization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the method as taught by Fajtl to incorporate the teachings of Lin wherein the plurality of sub-scores comprises a layout change score that measures a delta in arrangement of the one or more visual elements of adjacent frames. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to develop a visually pleasing layout (Lin, Col. 47, lines 24-25).

Fajtl does not explicitly disclose generating a second final summarization of the multimedia content based on comparing the generated importance scores to an importance threshold and merging a subset of the plurality of frames associated with threshold-satisfying importance scores into the second final summarization.
	However, Chen discloses generating a second final summarization of the multimedia content (para. 0174, “The extracted portions of the video segments can then be combined (2410) and encoded to create a video segment that is a summary of all of the related video segments”) based on comparing the generated importance scores to an importance threshold (para. 0187, “score thresholds can be determined (2476) and can used to filter out video clips. Video clips that are scored below the threshold value can be dropped from the video summary sequence”, so only the video clips that meet the threshold are kept) and merging a subset of the plurality of frames associated with threshold-satisfying importance scores (para. 0181, “ordering of video segments can be achieved by generating scoring data”, para. 0177, “the importance of video clips is scored in order to generate a relevant video summary sequence”) into the second final summarization (para. 0174, “The extracted portions of the video segments can then be combined (2410) and encoded to create a video segment that is a summary of all of the related video segments”, para. 0175, “any of a variety of processes can be utilized to identify and score individual video clips extracted from a video segment for the purpose of combining video clips”).
Fajtl and Chen are both considered to be analogous to the claimed invention because they are in the same field of video summarization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the process as taught by Fajtl to incorporate the teachings of Chen of generating a second final summarization of the multimedia content based on comparing the generated importance scores to an importance threshold and merging frames associated with threshold-satisfying importance scores into the second final summarization. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to generate a relevant video summary sequence by using scoring data (Chen, para. 0177).
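For illustration only, the threshold comparison and merging recited in the claim can be sketched as a filter over scored frames that preserves their original order. This is a minimal sketch, not Chen's or Fajtl's implementation; the function name `summarize_by_threshold` and the choice of an inclusive comparison are assumptions.

```python
# Illustrative sketch only: names and the inclusive comparison are assumptions.
def summarize_by_threshold(frames, importance_scores, threshold):
    """Keep, in original order, the frames whose score satisfies the threshold."""
    if len(frames) != len(importance_scores):
        raise ValueError("each frame needs exactly one importance score")
    return [
        frame
        for frame, score in zip(frames, importance_scores)
        if score >= threshold  # frames scored below the threshold are dropped
    ]


# Hypothetical frame identifiers and scores.
summary = summarize_by_threshold(["f0", "f1", "f2", "f3"], [0.9, 0.2, 0.5, 0.7], 0.5)
```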

Regarding claim 18, the combination of Fajtl in view of Lin and in further view of Chen discloses the process for training a machine learning model for summarizing multimedia content of claim 17 (Fajtl, Abstract, “we propose a simple, self-attention based network for video summarization which performs the entire sequence to sequence transformation in a single feed forward pass and single backward pass during training”), wherein the plurality of sub-scores comprises a chapter score (Chen, para. 0195, “video segments are scored based upon a variety of factors including the number of related stories”, all the video segments are scored, and the segments correspond to chapters) and a topic score (Chen, para. 0181, “Scoring metrics can be any value assigned to a video clip that can represent the relative importance and/or relevance of a video clip as compared to other video clips with respect to a specific topic and/or subject”).
Fajtl and Chen are both considered to be analogous to the claimed invention because they are in the same field of video summarization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the process as taught by Fajtl to incorporate the teachings of Chen wherein the plurality of sub-scores comprises a chapter score and a topic score. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to generate a relevant video summary sequence by using scoring data (Chen, para. 0177).

Regarding claim 19, the combination of Fajtl in view of Lin and in further view of Chen discloses the process for training a machine learning model for summarizing multimedia content of claim 18 (Fajtl, Abstract, “we propose a simple, self-attention based network for video summarization which performs the entire sequence to sequence transformation in a single feed forward pass and single backward pass during training”), wherein the plurality of sub-scores (Fajtl, Section 2.1, “importance values are defined as attention weights and are commonly estimated simultaneously with other model parameters trained for a specific objective”, there is an attention weight for each input feature, which means there is a plurality of sub-scores corresponding to the input features) further comprises a spoken text change score (Chen, para. 0023, “generating scoring data using a video summarization system, wherein the scoring data includes at least one scoring metric for each video clip in the set of video clips, where the at least one scoring metric describes the at least one key feature of each video clip utilized to determine the relative importance of each video clip within the set of video clips wherein the at least one scoring metric includes at least one audio metric, at least one visual metric, and at least one textual metric”, one of the scoring metrics is a textual metric, which corresponds to the spoken text change score) and a layout change score (Chen, para. 0182, “Scoring data be generated (2474) for each video clip based upon the extracted key features. Importance of a video clip can be determined based upon key features. In some embodiments, motion data, such as optical flow, motion vectors, or pixel differences between frames of a video clip can indicate importance”).
Fajtl and Chen are both considered to be analogous to the claimed invention because they are in the same field of video summarization. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the process as taught by Fajtl to incorporate the teachings of Chen wherein the plurality of sub-scores comprises a spoken text change score and a layout change score. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to generate a relevant video summary sequence by using scoring data (Chen, para. 0177).

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DENISE G ALFONSO whose telephone number is (571)272-1360. The examiner can normally be reached Monday - Friday 7:30 - 5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Claire Wang, can be reached at 571-270-1051. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/DENISE G ALFONSO/Examiner, Art Unit 2663
/CLAIRE X WANG/Supervisory Patent Examiner, Art Unit 2663