DETAILED ACTION
	Receipt of Applicant’s Amendment, filed June 27, 2022, is acknowledged.
Claims 1, 8, 10, 12, 13, 16, 23 and 24 were amended.
Claims 4-6, 9 and 20 were cancelled.
Claim 25 was newly added.
Claims 1-3, 7, 8, 10-19, and 21-25 are pending in this office action.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-3, 7, 8, 22, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Issa [2009/0265737] in view of Soni [2018/0089203], Ronfard [Conceptual indexing of television images based on face and caption sizes and locations], and Bennett [7668610].

With regard to claim 1 Issa teaches A method comprising (See Issa abstract, invention embodied as a method for obtaining key frames of video content items and publishing to other users for navigation of video): 
Determining (Issa, ¶30 “First, key frame information for a video content item is obtained”), by a computing device (Issa, ¶34 “the frames of the video content item may be programmatically analyzed to identify frames that are well suited to being used as key frames”), a plurality of keyframes (Issa, ¶31 “key frames corresponds to or is representative of a segment of the video content item”) from a portion of a content asset as the video content item (Id); 
…
determining, … a time reference as the time offset (Issa, ¶36 “the key frame information may include information identifying the key frames of the video content item obtained… This information may be, for example, a time offset from a start of playback of the video content item”) associated with the keyframe as the key frame information includes the time offset (Id), a segment label for the portion of the content asset as annotations (Issa,  ¶37 “annotations may optionally be created for the segments of the video content item represented by the key frames”); and 
generating metadata as the published information (Issa, ¶37 “The key frame information and, optionally, the annotations are then published… such that the key frames of the video content item and, optionally, the annotations are presented to the one or more second users”; ¶42 “Next, the key frames, and optionally any associated annotations, are presented to a second user… Note that if the key frame information includes references to the key frames, the key frames are obtained from one or more remote sources hosting the key frames utilizing the references from the key frame information”), wherein the metadata (Id) indicates an association between the segment label as annotations (Issa, ¶37 “annotations may optionally be created for the segments of the video content item represented by the key frames”) and the portion of the content asset as the video content item (Issa, ¶31 “key frames corresponds to or is representative of a segment of the video content item”) and wherein the metadata as the published information (Issa, ¶37, ¶42) facilitates navigation to the portion of the content asset (Issa, ¶43 “the second user may select a key frame published for a video content item… and choose to begin playback … of that video content item at the segment corresponding to the selected key frame”; “the second user may be enabled to search for more video content items… corresponding to the selected key frame”).
Issa does not explicitly teach determining, based on at least one of facial-recognition or object-recognition, a quantity of an element in at least one keyframe of the plurality of keyframes; determining, based on …of the element satisfying a threshold … a segment label.
Soni teaches determining, based on at least one of facial-recognition or object-recognition (Soni, ¶50 “detects content features of the key frames by analyzing the key frames with content feature recognition technology (e.g., object recognition technology)… For example, the content feature recognition technology can recognize (e.g., detect) the content features depicted in the key frames using machine learning (e.g., deep learning)”; ¶49 “detect objects… photo types (e.g., macros, portraits)”; ¶52 “in the case of a content feature including a person”; ¶53 “can identify that a content features detected with in a key frame is Babe Ruth (e.g., the name), a person (e.g., type), and/or a man (e.g., category)”), a quantity of an element (Soni, ¶19 “the media system identifies content features (e.g., objects, activities, emotions, animal, sceneries, locations, colors) depicted in a group of key frames from the video content”) in at least one keyframe of the plurality of keyframes as the key frames (Id); 
determining, based on the quantity of… the element as the content feature (Soni, ¶53; ¶54 “determine confidence values for each identified content feature”) satisfying a threshold (Soni, ¶56 “a defined threshold confidence value… the media system 108 can elect to not identify the content features… based on the confidence value being below the threshold, that the probability of the content feature being accurate is not sufficient”) … a segment label for the portion of the content asset as the categories associated with the key frame based on the content feature (Soni, ¶53 “based on, detecting and identifying characteristics of a content feature within a key frame, the media system 108 can identify a name, type, or a category for a content feature depicted within a key frame… associate identification and characteristic data for a content feature with a key frame that includes the content features”).
It would have been obvious to one of ordinary skill to which said subject matter pertains at the time the invention was filed to have implemented the tagging/annotation system taught by Issa using the automatic categorization features taught by Soni as it yields the predictable results of providing a means to automatically tag/annotate the content without human intervention.  Within the device taught by Issa the annotations are expected to be created by the user (Issa, ¶37).  The proposed combination is modifying Issa, to use the techniques taught by Soni to automatically generate category annotations for the key frames (Soni, ¶53-56).  This enables the annotations to be more efficient and more accurate as the process is automated and does not require a human user, and is more consistent as the labeling is not subject to individual user whims.
Issa in view of Soni does not explicitly teach a quantity of the element.
Ronfard teaches determining, based on the quantity as the number of face features detected, such as the example number 2 for an interview (Ronfard, Section 4 “Shot classification”, Page 4, “Frames containing detected faces or text areas may be searched according to the number, the size, or the position of these features, looking for specific classes of scenes”; Page 5 “Number and size of detected faces may characterize a big audience (multiple faces), an interview (two medium size faces) or a close-up view of a speaker (a large size face)”)  of the element satisfying a threshold as the example number 2 being categorized as an interview (Id) …associated with the keyframe as the frames containing the detected faces (Id), a segment label as the classification, such as “big audience”, “interview”, or “close-up” (Id).
It would have been obvious to one of ordinary skill to which said subject matter pertains at the time the invention was filed to have implemented the key frame classification/label taught by the combination of Issa-Soni using the classification techniques taught by Ronfard to provide a powerful video classification functionality that is more meaningful to browsing/searching users.  The use of the techniques taught by Ronfard may enable the frames to be “searched according to the number, the size, or the positions of these features” (Page 4, Section 4 “Shot classification”).  The proposed combination enables frames to be more granularly classified and labeled, in the manner in which users are expected to search, browse, and consume the content, thereby allowing users to more effectively discover and understand the content.
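For illustration only, the kind of face-count classification cited from Ronfard Section 4 (big audience, interview, close-up) can be sketched as follows. This is not code from Ronfard; the `Face` type, the `area_fraction` measure, the size cutoffs, and the rule that three or more faces counts as "multiple faces" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Face:
    # Hypothetical size measure: fraction of the frame area covered by the face.
    area_fraction: float


def label_keyframe(faces, large=0.15, medium=0.03):
    """Map the number and size of detected faces to a coarse label.

    The cutoffs `large` and `medium` are illustrative assumptions.
    """
    count = len(faces)
    if count == 0:
        return "no-face"
    if count == 1 and faces[0].area_fraction >= large:
        return "close-up"          # a single large face
    if count == 2 and all(medium <= f.area_fraction < large for f in faces):
        return "interview"         # two medium-size faces
    if count >= 3:
        return "big audience"      # multiple faces (assumed to mean 3 or more)
    return "unclassified"
```

Under these assumptions, a keyframe with two medium-size faces is labeled "interview", which mirrors the example the rejection relies on for a quantity satisfying a threshold.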
Issa does not explicitly teach that the segment label is determined based on the time reference associated with the key frame.  To be clear, Issa teaches a time reference as the time offset (Issa, ¶36 “the key frame information may include information identifying the key frames of the video content item obtained… This information may be, for example, a time offset from a start of playback of the video content item”) associated with the key frame as the key frame information includes the time offset (Id), and teaches determining a segment label as annotations (Issa, ¶37 “annotations may optionally be created for the segments of the video content item represented by the key frames”) but does not teach that these annotations are determined based on the time offset.
Bennett teaches determining, based on … a time reference as the beginning, end, last (Bennett, Column 6, line 19 “Intro: An intro portion may start at the beginning”; lines 27-28 “Bridge: A bridge portion commonly occurs within an audio stream other than the beginning or end”; lines 37-38 “Outro: An outro portion may include the last portion of an audio stream and generally trails off of the last chorus”) associated with the key frame as the portion of the content (Id), a segment label as Intro, Bridge, Outro (Id) for the portion of the content asset as the content (Id; Bennett, Column 2, lines 12-15 “As used herein, “electronic media” may refer to different forms of… video information, such as … television, video recordings… It should be understood that the description may equally apply to other forms of electronic media, such as video streams or files”).
It would have been obvious to one of ordinary skill to which said subject matter pertains at the time the invention was filed to have implemented the proposed device to incorporate the classification techniques taught by Bennett as it yields the predictable results of being able to classify the Intro, Bridge, and Outro portions of content.  
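For illustration only, determining a label from a time reference can be sketched as below. The sketch is not taken from Bennett or Issa; the function name, the `edge_fraction` parameter, and the placeholder label "Other" for positions away from the beginning and end are assumptions.

```python
def label_by_time_offset(offset_s, duration_s, edge_fraction=0.1):
    """Return a coarse label from a key frame's time offset.

    `edge_fraction` (an assumed value) sets how much of the asset at each
    end is treated as the beginning or the last portion.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if not 0 <= offset_s <= duration_s:
        raise ValueError("offset_s must lie within the asset")
    position = offset_s / duration_s
    if position < edge_fraction:
        return "Intro"   # starts at the beginning
    if position >= 1 - edge_fraction:
        return "Outro"   # falls in the last portion
    return "Other"       # placeholder for any interior portion
```

A time offset of the kind Issa stores with each key frame (¶36) would be the `offset_s` input here.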

With regard to claim 2 the proposed combination further teaches wherein the element comprises at least one of a face, an object, or an advertisement (See Soni [0032] elements (i.e. features) comprise objects, persons, etc. See also Ronfard Section 3.1 elements comprise faces. Note that a “face” is also an “object” under the broadest reasonable interpretation of “object.”).

With regard to claim 3 the proposed combination further teaches wherein determining a keyframe of the plurality of keyframes is based on at least one of: a color histogram for a frame of the content asset, or a quantity of changes between a plurality of frames of the content asset.  (See Soni [0042]-[0046] wherein content and non-content based methods for determining keyframes are provided including as in [0045] based on “color” as analyzed in [0046] based on “histogram similarity” which is based on a color histogram of the frames (i.e. between frames). Further as in the same portions of Soni the “changes in imagery from one frame to a next” (i.e. quantity of changes) are compared to determine keyframes such as by inter-frames entropy which by definition determines keyframes based on the appearance and the disappearance of one or more significant objects, that is quantity of changes. See also Ronfard Section 3 color content).
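For illustration only, keyframe selection based on color-histogram similarity between frames can be sketched as follows. Soni refers to “histogram similarity” without the detail shown here; the use of histogram intersection, the bin count, the 0.7 threshold, and the representation of a frame as a list of (r, g, b) tuples are assumptions.

```python
def color_histogram(pixels, bins=4):
    """Normalized RGB histogram of a frame given as (r, g, b) tuples (0-255)."""
    hist = [0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        index = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[index] += 1
    total = len(pixels) or 1  # avoid division by zero on an empty frame
    return [count / total for count in hist]


def histogram_similarity(h1, h2):
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def select_keyframes(frames, threshold=0.7):
    """Return indices of frames whose histogram differs enough from the last keyframe."""
    keyframes = []
    last_hist = None
    for index, pixels in enumerate(frames):
        hist = color_histogram(pixels)
        if last_hist is None or histogram_similarity(hist, last_hist) < threshold:
            keyframes.append(index)
            last_hist = hist
    return keyframes
```

In this sketch a new keyframe is chosen whenever the color content changes markedly from the previously chosen keyframe, which is one way a quantity of change between frames can drive the selection.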

With regard to claim 7 the proposed combination further teaches wherein the element comprises an object and wherein determining the segment label comprises applying, based on the quantity of the object identified by the object-recognition, an image classifier to the at least one keyframe of the plurality of keyframes (See Soni [0053]-[0057] and particularly [0055] applying an image classifier to the keyframe. Moreover, note Ronfard as in Section 4 and 4.1 shot/image classification applied based on defined classes of faces/objects representing specific labels/categories, such as “an interview (2 medium size faces)”).  

With regard to claim 8 the proposed combination further teaches wherein the element comprises an object and wherein determining the segment label comprises determining, based on a quantity of matches between at least one object of a segment profile and the at least one object identified by the object-recognition satisfying a second threshold, the segment label (See Ronfard as in Section 4 and 4.1 shot/image classification applied based on defined classes of faces/objects representing specific labels/categories.  A single large face is classified as a close-up view of a speaker).

With regard to claim 22 the proposed combination further teaches wherein the element comprises faces and wherein based on the quantity of faces satisfying the threshold, the segment label comprises at least one of an advertisement, an interview, or a monologue (See Ronfard Section 4.1 wherein the element is a face and the quantity of faces satisfying the threshold of “two” the segment label is “interview” or as in Section 4.3 “interview shot”).  

Claims 10-19, 21, 24, and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Issa [2009/0265737] in view of Soni [2018/0089203] and Ronfard [Conceptual indexing of television images based on face and caption sizes and locations].

With regard to claim 10 Issa teaches A method comprising (See Issa abstract, invention embodied as a method for obtaining key frames of video content items and publishing to other users for navigation of video): 
Determining (Issa, ¶30 “First, key frame information for a video content item is obtained”), by a computing device (Issa, ¶34 “the frames of the video content item may be programmatically analyzed to identify frames that are well suited to being used as key frames”), from a portion of a content asset as the video content item (Issa, ¶31 “key frames corresponds to or is representative of a segment of the video content item”), a plurality of keyframes as the key frames (Id); 
…
determining,… a segment profile indicating a category of segment in the content asset as annotations (Issa,  ¶37 “annotations may optionally be created for the segments of the video content item represented by the key frames”); and 
generating metadata as the published information (Issa, ¶37 “The key frame information and, optionally, the annotations are then published… such that the key frames of the video content item and, optionally, the annotations are presented to the one or more second users”; ¶42 “Next, the key frames, and optionally any associated annotations, are presented to a second user… Note that if the key frame information includes references to the key frames, the key frames are obtained from one or more remote sources hosting the key frames utilizing the references from the key frame information”), wherein the metadata indicates an association between the category of the segment as annotations (Issa, ¶37 “annotations may optionally be created for the segments of the video content item represented by the key frames”) and the portion of the content asset as the video content item (Issa, ¶31 “key frames corresponds to or is representative of a segment of the video content item”) and wherein the metadata facilitates navigation to the portion of the content asset (Issa, ¶43 “the second user may select a key frame published for a video content item… and choose to begin playback … of that video content item at the segment corresponding to the selected key frame”; “the second user may be enabled to search for more video content items… corresponding to the selected key frame”).
Issa does not explicitly teach determining, based on the plurality of keyframes, a first plurality of disparate objects; determining, based on a quantity of matches between the first plurality of disparate objects and a second plurality of disparate objects satisfying a threshold.
Soni teaches determining, based on the plurality of keyframes, a first plurality of disparate objects (Soni, ¶50 “detects content features of the key frames by analyzing the key frames with content feature recognition technology (e.g., object recognition technology)… For example, the content feature recognition technology can recognize (e.g., detect) the content features depicted in the key frames using machine learning (e.g., deep learning)”; ¶19 “the media system identifies content features (e.g., objects, activities, emotions, animal, sceneries, locations, colors) depicted in a group of key frames from the video content”); 
determining, based on a … of matches as the content feature (Soni, ¶53; ¶54 “determine confidence values for each identified content feature”)… satisfying a threshold (Soni, ¶56 “a defined threshold confidence value… the media system 108 can elect to not identify the content features… based on the confidence value being below the threshold, that the probability of the content feature being accurate is not sufficient”), a segment profile indicating a category of segment in the content asset as the categories associated with the key frame based on the content feature (Soni, ¶53 “based on, detecting and identifying characteristics of a content feature within a key frame, the media system 108 can identify a name, type, or a category for a content feature depicted within a key frame… associate identification and characteristic data for a content feature with a key frame that includes the content features”). 
It would have been obvious to one of ordinary skill to which said subject matter pertains at the time the invention was filed to have implemented the tagging/annotation system taught by Issa using the automatic categorization features taught by Soni as it yields the predictable results of providing a means to automatically tag/annotate the content without human intervention.  Within the device taught by Issa the annotations are expected to be created by the user (Issa, ¶37).  The proposed combination is modifying Issa, to use the techniques taught by Soni to automatically generate category annotations for the key frames (Soni, ¶53-56).  This enables the annotations to be more efficient and more accurate as the process is automated and does not require a human user, and is more consistent as the labeling is not subject to individual user whims.
Issa in view of Soni does not explicitly teach a quantity of matches between the first plurality of disparate objects and a second plurality of disparate objects.
Ronfard teaches a quantity of matches as the number of face features detected, such as the example number 2 for an interview (Ronfard, Section 4 “Shot classification”, Page 4, “Frames containing detected faces or text areas may be searched according to the number, the size, or the position of these features, looking for specific classes of scenes”; Page 5 “Number and size of detected faces may characterize a big audience (multiple faces), an interview (two medium size faces) or a close-up view of a speaker (a large size face)”) between the first plurality of disparate objects as a first face in the frame with an interview (Id) and a second plurality of disparate objects as a second face in the frame with an interview (Id)
satisfying a threshold as the example number 2 being categorized as an interview (Id), a segment profile indicating a category as the classification, such as “big audience”, “interview”, or “close-up” (Id) of segment in the content asset as the frames containing the detected faces (Id). 
It would have been obvious to one of ordinary skill to which said subject matter pertains at the time the invention was filed to have implemented the key frame classification/label taught by the combination of Issa-Soni using the classification techniques taught by Ronfard to provide a powerful video classification functionality that is more meaningful to browsing/searching users.  The use of the techniques taught by Ronfard may enable the frames to be “searched according to the number, the size, or the positions of these features” (Page 4, Section 4 “Shot classification”).  The proposed combination enables frames to be more granularly classified and labeled, in the manner in which users are expected to search, browse, and consume the content, thereby allowing users to more effectively discover and understand the content.
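For illustration only, the claim 10 limitation of a quantity of matches between two pluralities of objects satisfying a threshold can be sketched as follows. The sketch is not drawn from Issa, Soni, or Ronfard; the function names, the profile contents in the usage note, and the threshold of 2 are assumptions.

```python
from collections import Counter


def count_matches(detected, profile_objects):
    """Count objects common to both collections, respecting duplicates."""
    overlap = Counter(detected) & Counter(profile_objects)
    return sum(overlap.values())


def select_segment_profile(detected, profiles, threshold=2):
    """Return the best-matching profile label, or None if no profile
    reaches `threshold` matches (an assumed cutoff)."""
    best_label, best_count = None, 0
    for label, profile_objects in profiles.items():
        matches = count_matches(detected, profile_objects)
        if matches >= threshold and matches > best_count:
            best_label, best_count = label, matches
    return best_label
```

For example, with hypothetical profiles `{"interview": ["face", "face", "microphone"], "weather": ["map", "face"]}`, detected objects `["face", "face", "chair", "microphone"]` produce three matches against the "interview" profile and one against "weather", so "interview" is selected.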

With regard to claim 11 the proposed combination further teaches wherein determining the first plurality of disparate objects comprises applying an image classifier to the plurality of keyframes (See Soni [0053]-[0057] and particularly [0055] applying an image classifier to the keyframe. Moreover, note Ronfard as in Section 4 and 4.1 shot/image classification applied based on defined classes of faces/objects representing specific labels/categories, such as “an interview (2 medium size faces)”).  

With regard to claim 12 the proposed combination further teaches wherein determining the first plurality of disparate objects comprises: 
determining, based on object-recognition (Soni, ¶50 “detects content features of the key frames by analyzing the key frames with content feature recognition technology (e.g., object recognition technology)… For example, the content feature recognition technology can recognize (e.g., detect) the content features depicted in the key frames using machine learning (e.g., deep learning)”; ¶49 “detect objects… photo types (e.g., macros, portraits)”), for at least one object of the first plurality of disparate objects, a confidence score of a plurality of confidence scores (Soni, ¶53; ¶54 “determine confidence values for each identified content feature”), wherein the plurality of confidence scores indicate an association to one or more identifiable objects (Soni, ¶53 “based on, detecting and identifying characteristics of a content feature within a key frame, the media system 108 can identify a name, type, or a category for a content feature depicted within a key frame… associate identification and characteristic data for a content feature with a key frame that includes the content features”); and 
determining that at least one confidence score of the plurality of confidence scores satisfies a second threshold (See Soni [0056] wherein confidence score satisfies a threshold or otherwise object/feature is not identified; See Ronfard as in Section 4 and 4.1 shot/image classification applied based on defined classes of faces/objects representing specific labels/categories.  A single large face is classified as a close-up view of a speaker).

With regard to claim 13 the proposed combination further teaches wherein determining the plurality of keyframes is based on determining a quantity of changes between a plurality of frames of the content asset (See Soni [0042]-[0046] wherein content and non-content based methods for determining keyframes are provided including based on the “changes in imagery from one frame to a next” (i.e. quantity of changes) are compared to determine keyframes such as by inter-frames entropy which by definition determines keyframes based on the appearance and the disappearance of one or more significant objects, that is quantity of changes).  

With regard to claim 14 the proposed combination further teaches wherein determining the plurality of confidence scores comprises: 
generating a data structure comprising a multi-dimensional vector (Soni, Figure 4; ¶54-¶57), wherein at least one dimension of the multi-dimension vector corresponds to an identifiable object of the one or more identifiable objects (Soni, Figure 4, see the Content Features A-F); and 
storing each confidence score of the plurality of confidence scores in a corresponding dimension of the multi-dimension vector (Figure 4, see the confidence score % listed).  
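For illustration only, a multi-dimensional vector in which each dimension holds the confidence score for one identifiable object, together with a threshold check of the kind cited from Soni ¶56, can be sketched as follows. The object names, the function names, and the 0.6 threshold are assumptions and do not come from Soni Figure 4.

```python
# Hypothetical list of identifiable objects; each index is one vector dimension.
IDENTIFIABLE_OBJECTS = ["person", "dog", "car", "microphone"]


def build_confidence_vector(scores):
    """Store each confidence score in the dimension of its object.

    `scores` maps object name to confidence in [0, 1]; missing objects get 0.0.
    """
    return [float(scores.get(name, 0.0)) for name in IDENTIFIABLE_OBJECTS]


def objects_above_threshold(vector, threshold=0.6):
    """Keep only objects whose confidence meets the (assumed) threshold."""
    return [
        name
        for name, score in zip(IDENTIFIABLE_OBJECTS, vector)
        if score >= threshold
    ]
```

Objects whose score falls below the threshold are simply not reported, which corresponds to the cited behavior of electing not to identify a low-confidence content feature.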

With regard to claim 15 the proposed combination further teaches determining, based on the plurality of keyframes and facial-recognition (Soni, ¶50 “detects content features of the key frames by analyzing the key frames with content feature recognition technology (e.g., object recognition technology)… For example, the content feature recognition technology can recognize (e.g., detect) the content features depicted in the key frames using machine learning (e.g., deep learning)”; ¶49 “detect objects… photo types (e.g., macros, portraits)”; ¶52 “in the case of a content feature including a person”; ¶53 “can identify that a content features detected with in a key frame is Babe Ruth (e.g., the name), a person (e.g., type), and/or a man (e.g., category)”), a quantity of faces associated with the portion of the content asset (See Ronfard as in Section 2 “keyframes may have faces” and note that faces are also generally an “object” under the broadest reasonable interpretation of that term. Then as in Section 3.1 “face detection” is performed and determines faces in key frames of the video. Most specifically, as in Section 4, “the number … of these features” is identified in the keyframe, including explicitly “Number … of detected faces”); and 
determining, based on the quantity of faces, the segment profile (See Ronfard as in Section 4 keyframes that represent segments/shots are classified (i.e. labeled) including based on “the number … of these features” identified in the keyframe including explicitly “Number … of detected faces” equaling a threshold of “two” representing the label/category of an “interview” shot/segment. As in Section 4.1 keyframe classes/labels are defined “based on the number … of their detected faces.” See also 4.3 shot/segment classified based on key frame classification).  

With regard to claim 16 Issa teaches A method comprising (See Issa abstract, invention embodied as a method for obtaining key frames of video content items and publishing to other users for navigation of video): 
…
determining (Issa, ¶30 “First, key frame information for a video content item is obtained”), from a portion of a content asset as the video content item (Issa, ¶31 “key frames corresponds to or is representative of a segment of the video content item”), a plurality of keyframes as the key frames (Id); 
…
generating, …metadata as the published information (Issa, ¶37 “The key frame information and, optionally, the annotations are then published… such that the key frames of the video content item and, optionally, the annotations are presented to the one or more second users”; ¶42 “Next, the key frames, and optionally any associated annotations, are presented to a second user… Note that if the key frame information includes references to the key frames, the key frames are obtained from one or more remote sources hosting the key frames utilizing the references from the key frame information”), wherein the metadata indicates an association between a segment label as annotations (Issa, ¶37 “annotations may optionally be created for the segments of the video content item represented by the key frames”) …and the portion of the content asset as the video content item (Issa, ¶31 “key frames corresponds to or is representative of a segment of the video content item”) and wherein the metadata facilitates navigation to the portion of the content asset (Issa, ¶43 “the second user may select a key frame published for a video content item… and choose to begin playback … of that video content item at the segment corresponding to the selected key frame”; “the second user may be enabled to search for more video content items… corresponding to the selected key frame”).
Issa does not explicitly teach determining, by a computing device, one or more keywords …, wherein the one or more keywords are associated with a plurality of identifiable objects of an image classifier; determining, based on the plurality of keyframes and the image classifier, a plurality of objects from the portion of the content asset; and generating, based on a similarity between the plurality of objects and the one or more keywords, … a segment label of … the portion of the content asset.
Soni teaches determining, by a computing device, one or more keywords as the words that make up the name, type, or category (Soni, ¶53 “the media system 108 can identify a name, a type, or a category for the content features depicted within a key frame”) …, wherein the one or more keywords are associated with a plurality of identifiable objects of an image classifier (Soni, ¶53 “associate the identification and the characteristic data for the content feature with a key frame that includes the content feature”; ¶49 “detects content features… detects objects”); 
determining, based on the plurality of keyframes and the image classifier, a plurality of objects from the portion of the content asset (Soni, ¶49 “detects content features included and/or depicted in the key frames… detect objects”); and 
generating, based on a similarity as the content feature (Soni, ¶53; ¶54 “determine confidence values for each identified content feature”) between the plurality of objects as the identified content features (Id) and the one or more keywords, as the terms of the label (Soni, ¶53) … a segment label as the determined categorization (Soni, ¶53 “can identify a name, a type, or a category for a content feature depicted within a key frame”) … and the portion of the content asset as the key frame (Id).
It would have been obvious to one of ordinary skill to which said subject matter pertains at the time the invention was filed to have implemented the tagging/annotation system taught by Issa using the automatic categorization features taught by Soni as it yields the predictable results of providing a means to automatically tag/annotate the content without human intervention.  Within the device taught by Issa the annotations are expected to be created by the user (Issa, ¶37).  The proposed combination is modifying Issa, to use the techniques taught by Soni to automatically generate category annotations for the key frames (Soni, ¶53-56).  This enables the annotations to be more efficient and more accurate as the process is automated and does not require a human user, and is more consistent as the labeling is not subject to individual user whims.
Issa in view of Soni does not explicitly teach one or more keywords of a natural language description of a segment profile … a segment label of the segment profile.
Ronfard teaches one or more keywords of a natural language description of a segment profile as the two medium size faces (Ronfard, Page 5, section 4 “an interview (two medium size faces)”; Note this term has been interpreted in light of Paragraph [0060] of the original specification: “A segment profile may identify a segment label and one or more attributes indicative of the particular segment… A segment profile for an ‘interview’ segment may identify ‘two’ faces”) … a segment label as the label “interview” (Id) of the segment profile as the two medium size faces (Id).
It would have been obvious to one of ordinary skill to which said subject matter pertains at the time the invention was filed to have implemented the key frame classification/label taught by the combination of Issa-Soni using the classification techniques taught by Ronfard to provide a powerful video classification functionality that is more meaningful to browsing/searching users.  The use of the techniques taught by Ronfard may enable the frames to be “searched according to the number, the size, or the positions of these features” (Page 4, Section 4 “Shot classification”).  The proposed combination enables frames to be more granularly classified and labeled, in the manner in which users are expected to search, browse, and consume the content, thereby allowing users to more effectively discover and understand the content.
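For illustration only, generating a segment label based on a similarity between detected objects and the keywords of a segment profile can be sketched as follows. None of the cited references specify this computation; the use of Jaccard similarity, the function names, the profile keywords in the usage note, and the 0.5 cutoff are assumptions.

```python
def keyword_similarity(objects, keywords):
    """Jaccard similarity between object names and profile keywords."""
    object_set = {name.lower() for name in objects}
    keyword_set = {word.lower() for word in keywords}
    if not object_set or not keyword_set:
        return 0.0
    return len(object_set & keyword_set) / len(object_set | keyword_set)


def label_from_keywords(objects, profiles, min_similarity=0.5):
    """Return the label of the most similar profile, or None if no profile
    reaches `min_similarity` (an assumed cutoff)."""
    best_label, best_score = None, 0.0
    for label, keywords in profiles.items():
        score = keyword_similarity(objects, keywords)
        if score >= min_similarity and score > best_score:
            best_label, best_score = label, score
    return best_label
```

For example, with hypothetical profiles `{"interview": ["face", "microphone"], "sports": ["ball", "field", "player"]}`, the detected objects `["Face", "microphone", "chair"]` share two of three distinct terms with the "interview" keywords, so that label is returned.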

With regard to claim 17 the proposed combination further teaches wherein the metadata further indicates an association between a plurality of segment labels as annotations (Issa, ¶37 “annotations may optionally be created for the segments of the video content item represented by the key frames”) and a plurality of portions of the content asset as the video content item (Issa, ¶31 “key frames corresponds to or is representative of a segment of the video content item”), wherein the segment label of the segment profile is a segment label of the plurality of segment labels and the portion of the content asset is a portion of the plurality of portions of the content asset (Ronfard, Page 5, Section 4 “an interview (two medium size faces)”).  

With regard to claim 18 the proposed combination further teaches wherein the segment profile further indicates a quantity of faces associated with the portion of the content asset (See Ronfard Section 4 segment profile indicates quantity of faces is two).  

With regard to claim 19 the proposed combination further teaches wherein determining the plurality of keyframes is based on at least one of: 
a color histogram for a frame of the content asset; or a quantity of changes between a plurality of frames of the content asset (See Soni [0042]-[0046] wherein content and non-content based methods for determining keyframes are provided including as in [0045] based on “color” as analyzed in [0046] based on “histogram similarity” which is based on a color histogram of the frames (i.e. between frames). Further as in the same portions of Soni the “changes in imagery from one frame to a next” (i.e. quantity of changes) are compared to determine keyframes such as by inter-frames entropy which by definition determines keyframes based on the appearance and the disappearance of one or more significant objects, that is quantity of changes. See also Ronfard Section 3 color content).  
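For illustration only, the following is a minimal sketch of one way keyframes could be determined from a color histogram comparison between frames. It is not taken from Soni or Ronfard. The function names, the representation of a frame as a list of quantized color indices, the use of histogram intersection as the similarity measure, and the threshold value are all assumptions made for the example.

```python
def color_histogram(pixels, bins):
    """Normalized histogram of quantized color indices (assumed in range(bins))."""
    hist = [0] * bins
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    return [h / total for h in hist]


def histogram_similarity(h1, h2):
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def select_keyframes(frames, bins, threshold):
    """Return indices of frames whose histogram differs enough from the last keyframe.

    Hypothetical rule: a frame becomes a keyframe when its similarity to the
    most recent keyframe falls below `threshold`.
    """
    if not frames:
        return []
    keyframes = [0]  # the first frame is always kept in this sketch
    last = color_histogram(frames[0], bins)
    for i in range(1, len(frames)):
        current = color_histogram(frames[i], bins)
        if histogram_similarity(last, current) < threshold:
            keyframes.append(i)
            last = current
    return keyframes


# Four tiny "frames" of four pixels each, with colors quantized to 4 bins.
frames = [[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 3, 3], [2, 2, 3, 2]]
print(select_keyframes(frames, bins=4, threshold=0.5))
```

In this sketch the third frame shares no colors with the first, so it is selected as a new keyframe, while the second and fourth frames remain sufficiently similar to the preceding keyframe.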

With regard to claim 21 the proposed combination further teaches sending the metadata to a user device (See Issa Fig. 1, 204, Fig. 3, 302, and [0042]-[0043] key frames and metadata as annotations are sent/published to second user device.  See also [0069]-[0070] and [0105]-[0110]).  

With regard to claim 25 the proposed combination further teaches wherein determining, based on the plurality of keyframes and the image classifier (Soni, ¶49), a plurality of objects from the portion of the content asset comprises determining, based on the plurality of the keyframes, an aggregate number of each of the plurality of objects in the plurality of keyframes for the portion of the content asset (Ronfard as in Section 4 and 4.1 shot/image classification applied based on defined classes of faces/objects representing specific labels/categories, such as “an interview (2 medium size faces)”).

Claim 23 is rejected under 35 U.S.C. 103 as being unpatentable over Issa in view of Soni, Ronfard, Bennet and Hogg [2016/0042621].

With regard to claim 23 the proposed combination further teaches a plurality of elements that comprises the element (Soni, ¶32 “the term ‘content feature’ refers to a digital element is included and/or depicted in one or more frames of a video content”),
… a first element from the plurality of elements (Soni, ¶32 “the term ‘content feature’ refers to a digital element is included and/or depicted in one or more frames of a video content”)… 
Issa does not explicitly teach wherein the method further comprises filtering a first element from the plurality of elements, based on a quantity of keyframes of the plurality of keyframes in which the first element appears, not satisfying a quantity of keyframes threshold. 
Hogg teaches wherein the method further comprises filtering a first element… (Hogg, ¶196 “a filtering mechanism is used whenever a moving object is detected for a small number of frames - for example three or less, and can be ignored as unlikely to be the result of the motion of a real object”; ¶195), based on a quantity of keyframes of the plurality of keyframes in which the first element appears (Hogg, ¶196 “is detected for a small number of frames”; ¶195), not satisfying a quantity of keyframes threshold (Hogg, ¶196 “for example three or less”; ¶195). 
It would have been obvious to one of ordinary skill to which said subject matter pertains at the time the invention was filed to have implemented the proposed device to filter unlikely identified objects during the content feature detection operation as it yields the predictable results of reducing errors by excluding detections that are not likely real objects.  Hogg states that when an object appears in 3 frames of a camera operating at 15 frames per second, then the object would only have appeared for 0.2 seconds and that this is not characteristic of a real object (Hogg, ¶195). 
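For illustration only, the following is a minimal sketch of the kind of filtering discussed above: an element is dropped when the number of keyframes in which it appears does not exceed a small count. It is not Hogg's implementation. The function names, the representation of detections as one set of labels per keyframe, and the default cutoff of three (taken from Hogg's "three or less" example) are assumptions made for the example.

```python
from collections import Counter


def filter_elements(keyframe_detections, max_ignored_frames=3):
    """Keep only elements detected in more than `max_ignored_frames` keyframes.

    `keyframe_detections` is a hypothetical list with one collection of
    element labels per keyframe.
    """
    counts = Counter()
    for detected in keyframe_detections:
        counts.update(set(detected))  # count each element once per keyframe
    return {element for element, n in counts.items() if n > max_ignored_frames}


def visible_seconds(frame_count, frames_per_second):
    """Duration for which an element was visible, given its frame count."""
    return frame_count / frames_per_second


# "face" appears in four keyframes, "car" in only one.
detections = [{"face", "car"}, {"face"}, {"face"}, {"face"}]
print(filter_elements(detections))  # only "face" survives the filter
print(visible_seconds(3, 15))       # 3 frames at 15 frames per second
```

The second helper reproduces the arithmetic relied on above: 3 frames at 15 frames per second corresponds to 0.2 seconds.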

Claim 24 is rejected under 35 U.S.C. 103 as being unpatentable over Issa in view of Soni, Ronfard and Hogg [2016/0042621].
	With regard to claim 24, the proposed combination further teaches a first object from the plurality of objects (Soni, ¶32 “the term ‘content feature’ refers to a digital element is included and/or depicted in one or more frames of a video content”).  Issa does not explicitly teach filtering a first object from the plurality of objects, based on a quantity of keyframes of the plurality of keyframes in which the first object appears, not satisfying a quantity of keyframes threshold.  Hogg teaches filtering a first object (Hogg, ¶196 “a filtering mechanism is used whenever a moving object is detected for a small number of frames - for example three or less, and can be ignored as unlikely to be the result of the motion of a real object”; ¶195)…, based on a quantity of keyframes of the plurality of keyframes in which the first object appears (Hogg, ¶196 “a moving object is detected for a small number of frames”; ¶195), not satisfying a quantity of keyframes threshold (Hogg, ¶196 “for example three or less”; ¶195).  
It would have been obvious to one of ordinary skill to which said subject matter pertains at the time the invention was filed to have implemented the proposed device to filter unlikely identified objects during the content feature detection operation as it yields the predictable results of reducing errors by excluding detections that are not likely real objects.  Hogg states that when an object appears in 3 frames of a camera operating at 15 frames per second, then the object would only have appeared for 0.2 seconds and that this is not characteristic of a real object (Hogg, ¶195). 

Response to Arguments
Applicant's arguments filed June 27, 2022 have been fully considered but they are not persuasive. 

With regard to claim 1, applicant states that the prior art does not teach the newly added claim limitation.  This argument regarding the newly added limitations is addressed in the above rejections. (Page 8, argument A.i.)

With regard to claim 10, applicant states that the prior art does not teach “determining, based on a quantity of matches between the first plurality of disparate objects and a second plurality of disparate objects satisfying a threshold, a segment profile indicating a category of segment in the content asset”.  Applicant paraphrases the prior art, and states that the prior art does not teach the claim limitation. (Page 9 argument B.1.)
With regard to claim 16, applicant states that the prior art does not teach “generating, based on a similarity between the plurality of objects and the one or more keywords, metadata”.  Applicant paraphrases the prior art and states the prior art does not teach the claim limitation. (Page 10, Argument C.i)
With regard to claim 16, applicant states that the prior art does not teach “generating … metadata, wherein the metadata indicates an association between a segment label of the segment profile and the portion of the content asset and wherein the metadata facilitates navigation of the portion of the content asset”. Applicant paraphrases the prior art and states the prior art does not teach the claim limitation. (Page 10, Argument C.ii)
Applicant's arguments do not comply with 37 CFR 1.111(c) because they do not clearly point out the patentable novelty which he or she thinks the claims present in view of the state of the art disclosed by the references cited or the objections made. Further, they do not show how the amendments avoid such references or objections.  The distinction applicant sees between the prior art and the claim language is unclear.

Applicant argues that the motivation to combine Issa and Soni is not supported.  (Page 12, Argument D).  Applicant asserts that the office has failed to provide a proper motivation to combine Issa and Soni.  Applicant asserts that Issa does not suffer ‘the problem alleged by the office’.
In response to the preceding argument, the provided motivation is that the combination provides a means of automatically tagging/annotating the content without human intervention.  Contrary to applicant's assertion, this is not a conclusory statement, but is in fact a benefit that is achieved via the application of Soni to the device taught by Issa in the manner the office has presented.  The office detailed how the prior art is to be combined, and why this combination is beneficial.  This benefit is achieved via the combination regardless of whether Issa ‘suffers the problem’.  Furthermore, it is explicitly stated by Issa that the annotations may be provided by a human (¶37), while the keyframes may be manually selected (Issa, ¶31) or automatically selected (Issa, ¶32).  The proposed combination improves the device taught by Issa by allowing the automatically selected keyframes to be categorized and annotated automatically, thereby eliminating the need for human interaction.  The office further stated additional benefits that are achieved by automating this process: when humans provide labels, the labels are not consistent and are subject to the individual whims of users, whereas when the annotations are generated automatically, they will be consistently applied.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMANDA WILLIS whose telephone number is (571)270-7691. The examiner can normally be reached Monday-Friday 8am-2pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Tamara Kyle can be reached on 571-272-4241. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/AMANDA L WILLIS/           Primary Examiner, Art Unit 2156