DETAILED ACTION
This office action is in response to the submission of the application on 4/10/2014.
Priority
Applicant’s claim for the benefit of prior-filed application 13041457 (PAT 9189137), filed on 3/7/2011, which further claims the benefit of provisional application 61311524, filed on 3/8/2010, is acknowledged and admitted.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 6/16/2021 has been entered. 

Response to Amendment
In the response filed 6/16/2021, Applicant amends claims 8, 17, and 21-24.  Claims 25-27 have been added.  Accordingly, claims 8-9, 17-18, and 21-27 stand pending.

Response to Arguments
Applicant's arguments filed 6/16/2021 have been fully considered but they are moot in view of new grounds of rejection.
The applicant argues that Schneiderman does not teach “increasing the total duration of the display time of said media portions which correspond to at least one face which is included in said selected cluster of face images, while reducing the total duration of the display time of said media portions which correspond to at least one face which is not included in said selected cluster of face images”.  The examiner respectfully disagrees.  Schneiderman teaches, in paragraphs 12, 15, 36, and 48, that the user can select to view only the person/face-specific video segments of the video and that the video data may be received from a single video source or from multiple video sources.  This is interpreted as increasing the duration of the display time of said media portions which correspond to at least one face, since multiple sources are used to create the face-specific video segments.  Schneiderman further teaches removing people from the face database and video segments.  This is interpreted as reducing the total duration of the display time, since removing them reduces their display time to zero.  Therefore, the examiner is not persuaded.
The applicant also argues that Schneiderman does not teach “wherein the automatically generating further comprises automatically selecting media portion based on at least one of the following factors: camera motion/zoom, video and image quality, action saliency, photo aesthetics, type of voice/sound, facial expression, detected speech, face size, and face location”.  The examiner respectfully disagrees.  Schneiderman teaches, in paragraphs 40, 42, and 47, using face tracks for their face 


Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 8-9, 17-18, 24-25, and 27 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. 
Claims 8-9, 17-18, and 24 now include limitations specifying “increasing the total duration of the display time of said media portions which correspond to at least one face which is included in said selected cluster of face images, while reducing the total duration of the display time of said media portions which correspond to at least one face 
Claim 25 incorporates the argued new matter of claim 8 as stated above, as well as limitations specifying “the overall display time of the media portions which correspond to faces which are included in the cluster of face images and the display time of the media portions which correspond to faces which are not included in the cluster of face images are equal to the duration of the display times before the increasing and the decreasing thereof”.  However, no reference to increasing or decreasing the durations of display times corresponding or not corresponding to faces, or to the overall display time being the same before and after the increasing or decreasing, is found in the applicant’s specification or drawings.
Claim 27 incorporates the argued new matter of claim 8 as stated above, as well as limitations specifying “wherein the increasing and the decreasing is based on maximizing a sum of the editing scores of the selected media portions in the automatically edited video”.  However, no reference to increasing or decreasing the durations of display times corresponding or not corresponding to faces, or to the increasing or decreasing being based on maximizing a sum of the editing scores, is found in the applicant’s specification or drawings.

Claim Rejections - 35 USC § 103
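The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.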


Claims 8-9, 17-18, and 21-25 are rejected under 35 U.S.C. 103 as being unpatentable over Schneiderman et al. (US2008/0080743), hereinafter Schneiderman, in view of Casares et al. (“Simplifying Video Editing Using Metadata”), hereinafter Casares.

Regarding Claim 8:
Schneiderman teaches:
A method comprising: obtaining at least one user-captured video footage (Schneiderman, figure 2, [0012], note receiving video data from user);
automatically computing, by a computer processor, at least one image descriptor from the at least one user-captured video footage (Schneiderman, [0012, 0015], note detecting human faces from the video);
using said at least one image descriptor to compute, by a computer processor, visual metadata describing a plurality of clusters of face images detected in the video footage, wherein each of said clusters comprise face images of a common person (Schneiderman, [0012, 0015], note detecting human faces from video and grouping to unique people based on the faces);
automatically selecting, by a computer processor, from said at least one user-captured video footage a sequence of media portions, wherein the selecting results in at least two selected media portions taken from a common video footage wherein a start 
allowing a user to apply modification operations to said sequence of selected media portions, wherein said modification operations comprise selecting, by the user, at least one cluster of face images from said plurality of clusters (Schneiderman, abstract, [0012, 0015, 0048], note the user can select to view only the person/face-specific video segments, e.g. selecting at least one cluster of face images, of the video);
automatically generating, by said computer processor, and responsive to said modification operations, an automatically edited video, by increasing the total duration of the display time of said media portions which correspond to at least one face which is included in said selected cluster of face images, while reducing the total duration of the display time of said media portions which correspond to at least one face which is not included in said selected cluster of face images (Schneiderman, [0012, 0015, 0036, 0048], note the user can select to view only the person/face-specific video segments of the video and that the video data may be received from a single or multiple video sources.  This is interpreted as increasing the duration of display time of said media portions which correspond to at least one face since multiple sources are used to create the face specific video segments. Note removing people from the face database and 
While Schneiderman teaches automatically generating edited video, Schneiderman does not specifically teach wherein the automatically generating of the automatically edited video is further carried out by applying to the selected media portions effects and transitions, of which, at least some are determined according to the selected media portions and the computed visual metadata. However, Casares is in the same field of endeavor, video editing, and Casares teaches:
automatically generating, by said computer processor, and responsive to said modification operations, an automatically edited video, by filtering out, using said computer processor, at least one of said media portion that corresponds to a face that is not included in the at least one cluster selected by the user (Casares, page 165 3rd column, note the application could incorporate face detection and automatically filter out video with a selected face.  When combined with the previously cited reference this would be for the selected face and video as taught by Schneiderman).
wherein the automatically generating of the automatically edited video is further carried out by applying to the selected media portions effects and transitions, of which, at least some are determined according to the selected media portions and the computed visual metadata (Casares, page 162, column 2 and 3, note applying special effects between segments based on the selected media portions and visual metadata such as a gap in video; page 165 column 3, note using a special effect for a transition based on visual metadata such as a video gap).


Regarding Claim 9:
Schneiderman as modified shows the method as disclosed above:
Schneiderman as modified further teaches:
automatically generating, prior to the modification operations by the user, an initial automatically edited video comprising the selected media portions, and wherein the emphasized faces appear more often in the automatically edited video than in the initial automatically edited video (Schneiderman, figure 2, [0012], note that the video data received from the user is analyzed and video segments based on faces are generated; this is interpreted as an automatically edited video which comprises the selected media portions based on the face detections).

Claim 17 recites substantially the same limitations as claim 8, except that claim 17 is directed to a system comprising a computer processor, a computer readable medium, and a display device (Schneiderman, [0011, 0029], note processor, storage medium, and monitor display screen), while claim 8 is directed to a method. Therefore, claim 17 is rejected under the same rationale set forth for claim 8.

Claim 18 recites substantially the same limitations as claim 9, except that claim 18 is directed to a system comprising a computer processor, a computer readable medium, and a display device (Schneiderman, [0011, 0029], note processor, storage medium, and monitor display screen), while claim 9 is directed to a method. Therefore, claim 18 is rejected under the same rationale set forth for claim 9.

Regarding Claim 21:
Schneiderman teaches:
A method comprising: obtaining at least one user-captured video footage (Schneiderman, figure 2, [0012], note receiving video data from user);
automatically computing at least one image descriptor from the at least one user-captured video footage (Schneiderman, [0012, 0015], note detecting human faces from the video);
using said at least one image descriptor to compute visual metadata describing a plurality of clusters of face images detected in the video footage, wherein each of said clusters comprise face images of a common person (Schneiderman, [0012, 0015], note detecting human faces from video and grouping to unique people based on the faces);
automatically selecting from said at least one user-captured video footage a sequence of media portions, wherein the selecting results in at least two selected media portions taken from a common video footage, wherein a start time and an end time of each said selected media portion are determined based on said visual meta data (Schneiderman, [0012, 0015], note detecting human faces from video, note grouping found faces to unique people, note grouping video segments which the individual is 
selecting a set of single representative images, wherein each single representative image corresponds to one of said clusters (Schneiderman, figure 5, [0050], note thumbnail images of person from the clusters); 
displaying a user with the set of single representative images (Schneiderman, figure 5, [0050], note thumbnail images of person from the clusters are displayed to the user); 
allowing the user to select at least one of the set of single representative images (Schneiderman, abstract, [0012, 0015, 0048-0049], note the user can select to view only the person/face-specific video segments, e.g. selecting at least one cluster of face images, of the video); and 
automatically generating, responsive to said user selection automatically edited video that emphasizes the at least one selected cluster over the rest of the clusters (Schneiderman, abstract, [0012, 0015, 0048], note the user can select to view only the person/face-specific video segments of the video)
wherein the automatically generating further comprises automatically selecting media portion based on at least one of the following factors: camera motion/zoom, video and image quality, action saliency, photo aesthetics, type of voice/sound, facial expression, detected speech, face size, and face location (Schneiderman, [0040, 0042, 0047], note the use of “face tracks” for face association which uses a description of 
While Schneiderman teaches automatically generating edited video, Schneiderman does not specifically teach wherein the automatically generating of the automatically edited video is further carried out by applying to the selected media portions effects and transitions, of which, at least some are determined according to the selected media portions and the computed visual metadata. However, Casares is in the same field of endeavor, video editing, and Casares teaches:
automatically generating, responsive to said user selection automatically edited video that emphasizes the at least one selected cluster over the rest of the clusters (Casares, page 165 3rd column, note the application could incorporate face detection and automatically filter out video with a selected face.  When combined with the previously cited reference this would be for the selected face and video as taught by Schneiderman).
wherein the automatically generating of the automatically edited video is further carried out by applying to the selected media portions effects and transitions, of which, at least some are determined according to the selected media portions and the computed visual metadata (Casares, page 162, column 2 and 3, note applying special effects between segments based on the selected media portions and visual metadata such as a gap in video; page 165 column 3, note using a special effect for a transition based on visual metadata such as a video gap).
It would have been obvious to one of ordinary skill in the art before the effective date of filing to modify the cited references to incorporate the teachings of Casares as 

Regarding Claim 22:
Schneiderman teaches:
A method comprising: obtaining at least two user-captured video footages (Schneiderman, figure 2, [0012], note receiving video data from user);
automatically computing at least one image descriptor from the at least two user-captured video footages (Schneiderman, [0012, 0015], note detecting human faces from the video);
using said at least one image descriptor to compute visual metadata describing a plurality of clusters of face images detected in the video footages, wherein each of said clusters comprises face images of a common person, and wherein the clusters are generated by computing visual similarities between said face images in said at least two user-captured video footages (Schneiderman, [0012, 0015], note detecting human faces from video and grouping to unique people based on the faces); 
automatically selecting from said at least two user-captured video footages a sequence of media portions, wherein the selecting results in at least two selected media portions taken from a common video footage, and wherein a start time and an end time of each said selected media portion are determined based on said visual meta data (Schneiderman, [0012, 0015], note detecting human faces from video, note grouping found faces to unique people, note grouping video segments which the individual is present into separate indices and since these segments are only the portions which the 
allowing a user to apply modification operations to said sequence of selected media portions, wherein said modification operations comprise selecting at least one cluster of face images from said plurality of clusters (Schneiderman, abstract, [0012, 0015, 0048], note the user can select to view only the person/face-specific video segments, e.g. selecting at least one cluster of face images, of the video); and 
automatically generating, responsive to said modification operations, an automatically edited video that emphasizes the at least one selected cluster over the rest of the clusters (Schneiderman, abstract, [0012, 0015, 0048], note the user can select to view only the person/face-specific video segments of the video), wherein said emphasizing is based on at least one of: camera motion/zoom, video and image quality, action saliency, photo aesthetics, type of voice/sound, facial expression, detected objects, detected speech, face size, and face location (Schneiderman, [0040, 0042, 0047], note the use of “face tracks” for face association which uses a description of motion that includes position, e.g. face location, as well as color signature of a person, e.g. photo aesthetics; note the face mapping uses face location and size).
While Schneiderman teaches automatically generating edited video, Schneiderman does not specifically teach wherein the automatically generating of the automatically edited video is further carried out by applying to the selected media portions effects and transitions, of which, at least some are determined according to the selected media portions and the computed visual metadata. However, Casares is in the same field of endeavor, video editing, and Casares teaches:
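automatically generating, responsive to said modification operations, an automatically edited video that emphasizes the at least one selected cluster over the rest of the clusters (Casares, page 165 3rd column, note the application could incorporate face detection and automatically filter out video with a selected face.  When combined with the previously cited reference this would be for the selected face and video as taught by Schneiderman).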
wherein the automatically generating of the automatically edited video is further carried out by applying to the selected media portions effects and transitions, of which, at least some are determined according to the selected media portions and the computed visual metadata (Casares, page 162, column 2 and 3, note applying special effects between segments based on the selected media portions and visual metadata such as a gap in video; page 165 column 3, note using a special effect for a transition based on visual metadata such as a video gap).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Schneiderman to incorporate the teachings of Casares because this would improve the ease and efficiency of video editing (abstract and introduction).

Regarding Claim 23:
Schneiderman teaches:
A method comprising: obtaining at least one user-captured video footage (Schneiderman, figure 2, [0012], note receiving video data from user);
automatically computing at least one image descriptor from the at least one user-captured video footage (Schneiderman, [0012, 0015], note detecting human faces from the video);

using said at least one image descriptor to compute visual metadata describing a plurality of clusters of face images detected in the video footage, wherein each of said clusters comprise face images of a common person (Schneiderman, [0012, 0015], note detecting human faces from video and grouping to unique people based on the faces); 
automatically selecting from said at least one user-captured video footage a sequence of media portions, wherein the selecting results in at least two selected media portions taken from a common video footage wherein a start time and an end time of each said selected media portion are determined based on said visual meta data (Schneiderman, [0012, 0015], note detecting human faces from video, grouping the found faces to unique people, and grouping the video segments in which the individual is present into separate indices; since these segments are only the portions in which the person/face is present, the start time and the end time are automatically determined based on the visual metadata, e.g. face);
automatically selecting a subset of clusters from said clusters, wherein the subset of clusters is associated with faces that have a representation in the video footage (Schneiderman, figure 5, [0050, 0055], note a subset of clusters are selected to be displayed to the user with the ability to sort); wherein the representation is affected by at least one of: camera motion/zoom, video and image quality, action saliency, photo aesthetics, type of voice/sound, facial expression, detected objects, detected speech, face size, and face location (Schneiderman, [0040, 0042, 0047], note the use of “face 
displaying said subset of clusters to a user (Schneiderman, figure 5, [0050, 0055], note the subset is displayed to the user); 
allowing the user to select at least one cluster of face images from said subset of clusters (Schneiderman, abstract, [0012, 0015, 0048], note the user can select to view only the person/face-specific video segments, e.g. selecting at least one cluster of face images, of the video); and 
automatically generating, responsive to said user selection, an automatically edited video that emphasizes the at least one selected cluster over the rest of the clusters (Schneiderman, abstract, [0012, 0015, 0048], note the user can select to view only the person/face-specific video segments of the video).
While Schneiderman teaches automatically generating edited video, Schneiderman does not specifically teach wherein the automatically generating of the automatically edited video is further carried out by applying to the selected media portions effects and transitions, of which, at least some are determined according to the selected media portions and the computed visual metadata. However, Casares is in the same field of endeavor, video editing, and Casares teaches:
automatically generating, responsive to said user selection, an automatically edited video that emphasizes the at least one selected cluster over the rest of the clusters (Casares, page 165 3rd column, note the application could incorporate face detection and automatically filter out video with a selected face.  When combined with 
wherein the automatically generating of the automatically edited video is further carried out by applying to the selected media portions effects and transitions, of which, at least some are determined according to the selected media portions and the computed visual metadata (Casares, page 162, column 2 and 3, note applying special effects between segments based on the selected media portions and visual metadata such as a gap in video; page 165 column 3, note using a special effect for a transition based on visual metadata such as a video gap).
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Schneiderman to incorporate the teachings of Casares because this would improve the ease and efficiency of video editing (abstract and introduction).

Regarding Claim 24:
Schneiderman teaches:
A method comprising: obtaining at least one user-captured video footage (Schneiderman, figure 2, [0012], note receiving video data from user);
automatically computing, by a computer processor, at least one image descriptor from the at least one user-captured video footage (Schneiderman, [0012, 0015], note detecting human faces from the video);
using said at least one image descriptor to compute, by a computer processor, visual metadata describing a plurality of clusters of face images detected in the video 
automatically selecting, by a computer processor, from said at least one user-captured video footage a sequence of media portions, wherein the selecting results in at least two selected media portions taken from a common video footage, wherein a start time and an end time of each said selected media portion are determined based on said visual meta data (Schneiderman, [0012, 0015], note detecting human faces from video, grouping the found faces to unique people, and grouping the video segments in which the individual is present into separate indices; since these segments are only the portions in which the person/face is present, the start time and the end time are automatically determined based on the visual metadata, e.g. face);
allowing a user to apply modification operations to said sequence of selected media portions, wherein said modification operations comprise selecting, by the user, at least one cluster of face images from said plurality of clusters (Schneiderman, abstract, [0012, 0015, 0048], note the user can select to view only the person/face-specific video segments, e.g. selecting at least one cluster of face images, of the video);
automatically generating, by said computer processor, and responsive to said modification operations, an automatically edited video, by increasing the total duration of the display time of said media portions which correspond to at least one face which is included in said selected cluster of face images, while reducing the total duration of the display time of said media portions which correspond to at least one face which is not included in said selected cluster of face images (Schneiderman, [0012, 0015, 0036, 
While Schneiderman teaches automatically generating edited video, Schneiderman does not specifically teach wherein the automatically edited video, the selected media portions are synchronized with a soundtrack added to the automatically edited video. However, Casares is in the same field of endeavor, video editing, and Casares teaches:
automatically generating, by said computer processor, and responsive to said modification operations, an automatically edited video, by filtering out, using said computer processor, at least one of said media portion that corresponds to a face that is not included in the at least one cluster selected by the user (Casares, page 165 3rd column, note the application could incorporate face detection and automatically filter out video with a selected face.  When combined with the previously cited reference this would be for the selected face and video as taught by Schneiderman).
wherein the automatically edited video, the selected media portions are synchronized with a soundtrack added to the automatically edited video (Casares, page 162, column 2 and 3, note applying special effects, such as audio synchronization, between segments based on the selected media portions and metadata; page 165 
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Schneiderman to incorporate the teachings of Casares because this would improve the ease and efficiency of video editing (abstract and introduction).

Regarding Claim 25:
Schneiderman as modified shows the method as disclosed above:
Schneiderman as modified further teaches:
wherein in the automatically edited video, the overall display time of the media portions which correspond to faces which are included in the cluster of face images and the display time of the media portions which correspond to faces which are not included in the cluster of face images are equal to the duration of the display times before the increasing and the decreasing thereof (Schneiderman, [0012, 0015, 0036, 0048], note that the automatically edited video consists of video segments corresponding to faces included in the cluster, and that the lengths of those segments are the same both before and after they are indexed into the automatically edited video or the faces not included are removed from the video).

Claim Rejections - 35 USC § 103
Claims 26-27 are rejected under 35 U.S.C. 103 as being unpatentable over Schneiderman, in view of Casares and Girgensohn et al. (US20060288288), hereinafter Girgensohn.

Regarding Claim 26:
Schneiderman as modified shows the method as disclosed above:
While Schneiderman teaches automatically generating edited video, Schneiderman does not specifically teach wherein the action saliency is calculated as a prediction that a specified point of time in the at least one selected media portion is surprising given other events appearing in the at least one user-captured video footage. However, Girgensohn is in the same field of endeavor, video modifications, and Girgensohn teaches:
wherein the action saliency is calculated as a prediction that a specified point of time in the at least one selected media portion is surprising given other events appearing in the at least one user-captured video footage (Girgensohn, [0027-0029], note detecting activities by using other events in the video.  It is also noted that this limitation is a contingent limitation, since it concerns a step, i.e. action saliency, that is not required by the parent method claim; under the broadest reasonable interpretation, as explained in section 2111.04 (II) of the MPEP, it could therefore be interpreted as a contingent limitation that would not be required to be performed).
It would have been obvious to one of ordinary skill in the art before the effective date of filing to modify the cited references to incorporate the teachings of Girgensohn 

Regarding Claim 27:
Schneiderman as modified shows the method as disclosed above:
While Schneiderman teaches automatically generating edited video, Schneiderman does not specifically teach allocating an editing score to each one of the selected media portions, wherein the editing score is independent of an appearing or clustering of a face in the selected media portion, and wherein the increasing and the decreasing is based on maximizing a sum of the editing scores of the selected media portions in the automatically edited video. However, Girgensohn is in the same field of endeavor, video modifications, and Girgensohn teaches:
allocating an editing score to each one of the selected media portions, wherein the editing score is independent of an appearing or clustering of a face in the selected media portion, and wherein the increasing and the decreasing is based on maximizing a sum of the editing scores of the selected media portions in the automatically edited video (Girgensohn, claim 1, [0029], note that, once a measure of interest has been computed, the importance score with a moving average above a threshold is used to select the sequences.  When combined with the previous references, this would be for generating the video segments as taught by Schneiderman and Casares).
It would have been obvious to one of ordinary skill in the art before the effective date of filing to modify the cited references to incorporate the teachings of Girgensohn .

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Singer et al. (US20110142420) teaches transforming and editing video files; Ishizaka (US2010/0026842) teaches face detection; Toyama (US738508) teaches processing video segments based on points of interest; Trivedi (US20060187305) teaches face recognition of persons in videos; Ayaki (US2007/0159533) teaches face extraction and recognition.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHN J MORRIS whose telephone number is (571)272-3314. The examiner can normally be reached M-F, 6:30 AM-2:30 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Neveen Abel-Jalil, can be reached at 571-270-0474. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, 





/JOHN J MORRIS/ Examiner, Art Unit 2152
12/4/2021

/NEVEEN ABEL JALIL/ Supervisory Patent Examiner, Art Unit 2152