Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
Applicant's amendment filed November 2, 2022 has been fully entered and considered. Claims 1-20 remain pending in the application. Applicant’s amendments to the Specification and Claims have overcome each and every objection and the 112(a) and 112(b) rejections previously set forth in the Non-Final Office Action mailed on September 7, 2022. Examiner further acknowledges applicant’s statement of substance of interview.
Applicant's arguments filed November 2, 2022 have been fully considered but are not persuasive.
Independent claim 1 has been amended to include features similar to those previously recited in claim 3 and intervening claim 2. On pages 18-21, Applicants contend that the proposed combination of Wu, Wei, Ancona, and Toshikazu (previously applied in the rejection of claim 3) does not teach or suggest the newly added features of independent claim 1:
"inputting each image frame in the video into the image classification model to obtain a model classification result that indicates the shot type for the each image frame in the video; determining whether there exists a first category mode of a first image group prior to a second image frame, and a second category mode of a second image group subsequent to the second image frame, the first category mode indicating a shot type of the first image group having a maximum quantity of corresponding image frames of the first group, and the second category mode indicating a shot type of the second image group having a maximum quantity of corresponding image frames of the second group ; and based on the first category mode and the second category mode being the same, setting the shot type of the second image frame to the shot type of the first image group," (emphasis added by the Applicants). 

In support of the above argument, Applicants assert that Toshikazu (the reference relied upon by the Examiner for teaching the limitation in question highlighted above) determines the sub-shot types in an entire section based on merging the section with the preceding and following section. Applicants thus conclude that Toshikazu does not teach or suggest setting a sub-shot type for a particular image frame based on image groups prior to and subsequent to the particular image frame having a same type.
The Examiner respectfully submits that Applicants’ arguments are not commensurate with the scope of the claim language. What is Toshikazu’s entire section comprised of but particular image frames? If the sub-shot type of the entire section of frames is merged, then the sub-shot type of each particular frame within the entire section is merged. As Applicants have acknowledged, Toshikazu discloses that the sub-shot types in an entire section are merged with the preceding and following sections. In so doing, the sub-shot types of each of the particular frames within said entire section are changed based on the sub-shot types of the preceding and subsequent sections matching one another. Importantly, any of the particular frames within this entire section can correspond to the claimed second image frame, and the limitation in question would be met. Nothing in the claim language precludes this interpretation.
The Examiner recommends amending the independent claims such that their scope is commensurate with Applicants’ arguments. For example, amending the claim to recite “based on the first category mode and the second category mode being the same, setting the shot type of only the second image frame to the shot type of the first image group” or “a first category mode of a first image group immediately prior to a second image frame, and a second category mode of a second image group immediately subsequent to the second image frame” would appear to capture the features of the subject invention that Applicants argue distinguish over Toshikazu.
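For illustration only, the narrower frame-level reading reflected in the suggested amendments can be sketched as follows. This is a minimal sketch, not the method of the application or of any cited reference: it assumes the first and second image groups are the r frames immediately before and after the second image frame (r borrowed from claim 3), it assumes a tie yields no category mode, and the names `category_mode` and `smooth_frame_types` are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def category_mode(labels: List[str]) -> Optional[str]:
    """Return the shot type with the maximum frame count, or None if empty or tied."""
    if not labels:
        return None
    top = Counter(labels).most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # assumption: a tie means no category mode exists
    return top[0][0]


def smooth_frame_types(labels: List[str], r: int) -> List[str]:
    """Relabel only frame i when the r frames before and the r frames after share a mode."""
    smoothed = list(labels)
    # The first r and last r frames are skipped because they lack a full window.
    for i in range(r, len(labels) - r):
        first_mode = category_mode(labels[i - r:i])           # group immediately prior
        second_mode = category_mode(labels[i + 1:i + r + 1])  # group immediately subsequent
        if first_mode is not None and first_mode == second_mode:
            smoothed[i] = first_mode
    return smoothed
```

Under these assumptions, only the single frame at index i is relabeled in each step, which is the distinction the suggested "only the second image frame" and "immediately prior/subsequent" language would appear to draw.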
Therefore, regarding claims 11 and 20, for which Applicants present the same argument, since claim 1 remains rejected, claims 11 and 20 remain rejected under the same references as discussed above. Accordingly, dependent claims 2-10 and 12-19 remain rejected.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-3, 9-13, 19, 20 are rejected under 35 U.S.C. 103 as being unpatentable over WU et al. (U.S. 2008/0118153), as modified by Wen-Li Wei (Deep-Net Fusion to Classify Shots in Concert Videos) and further in view of ANCONA et al. (US 2005/0074161 A1) and of Karitsuka Toshikazu (foreign reference JP4606278B2).
Regarding claim 1, Wu teaches a method for recognizing a highlight in a video (paragraph 70, lines 1-2 where highlight of Wu can be reasonably recognized as containing a key time point), performed by a computer device (paragraph 75, lines 1-2), the method comprising: obtaining at least one video segment by processing each image frame in the video by an image classification model (paragraph 70 lines 1-4 and 18-20, paragraph 71 lines 1-5, the detection unit processes images each, a highlight is extracted, a highlight is a shot group/video segment recognized by a discrimination model containing an image classification unit to establish a first rule of finding relevance between shots, paragraph 305 line 1-3, rule of relevance of shot types between shots; by BRI of a shot, it is understood to be an image frame or a series of uninterrupted frames of a video, paragraph 74 lines 6-8 indicates that the apparatus includes steps to perform the tasks), the image classification model obtained by training according to a first sample image frame marked with a shot type (paragraph 205, lines 2-6, after shot-cut detecting process, the video is divided into a plurality of shots, and each shot is classified by the shot classification unit 14 in FIG.1, into at least one predate trained type called shot types), wherein each of the at least one video segment comprises at least two consecutive image frames in the video (paragraph 70, line 11-13, a shot group includes at least one shot each, a shot group can be understood as a group of shots and a shot is an image frame or a series of uninterrupted frames) and each of the at least one video segment corresponds to one shot type among a plurality of shot types (paragraph 208, lines 2-4, FIG. 
31, steps S55- S61, shots are each classified into one of the shot type); determining a target video segment in the at least one video segment based on the shot type of the at least one video segment (paragraph 70, lines 16-20, an extraction unit to recognize a shot group as a highlight according to the shot classification model of the first discrimination model FIG. 2 step S4; in paragraph 70,  lines 10-14, the extraction unit is to extract a shot group according to the first rule which governs the relevance between shots; in paragraph 71, lines 6-9, preferably, the first rule means a state obtained from learning shot types transitions for the probabilistic time-series model, this is from a preferred method, any other approach to find relevance between shots can be used to recognize a highlight such as, a single shot type to be a state instead of a transition between shot types); 
Wu does not explicitly teach that the highlight is a key time point. Wu teaches that a highlight is a shot group extracted in accordance with the time-series model (paragraph 71, lines 6-9); as known in the art, a time-series model is based on time-stamped data. Furthermore, the term highlight is used in one embodiment to refer to a corner kick moment of a soccer match (paragraph 100, lines 2-4, FIGS. 10 and 11 depicting a corner kick scene). Therefore, the highlight is analogous to the video segment of interest that contains the key time point. Wu does not teach an image classification model being a machine learning model, or an image detection model being a machine learning model that can detect locations of two objects to determine a key time point.
Wu teaches an image classification model for classifying a video image frame into a certain shot type using an image processing algorithm. Wu does not explicitly disclose an image classification model that is based on machine learning. However, in the same field of image classification, Wei teaches an image classification model being a machine learning model, specifically a probabilistic fusion model termed an error weighted deep cross-correlation model (2nd page, paragraph 2, lines 5-8, the model to perform shot type classification). The model can classify an image based on shot types (2nd page, paragraph 2, lines 10-13, different shot types are described in Table 1).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to modify Wu’s highlight detection to include Wei’s image classification model because such a modification is the result of simple substitution of one known element for another producing a predictable result. More specifically, Wu’s image classification model using an image processing algorithm and Wei’s image classification model using machine learning perform the same general and predictable function of classifying each video image frame into a certain shot type. Since each individual element and its function are shown in the prior art, albeit shown in separate references, the difference between the claimed subject matter and the prior art rests not on any individual element or function but in the very combination itself, that is, in the substitution of Wei’s image classification model using machine learning for Wu’s image classification model using an image processing algorithm. The motivation for the proposed modification would have been to enhance accuracy of shot classification (abstract, Wei).
Furthermore, Wu as modified by Wei does not explicitly disclose an image detection model to obtain the locations of two objects of interest within an image frame. However, in the same field of object detection, ANCONA teaches obtaining a location of a first object and a location of a second object in an image frame by an image detection model (paragraph 19, a system for measurement of relative position of an object with respect to a point of reference; paragraph 22, the system includes a step for recognizing the object in an image using a classifier; paragraph 1, an object can be a ball and the reference point can be a specific line of a field such as the goal plane; paragraph 11, even though the goal plane is a still point of reference within the image frames obtained by the camera, the reference point detection process is part of the system including a machine learning and reference point delimitation), the image detection model being a machine learning model (paragraph 102, lines 1-8, a neural network is used for the classifier) obtained by training according to a sample image frame marked with location of the first object and location of the second object (paragraph 78, acquiring positive and negative example images of the object (ball) for training; paragraph 82, example images of the goal area are used for training); and based on a distance between the first location of the first object and the second location of the second object in the image frame satisfying a preset condition (paragraph 52, lines 5-8, the subsystem includes a camera, and the camera can detect with certainty the goal-scoring event solely when the ball crosses the goal line by a distance at least equal to a certain threshold, which is taken to be the goal-scoring moment; paragraph 102, the monocular subsystem (of the camera) applies classification techniques to detect the object), determining a time point of the image frame as the key time point of the video (paragraph 84, the position of the ball in the image with respect to the position of the goalpost is used to decide a possible goal-scoring event).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to modify Wu in view of Wei to classify each video frame into a certain shot type, divide the video into a video segment of interest based on the shot type, and detect within the video segment a highlight moment by calculating the distance between two objects and treating a distance that meets a certain threshold as a highlight moment, as taught by Ancona, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to robustly and reliably detect a scored goal event in a sport game ([0026], Ancona).
Furthermore, although Wu as modified by Wei in view of Ancona discloses obtaining the at least one video segment through the image classification model, Wu as modified by Wei in view of Ancona does not disclose inputting each image frame in the video into the image classification model to obtain a model classification result that indicates the shot type for the each image frame in the video; determining whether there exists a first category mode of a first image group prior to a second image frame, and a second category mode of a second image group subsequent to the second image frame, the first category mode indicating a shot type of the first image group having a maximum quantity of corresponding image frames of the first group, and the second category mode indicating a shot type of the second image group having a maximum quantity of corresponding image frames of the second group; and based on the first category mode and the second category mode being the same, setting the shot type of the second image frame to the shot type of the first image group. However, Toshikazu discloses inputting each image frame in the video into the image classification model to obtain a model classification result that indicates the shot type for the each image frame in the video (paragraph 70 lines 1-4 and 18-20, paragraph 71 lines 1-5, the detection unit processes images each, a highlight is extracted, a highlight is a shot group/video segment recognized by a discrimination model containing an image classification unit to establish a first rule of finding relevance between shots, paragraph 305 line 1-3, rule of relevance of shot types between shots; by BRI of a shot, it is understood to be an image frame or a series of uninterrupted frames of a video, paragraph 74 lines 6-8 indicates that the apparatus includes steps to perform the tasks); determining whether there exists a first category mode of a first image group prior to a second image frame, and a second category mode of a second image group subsequent to the second image frame (paragraph 59, lines 6-7, a sub-shot type is obtained for the sub-shot section, where the sub-shot type can be understood as the category mode and the sub-shot section is the image group; moreover, [0060] discloses that when a majority is not identified for a minimum sub-shot segment, the identification result of the preceding and following minimum sub-shot sections is adopted and assigned to the current sub-shot segment, the sub-shot type of the identification result being the same; and the majority of a certain sub-shot type present in a minimum sub-shot section is voted to be the sub-shot type of the overall sub-shot segment according to [0059]; since a minimum sub-shot segment includes image frames, any of the frames within the segment can be understood as the second image frame, the preceding group can be understood as the first image group prior to the image frame, and the following group can be understood as the second image group subsequent to the image frame as claimed), the first category mode indicating a shot type of the first image group having a maximum quantity of corresponding image frames of the first group, and the second category mode indicating a shot type of the second image group having a maximum quantity of corresponding image frames of the second group ([0060] discloses that the identification result of the preceding and following minimum sub-shot sections is adopted, which is the majority vote of these groups as discussed previously; moreover, [0059] discloses that the majority vote is the sub-shot type as discussed previously; therefore, the identification result through majority vote of the preceding group can be understood as the first category mode of the first image group, and its maximum quantity of corresponding image frames can be understood as the majority vote of the preceding group; likewise, for the following group, its identification result can be understood as the second category mode and the majority vote can be understood as the maximum quantity as claimed); and based on the first category mode and the second category mode being the same, setting the shot type of the second image frame to the shot type of the first image group (paragraph 61, lines 1-6, the majority decision is made by merging the current sub-shot with the preceding and following sections; paragraph 61, lines 1-3, merging when the type before and after is the same, set the switched point to the sub-shot; therefore, when the minimum sub-shot segment is merged to match the sub-shot type of the identification result, the image frames within it will also be assigned to the result; therefore, this covers the instances in which each of the image frames in the minimum sub-shot segment can be understood as the second image frame and the category is assigned to match the result as claimed).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to combine the teachings of Wu as modified by the teachings of Wei and ANCONA with the teachings of Toshikazu, to obtain shot types for all image frames of the video during the shot type classifying step of Wu using the shot type classification model by Wei. Further, the shot type classified image frames of the video are smoothed using Toshikazu’s smoothing method, which is the majority decision method that smooths out noise based on a majority vote of the sub-shot types of the preceding and following segments, to correct the sub-shot type of each frame of the current segment and thereby correctly identify and divide the video into corresponding segments based on shot type (Toshikazu [0061]). Such a modification is the result of combining prior art elements according to known methods to yield predictable results.
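For illustration only, the section-level majority decision as characterized above can be sketched as follows. This is a minimal sketch based solely on the Examiner's characterization of Toshikazu [0059]-[0061], not a reproduction of the reference: it assumes a "majority" means more than half of the frames in a section, and the function names and the sub-shot type labels used in the usage note are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def majority_type(section: List[str]) -> Optional[str]:
    """Return the type held by more than half of the frames in a section, else None."""
    if not section:
        return None
    label, count = Counter(section).most_common(1)[0]
    return label if count * 2 > len(section) else None


def merge_sections(sections: List[List[str]]) -> List[List[str]]:
    """Assign one type per section; an undecided section adopts matching neighbours."""
    decided = [majority_type(s) for s in sections]
    merged = []
    for k, section in enumerate(sections):
        label = decided[k]
        if label is None and 0 < k < len(sections) - 1:
            before, after = decided[k - 1], decided[k + 1]
            if before is not None and before == after:
                label = before  # preceding and following results are the same
        # Every frame in the section takes the section-level type, if one was found.
        merged.append([label] * len(section) if label is not None else list(section))
    return merged
```

For example, `merge_sections([["pan", "pan", "pan"], ["zoom", "fix"], ["pan", "pan"]])` relabels both frames of the middle section as "pan", so each frame of that section has its type set based on the matching types of the preceding and following groups.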
Regarding claim 2, Wu as modified by Wei in view of Ancona and further in view of Toshikazu teaches the method according to claim 1, wherein the obtaining the at least one video segment through the first discrimination model/image classification model comprises (Wu, paragraph 70 lines 1-4 and 18-20, paragraph 71 lines 1-5, the detection unit processes images each, a highlight is extracted as discussed above in claim 1): dividing the video into the at least one video segment according to the shot type of the each image frame in the video (Wu, paragraph 70, lines 16-20, an extraction unit to recognize a shot group as a highlight according to the shot classification model of the first discrimination model FIG. 2 step S4; in paragraph 70, lines 10-14, the extraction unit is to extract a shot group according to the first rule which governs the relevance between shots; in paragraph 71, lines 6-9, preferably, the first rule means a state obtained from learning shot types transitions for the probabilistic time-series model, this is from a preferred method, any other approach to find relevance between shots can be used to recognize a highlight such as, a single shot type to be a state instead of a transition between shot types). Wu as modified by Wei in view of Ancona does not teach performing a smoothing correction according to the shot type of the each image frame in the video prior to dividing the video into the at least one video segment according to the shot type of the each image frame in the video.
However, Toshikazu teaches smoothing correction according to the shot type of each image frame in the video (paragraph 59, lines 1-9, the sub-shot type is associated with each frame of the video, the video is divided into sub-shots based on the sub-shot types, and noise may be present in some places; therefore, a smoothing method called majority decision is used to smooth out noise based on a majority vote of the sub-shot type; paragraph 34, lines 1-4, explaining that the sub-shot type is based on the camera work and moving object in the video). The reasons for combining the references are the same as those discussed above in conjunction with claim 1.
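For illustration only, the dividing step recited in claim 2 can be sketched as grouping runs of consecutive frames that share a shot type. This is a minimal sketch under stated assumptions: the per-frame shot types are taken as already classified (and, where applicable, smoothed), runs shorter than two frames are discarded to reflect the "at least two consecutive image frames" language of claim 1, and the name `divide_into_segments` is hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def divide_into_segments(labels: List[str], min_frames: int = 2) -> List[Tuple[str, int, int]]:
    """Return (shot_type, start, end_exclusive) for each run of identical shot types."""
    segments = []
    start = 0
    for shot_type, run in groupby(labels):
        length = sum(1 for _ in run)
        if length >= min_frames:  # keep only runs with at least min_frames frames
            segments.append((shot_type, start, start + length))
        start += length
    return segments
```

With the hypothetical per-frame labels `["long", "long", "close", "close", "close", "long"]`, the sketch yields one "long" segment covering frames 0-1 and one "close" segment covering frames 2-4; the trailing single frame forms no segment.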
Regarding claim 3, Wu as modified by Wei, in view of Ancona and Toshikazu teaches the method according to claim 2, wherein the first image group includes r image frames prior to a second image frame, the second image group including r image frames subsequent to the second image frame (Toshikazu, paragraph 60, lines 1-3, the majority decision is performed using the preceding and following sections; even though no particular number of frames is given in this reference, it still covers the claimed instances in which the preceding and following sections share the same length), the second image frame being any one image frame among a plurality of image frames in the video other than the first r frames and the last r frames, where r is an integer greater than or equal to 1 (Toshikazu, paragraph 59, the video is divided into sub-shots based on sub-shot type; therefore this covers instances in which a sub-shot has a single frame, namely when a frame is labelled with a sub-shot type that differs from that of its adjacent frames).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to combine the teachings of Wu as modified by the teachings of Wei and ANCONA with the teachings of Toshikazu, to obtain shot types for all image frames of the video during the shot type classifying step of Wu using the shot type classification model by Wei. Further, the shot type classified image frames of the video are smoothed using Toshikazu’s smoothing method, which is the majority decision method that smooths out noise based on a majority vote of the sub-shot types of the preceding and following segments, to correct the sub-shot type of each frame of the current segment, where the preceding and following segments have the same number of frames and greater than 1, and thereby correctly identify and divide the video into corresponding segments based on shot type (Toshikazu [0061]). Such a modification is the result of combining prior art elements according to known methods to yield predictable results.
Regarding claim 9, Wu as modified by Wei in view of Ancona and Toshikazu teaches a method according to claim 1. ANCONA further teaches that based on the distance between the location of the first object and the location of the second object in the first image frame being less than a predetermined distance threshold (paragraph 52, lines 5-8, the subsystem includes a camera, and the camera can detect with certainty the goal-scoring event solely when the ball crosses the goal line of a distance at least equal to a certain threshold to be goal-scoring moment; paragraph 102 the monocular subsystem (of the camera) applies classification techniques to detect the object), determining the time point of the first image frame in the video as the key time point of the video (paragraph 84, the position of the ball in the image with respect to the position of the goalpost is used to decide a possible goal-scoring event).
Regarding claim 10, Wu as modified by Wei in view of Ancona and  Toshikazu teaches a method according to claim 1. ANCONA further teaches that based on the distance between the location of the first object and the location of the second object in the first image frame being greater than or equal to a predetermined distance threshold (paragraph 52, lines 5-8, the subsystem includes a camera, and the camera can detect with certainty the goal-scoring event solely when the ball crosses the goal line of a distance at least equal to a certain threshold to be goal-scoring moment; paragraph 102 the monocular subsystem applies classification techniques to detect the object), determining that the time point of the first image frame in the video is not the key time point of the video (paragraph 84, the position of the ball in the image with respect to the position of the goalpost is used to decide a possible goal-scoring event).
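For illustration only, the threshold determinations addressed in claims 9 and 10 can be sketched as follows. This is a minimal sketch under stated assumptions: the object locations are taken to be two-dimensional image coordinates, the distance is assumed to be Euclidean, the time point is assumed to be the frame index divided by a frame rate, and the names `is_key_time_point` and `key_time_points` are hypothetical rather than drawn from the application or from Ancona.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def is_key_time_point(first_obj: Point, second_obj: Point, threshold: float) -> bool:
    """True when the distance is less than the threshold (claim 9); otherwise False (claim 10)."""
    return math.dist(first_obj, second_obj) < threshold


def key_time_points(detections: List[Tuple[Point, Point]], fps: float, threshold: float) -> List[float]:
    """Return the time, in seconds, of every frame whose two objects are closer than the threshold."""
    return [
        index / fps
        for index, (first_obj, second_obj) in enumerate(detections)
        if is_key_time_point(first_obj, second_obj, threshold)
    ]
```

A distance exactly equal to the threshold is treated as not being a key time point, consistent with the "greater than or equal to" condition recited in claim 10.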
Regarding claim 11, Wu teaches an apparatus for recognizing a highlight. Wu further discloses at least one memory storing computer program code (paragraph 320, lines 2-4, the programs being stored in ROM or in a storage device, FIG. 45, 402-ROM & 403-RAM), and at least one processor configured to access the at least one memory and operate as instructed by the computer program code (paragraph 320, lines 1-2, a CPU processing unit to perform the processes in accordance with the programs, FIG. 45, 401-CPU), the computer program code comprising: first processing code configured to cause the at least one processor to obtain at least one video segment by processing each frame in the video by an image classification model (paragraph 70 lines 1-4 and 18-20, paragraph 71 lines 1-5, the detection unit processes images each, a highlight is extracted, a highlight is a shot group/video segment recognized by a discrimination model containing an image classification unit to establish a first rule of finding relevance between shots, paragraph 305 line 1-3, rule of relevance of shot types between shots; by BRI of a shot, it is understood to be an image frame or a series of uninterrupted frames of a video, paragraph 74 lines 6-8 indicates that the apparatus includes steps to perform the tasks), the image classification model obtained by training according to a first sample image frame marked with a shot type (paragraph 205, lines 2-6, after shot-cut detecting process, the video is divided into a plurality of shots, and each shot is classified by the shot classification unit 14 in FIG.1, into at least one predate trained type called shot types), wherein each of the at least one video segment comprises at least two consecutive image frames in the video (paragraph 70, line 11-13, a shot group includes at least one shot each, a shot group can be understood as a group of shots and a shot is an image frame or a series of uninterrupted frames) and each of the at least one video segment
corresponds to one shot type among a plurality of shot types (paragraph 208, lines 2-4, FIG. 31, steps S55- S61, shots are each classified into one of the shot type); second processing code configured to cause the at least one processor to determine a target video segment in the at least one video segment based on the shot type of the at least one video segment (paragraph 70, lines 16-20, an extraction unit to recognize a shot group as a highlight according to the shot classification model of the first discrimination model FIG. 2 step S4; in paragraph 70,  lines 10-14, the extraction unit is to extract a shot group according to the first rule which governs the relevance between shots; in paragraph 71, lines 6-9, preferably, the first rule means a state obtained from learning shot types transitions for the probabilistic time-series model, this is from a preferred method, any other approach to find relevance between shots can be used to recognize a highlight such as, a single shot type to be a state instead of a transition between shot types); 
Wu does not explicitly teach that the highlight is a key time point. Wu teaches that a highlight is a shot group extracted in accordance with the time-series model (paragraph 71, lines 6-9); as known in the art, a time-series model is based on time-stamped data. Furthermore, the term highlight is used in one embodiment to refer to a corner kick moment of a soccer match (paragraph 100, lines 2-4, FIGS. 10 and 11 depicting a corner kick scene). Therefore, the highlight is analogous to the video segment of interest that contains the key time point. Wu does not teach an image classification model being a machine learning model, or an image detection model being a machine learning model that can detect locations of two objects to determine a key time point.
Wu teaches an image classification model for classifying a video image frame into a certain shot type using an image processing algorithm. Wu does not explicitly disclose an image classification model that is based on machine learning. However, in the same field of image classification, Wei teaches an image classification model being a machine learning model, specifically a probabilistic fusion model termed an error weighted deep cross-correlation model (2nd page, paragraph 2, lines 5-8, the model to perform shot type classification). The model can classify an image based on shot types (2nd page, paragraph 2, lines 10-13, different shot types are described in Table 1).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to modify Wu’s highlight detection to include Wei’s image classification model because such a modification is the result of simple substitution of one known element for another producing a predictable result. More specifically, Wu’s image classification model using an image processing algorithm and Wei’s image classification model using machine learning perform the same general and predictable function of classifying each video image frame into a certain shot type. Since each individual element and its function are shown in the prior art, albeit shown in separate references, the difference between the claimed subject matter and the prior art rests not on any individual element or function but in the very combination itself, that is, in the substitution of Wei’s image classification model using machine learning for Wu’s image classification model using an image processing algorithm. The motivation for the proposed modification would have been to enhance accuracy of shot classification (abstract, Wei).
Furthermore, Wu as modified by Wei does not explicitly disclose an image detection model to obtain the locations of two objects of interest within an image frame. However, in the same field of object detection, ANCONA teaches obtaining a location of a first object and a location of a second object in an image frame by an image detection model (paragraph 19, a system for measurement of relative position of an object with respect to a point of reference; paragraph 22, the system includes a step for recognizing the object in an image using a classifier; paragraph 1, an object can be a ball and the reference point can be a specific line of a field such as the goal plane; paragraph 11, even though the goal plane is a still point of reference within the image frames obtained by the camera, the reference point detection process is part of the system including a machine learning and reference point delimitation), the image detection model being a machine learning model (paragraph 102, lines 1-8, a neural network is used for the classifier) obtained by training according to a sample image frame marked with location of the first object and location of the second object (paragraph 78, acquiring positive and negative example images of the object (ball) for training; paragraph 82, example images of the goal area are used for training); and determining code configured to cause the at least one processor to, based on a distance between the first location of the first object and the second location of the second object in the image frame satisfying a preset condition (paragraph 52, lines 5-8, the subsystem includes a camera, and the camera can detect with certainty the goal-scoring event solely when the ball crosses the goal line by a distance at least equal to a certain threshold, which is taken to be the goal-scoring moment; paragraph 102, the monocular subsystem (of the camera) applies classification techniques to detect the object), determine a time point of the image frame as the key time point of the video (paragraph 84, the position of the ball in the image with respect to the position of the goalpost is used to decide a possible goal-scoring event).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to modify Wu in view of Wei to classify each video frame into a certain shot type, divide the video into a video segment of interest based on the shot type, and detect a highlight moment within the video segment by determining when the calculated distance between two objects reaches a certain distance threshold, as taught by Ancona, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to robustly and reliably detect a scored goal event in a sport game ([0026], Ancona).
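For illustration only, the distance-based determination discussed above can be sketched as follows. This is a minimal sketch and not a characterization of Ancona's actual implementation: the function name key_time_points, the (frame index, first location, second location) input layout, the frame rate parameter, and the choice of "distance less than or equal to a threshold" as the preset condition are all assumptions made for the example.

```python
import math

def key_time_points(detections, fps, max_distance):
    """Return time points (in seconds) of frames whose two detected
    objects lie within max_distance of each other.

    detections: iterable of (frame_index, (x1, y1), (x2, y2)) tuples,
    assumed to come from some object detector (hypothetical input).
    """
    points = []
    for frame_index, first_loc, second_loc in detections:
        # Euclidean distance between the two object locations.
        distance = math.dist(first_loc, second_loc)
        # Assumed preset condition: distance at or below a threshold.
        if distance <= max_distance:
            points.append(frame_index / fps)
    return points
```

A different preset condition (for example, a distance at least equal to a threshold past a reference line) would only change the comparison inside the loop.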
Furthermore, although Wu as modified by Wei in view of Ancona discloses obtaining the at least one video segment through the image classification model, Wu as modified by Wei in view of Ancona does not disclose inputting each image frame in the video into the image classification model to obtain a model classification result that indicates the shot type for the each image frame in the video; determining whether there exists a first category mode of a first image group prior to a second image frame, and a second category mode of a second image group subsequent to the second image frame, the first category mode indicating a shot type of the first image group having a maximum quantity of corresponding image frames of the first group, and the second category mode indicating a shot type of the second image group having a maximum quantity of corresponding image frames of the second group; and based on the first category mode and the second category mode being the same, setting the shot type of the second image frame to the shot type of the first image group. However, Toshikazu discloses inputting each image frame in the video into the image classification model to obtain a model classification result that indicates the shot type for the each image frame in the video (paragraph 70 lines 1-4 and 18-20, paragraph 71 lines 1-5, the detection unit processes images each, a highlight is extracted, a highlight is a shot group/video segment recognized by a discrimination model containing an image classification unit to establish a first rule of finding relevance between shots, paragraph 305 line 1-3, rule of relevance of shot types between shots; by BRI of a shot, it is understood to be an image frame or a series of uninterrupted frames of a video, paragraph 74 lines 6-8 indicates that the apparatus includes steps to perform the tasks); determining whether there exists a first category mode of a first image group prior to a second image frame, and a second category mode of a second image group subsequent to the second image frame (paragraph 59, lines 6-7, sub-shot type obtained for the sub-shot section, sub-shot type can be understood as category mode and sub-shot section is the image group; moreover, [0060] discloses that when a majority vote approach is not identified for a minimum sub-shot segment, the identification result of the preceding and following sub-shot minimum sections is adopted and assigned to the current sub-shot segment, the sub-shot type of the identification result being the same; and the majority of a certain sub-shot type present in a minimum sub-shot detection is voted to be the sub-shot type of the overall sub-shot segment according to [0059]; since a minimum sub-shot segment includes image frames, any of the frames within the segment can be understood as the second image frame, and the preceding group can be understood as the first image group prior to the image frame and the following group can be understood as the second image group subsequent to the image frame as claimed), the first category mode indicating a shot type of the first image group having a maximum quantity of corresponding image frames of the first group, and the second category mode indicating a shot type of the second image group having a maximum quantity of corresponding image frames of the second group ([0060] discloses the identification result of the preceding and following sub-shot minimum sections is adopted which is the majority vote of these groups as discussed previously; moreover, [0059] discloses the majority vote is the sub-shot type as discussed previously; therefore, the identification result through majority vote of the preceding group can be understood as the first category mode of the first image group and its maximum quantity of corresponding image frames can be understood as the majority vote of the preceding group; likewise, for the following group, its identification result can be understood as the second category mode and the majority vote can be understood as the maximum quantity as claimed); and based on the first category mode and the second category mode being the same, setting the shot type of the second image frame to the shot type of the first image group (paragraph 61, lines 1-6, when the majority decision is made by merging the current sub-shot with the preceding and following sections, paragraph 61, lines 1-3, merging when the type before and after is the same, set the switched point to the sub-shot; therefore, when the minimum sub-shot segment is merged to match the sub-shot type of the identification result, the image frames within it will also be assigned to the result; therefore, this covers the instances in which each of the image frames in the minimum sub-shot segment can be understood as the second image frame and the category is assigned to match the result as claimed).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to combine the teachings of Wu as modified by the teachings of Wei and ANCONA with the teachings of Toshikazu, to obtain shot types for all image frames of the video during the shot type classifying step of Wu using the shot type classification model by Wei. Further, the shot type classified image frames of the video are smoothed using Toshikazu’s smoothing method which is the majority decision method to smooth out noises based on majority vote of the sub-shot type of preceding and following segments to correct the sub-shot types each frame of the current segment to correctly identify and divide the video into corresponding segment based on shot type (Toshikazu [0061]).  Such a modification is the result of combining prior art elements according to known methods to yield predictable results.
Regarding claim 12, Wu as modified by Wei in view of ANCONA and further in view of Toshikazu teaches the method according to claim 11, wherein the first processing code is configured to cause the at least one processor to: divide the video into the at least one video segment according to the shot type of the each image frame in the video (Wu’s paragraph 70, lines 16-20, an extraction unit to recognize a shot group as a highlight according to the shot classification model of the first discrimination model FIG. 2 step S4; in paragraph 70, lines 10-14, the extraction unit is to extract a shot group according to the first rule which governs the relevance between shots; in paragraph 71, lines 6-9, preferably, the first rule means a state obtained from learning shot types transitions for the probabilistic time-series model, this is from a preferred method, any other approach to find relevance between shots can be used to recognize a highlight such as, a single shot type to be a state instead of a transition between shot types). Wu as modified by Wei and ANCONA does not teach performing a smoothing correction according to the shot type of the each image frame in the video prior to dividing the video into the at least one video segment according to the shot type of the each image frame in the video.
However, Toshikazu teaches smoothing correction according to the shot type of each image frame in the video (paragraph 59, lines 1-9, the sub-shot type is associated with each of the frame of the video, divided into sub-shots based on the sub-shot types, noise may be present in some places; therefore, a smoothing method called majority decision is used to smooth out noises based on majority vote of the sub-shot type. Paragraph 34 lines 1-4, explanation sub-shot type is based on the camera work and moving object in the video). The reasons for combining the references are the same as those discussed above in conjunction with claim 11.
Regarding claim 13, Wu as modified by Wei, in view of Ancona and Toshikazu teaches the method according to claim 12, wherein the first image group includes r image frames prior to a second image frame, the second image group including r image frames subsequent to the second image frame (Toshikazu, paragraph 60, lines 1-3, the majority decision is performed using the preceding and following sections, even though there is no particular number of frames give in this invention, it still covers the limitation of the current application’s invention of the instances when the preceding and following sections share the same length), the second image frame being any one image frame among a plurality of image frames in the video other than the first r frames and the last r frames, where r is an integer greater than or equal to 1 (Toshikazu, paragraph 59, the video is divided into sub-shot based on sub-shot type, therefore this covers instances when the sub-shot has a single frame when there is a frame that got labelled to a sub-shot type that is different to its adjacent frames).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to combine the teachings of Wu as modified by the teachings of Wei and ANCONA with the teachings of Toshikazu, to obtain shot types for all image frames of the video during the shot type classifying step of Wu using the shot type classification model by Wei. Further, the shot type classified image frames of the video are smoothed using Toshikazu’s smoothing method which is the majority decision method to smooth out noises based on majority vote of the sub-shot type of preceding and following segments to correct the sub-shot types each frame of the current segment, and the preceding and following segments have the same number of frames and greater than 1, to correctly identify and divide the video into corresponding segment based on shot type (Toshikazu [0061]).  Such a modification is the result of combining prior art elements according to known methods to yield predictable results
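For illustration only, the smoothing correction discussed above, in which the category mode of the r frames before a frame is compared with the category mode of the r frames after it, can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Toshikazu or from the application: the function names, the treatment of a tie as "no category mode exists", and the use of the uncorrected labels for every window are assumptions made for the example.

```python
from collections import Counter

def category_mode(labels):
    """Return the single most frequent label, or None if the group is
    empty or the top count is tied (assumed to mean no mode exists)."""
    top = Counter(labels).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None
    return top[0][0]

def smooth_shot_types(labels, r):
    """Smooth per-frame shot types using r frames on each side.

    The first r and last r frames are left unchanged because they do
    not have a full group on both sides.
    """
    smoothed = list(labels)
    for i in range(r, len(labels) - r):
        before = category_mode(labels[i - r:i])        # first image group
        after = category_mode(labels[i + 1:i + r + 1])  # second image group
        # Only relabel when both modes exist and agree.
        if before is not None and before == after:
            smoothed[i] = before
    return smoothed
```

With r = 2, the sequence far, far, close, far, far is corrected so that the isolated "close" frame takes the "far" type shared by both neighboring groups.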
Regarding claim 19, Wu as modified by Wei and further in view of Ancona and Toshikazu teaches the apparatus according to claim 11. Ancona further teaches wherein the determining code is further configured to cause the at least one processor to (paragraph 19, a system for measurement of …., explanation: a system consists of software and hardware components), the image detection model being a machine learning model (paragraph 102, lines 1-8, a neural network is used for the classifier) obtained by training according to a sample image frame marked with location of the first object and location of the second object (paragraph 78, lines 1-3 acquiring positive and negative example images of the object (ball) for training; paragraph 82, lines 1-4 example images of the goal area are used for training); and determining code configured to cause the at least one processor to, based on a distance between the first location of the first object and the second location of the second object in the image frame satisfying a preset condition, determine a time point of the image frame as the key time point of the video (paragraph 107, lines 1-3, the computing means are alike those thereto in connection with the monocular systems (camera system), paragraph 52, lines 5-8, the subsystem of at least a camera, and the camera can detect with certainty the goal-scoring event solely when the ball crosses the goal line by a distance at least equal to a certain threshold, which is treated as the goal-scoring moment; paragraph 102, lines 1-6 the monocular subsystem (of the camera) applies classification techniques to detect the object), determining a time point of the image frame as the key time point of the video (paragraph 84, the position of the ball in the image with respect to the position of the goalpost is used to decide a possible goal-scoring event).
Thus, it would have been obvious for a person of ordinary skill in the art at the effective filing date to combine the teachings of Wu as modified by Wei with the teachings of ANCONA to obtain an apparatus with a storage device to store programs and a processing unit configured to access the storage device and execute the programs to obtain a video segment of interest (a highlight), and the output from Wu and Wei, which is highlight can be inputted in ANCONA’s machine learning model to detect locations of two designated objects to determine the distance between them for recognition of an important moment. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to robustly and reliably detect a scored goal event in a sport game ([0026], Ancona).
Regarding claim 20, Wu teaches a non-transitory computer-readable storage medium storing at least one computer program code configured to cause a computer processor to: (paragraph 325, lines 1-4, the program storage medium that accommodates computer-executable programs to be installed into the computer, FIG. 45, 408-storage device, 411-removable media). The steps comprise: obtain at least one video segment by processing each image frame in the video by an image classification model (paragraph 70 lines 1-4 and 18-20, paragraph 71 lines 1-5, the detection unit processes images each, a highlight is extracted, a highlight is a shot group/video segment recognized by a discrimination model containing an image classification unit to establish a first rule of finding relevance between shots, paragraph 305 line 1-3, rule of relevance of shot types between shots; by BRI of a shot, it is understood to be an image frame or a series of uninterrupted frames of a video, paragraph 74 lines 6-8 indicates that the apparatus includes steps to perform the tasks), the image classification model obtained by training according to a first sample image frame marked with a shot type (paragraph 205, lines 2-6, after shot-cut detecting process, the video is divided into a plurality of shots, and each shot is classified by the shot classification unit 14 in FIG. 1, into at least one predate trained type called shot types), wherein each of the at least one video segment comprises at least two consecutive image frames in the video (paragraph 70, line 11-13, a shot group includes at least one shot each, a shot group can be understood as a group of shots and a shot is an image frame or a series of uninterrupted frames) and each of the at least one video segment corresponds to one shot type among a plurality of shot types (paragraph 208, lines 2-4, FIG. 31, steps S55-S61, shots are each classified into one of the shot types); determine a target video segment in the at least one video segment based on the shot type of the at least one video segment (paragraph 70, lines 16-20, an extraction unit to recognize a shot group as a highlight according to the shot classification model of the first discrimination model FIG. 2 step S4; in paragraph 70, lines 10-14, the extraction unit is to extract a shot group according to the first rule which governs the relevance between shots; in paragraph 71, lines 6-9, preferably, the first rule means a state obtained from learning shot types transitions for the probabilistic time-series model, this is from a preferred method, any other approach to find relevance between shots can be used to recognize a highlight such as, a single shot type to be a state instead of a transition between shot types).
Wu does not explicitly teach that the highlight is a key time point. Wu teaches a highlight is a shot group extracted in accordance with the time-series model (paragraph 71, lines 6-9,), by knowledge in the art, time-series model is based on time-stamped data. Furthermore, the term highlight is used in one embodiment as a corner kick moment of a soccer match (paragraph 100, line 2-4, FIGS 10 and 11 depicting a corner kick scene). Therefore, highlight is analogous to the video segment of interest that contains the key time point. Wu does not teach about an image classification model being a machine learning model, and an image detection model being a machine learning model that can detect locations of two objects to determine a key time point.
Wu teaches an image classification model for classifying video image frame to a certain shot type using image processing algorithm. Wu does not explicitly disclose to teach an image classification model that is based on machine learning. However, in the same field of image classification, Wei teaches an image classification model being a machine learning model, specifically a probabilistic fusion model, termed as error weighted deep cross-correlation model (2nd page, paragraph 2, lines 5-8, the model to perform shot type classification). The model can classify an image to base on shot types (2nd page, paragraph 2, lines 10-13, different shot types are described in table 1).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to modify Wu’s highlight detection to include Wei’s image classification model, being stored in a non-transitory storage medium, because such a modification is the result of simple substitution of one known element for another producing a predictable result. More specifically, Wu’s image classification model using image process algorithm and Wei’s image classification model using machine learning perform the same general and predictable function of classifying each video image frame into a certain shot type. Since each individual element and its function are shown in the prior art, albeit shown in separate references, the difference between the claimed subject matter and the prior art rests not on any individual element or function but in the very combination itself - that is in the substitution of Wu’s image classification model using image process algorithm by replacing it with Wei’s image classification model using machine learning. The motivation for the proposed modification would have been to enhance accuracy of shot classification (abstract, Wei)
Furthermore, Wu as modified by Wei does not explicitly teach an image detection model that obtains the locations of two objects of interest within an image frame. However, in the same field of object detection, ANCONA teaches obtaining a location of a first object and a location of a second object in an image frame by an image detection model (paragraph 19, a system for measurement of relative position of an object with respect to a point of reference; the system includes paragraph 22 which is a step for recognizing the object in an image using a classifier; paragraph 1 an object can be a ball and the reference point can be a specific line of a field such as the goal plane; paragraph 11, even though the goal plane is a still point of reference within the image frames obtained by the camera, the reference point detection process is part of the system including a machine learning and reference point delimitation), the image detection model being a machine learning model (paragraph 102, lines 1-8, a neural network is used for the classifier) obtained by training according to a sample image frame marked with location of the first object and location of the second object (paragraph 78, acquiring positive and negative example images of the object (ball) for training; paragraph 82, example images of the goal area are used for training); and based on a distance between the first location of the first object and the second location of the second object in the image frame satisfying a preset condition (paragraph 52, lines 5-8, the subsystem includes a camera, and the camera can detect with certainty the goal-scoring event solely when the ball crosses the goal line by a distance at least equal to a certain threshold, which is treated as the goal-scoring moment; paragraph 102 the monocular subsystem (of the camera) applies classification techniques to detect the object), determining a time point of the image frame as the key time point of the video (paragraph 84, the position of the ball in the image with respect to the position of the goalpost is used to decide a possible goal-scoring event).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to modify Wu in view of Wei to classify each video frame into a certain shot type, divide the video into a video segment of interest based on the shot type, and detect a highlight moment within the video segment by determining when the calculated distance between two objects reaches a certain distance threshold, as taught by Ancona, to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to robustly and reliably detect a scored goal event in a sport game ([0026], Ancona).
Furthermore, although Wu as modified by Wei in view of Ancona discloses obtaining the at least one video segment through the image classification model, Wu as modified by Wei in view of Ancona does not disclose inputting each image frame in the video into the image classification model to obtain a model classification result that indicates the shot type for the each image frame in the video; determining whether there exists a first category mode of a first image group prior to a second image frame, and a second category mode of a second image group subsequent to the second image frame, the first category mode indicating a shot type of the first image group having a maximum quantity of corresponding image frames of the first group, and the second category mode indicating a shot type of the second image group having a maximum quantity of corresponding image frames of the second group; and based on the first category mode and the second category mode being the same, setting the shot type of the second image frame to the shot type of the first image group. However, Toshikazu discloses inputting each image frame in the video into the image classification model to obtain a model classification result that indicates the shot type for the each image frame in the video (paragraph 70 lines 1-4 and 18-20, paragraph 71 lines 1-5, the detection unit processes images each, a highlight is extracted, a highlight is a shot group/video segment recognized by a discrimination model containing an image classification unit to establish a first rule of finding relevance between shots, paragraph 305 line 1-3, rule of relevance of shot types between shots; by BRI of a shot, it is understood to be an image frame or a series of uninterrupted frames of a video, paragraph 74 lines 6-8 indicates that the apparatus includes steps to perform the tasks); determining whether there exists a first category mode of a first image group prior to a second image frame, and a second category mode of a second image group subsequent to the second image frame (paragraph 59, lines 6-7, sub-shot type obtained for the sub-shot section, sub-shot type can be understood as category mode and sub-shot section is the image group; moreover, [0060] discloses that when a majority vote approach is not identified for a minimum sub-shot segment, the identification result of the preceding and following sub-shot minimum sections is adopted and assigned to the current sub-shot segment, the sub-shot type of the identification result being the same; and the majority of a certain sub-shot type present in a minimum sub-shot detection is voted to be the sub-shot type of the overall sub-shot segment according to [0059]; since a minimum sub-shot segment includes image frames, any of the frames within the segment can be understood as the second image frame, and the preceding group can be understood as the first image group prior to the image frame and the following group can be understood as the second image group subsequent to the image frame as claimed), the first category mode indicating a shot type of the first image group having a maximum quantity of corresponding image frames of the first group, and the second category mode indicating a shot type of the second image group having a maximum quantity of corresponding image frames of the second group ([0060] discloses the identification result of the preceding and following sub-shot minimum sections is adopted which is the majority vote of these groups as discussed previously; moreover, [0059] discloses the majority vote is the sub-shot type as discussed previously; therefore, the identification result through majority vote of the preceding group can be understood as the first category mode of the first image group and its maximum quantity of corresponding image frames can be understood as the majority vote of the preceding group; likewise, for the following group, its identification result can be understood as the second category mode and the majority vote can be understood as the maximum quantity as claimed); and based on the first category mode and the second category mode being the same, setting the shot type of the second image frame to the shot type of the first image group (paragraph 61, lines 1-6, when the majority decision is made by merging the current sub-shot with the preceding and following sections, paragraph 61, lines 1-3, merging when the type before and after is the same, set the switched point to the sub-shot; therefore, when the minimum sub-shot segment is merged to match the sub-shot type of the identification result, the image frames within it will also be assigned to the result; therefore, this covers the instances in which each of the image frames in the minimum sub-shot segment can be understood as the second image frame and the category is assigned to match the result as claimed).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to combine the teachings of Wu as modified by the teachings of Wei and ANCONA with the teachings of Toshikazu, to obtain shot types for all image frames of the video during the shot type classifying step of Wu using the shot type classification model by Wei. Further, the shot type classified image frames of the video are smoothed using Toshikazu’s smoothing method which is the majority decision method to smooth out noises based on majority vote of the sub-shot type of preceding and following segments to correct the sub-shot types each frame of the current segment to correctly identify and divide the video into corresponding segment based on shot type (Toshikazu [0061]).  Such a modification is the result of combining prior art elements according to known methods to yield predictable results.
Claims 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over WU et al. (U.S. 2008/0118153), as modified by Wen-Li Wei (Deep-Net Fusion to Classify Shots in Concert Videos) and in view of ANCONA et al. (US 2005/0074161 A1) and Karitsuka Toshikazu (foreign reference JP4606278B2), and further in view of Baoxin Li (US 20030034996 A1).
Regarding claim 4, in the combination of Wu as modified by Wei and in view of Ancona and Toshikazu, Wu further teaches that dividing the video into the at least one video segment according to the shot type of the each image frame in the video comprises: dividing the video into at least one temporary video segment according to the shot type of the each image frame in the video such that each of the at least one temporary video segment includes image frames belong to the same shot type (paragraph 74 lines 4-6, moving images divided into a plurality of shots, paragraph 74 lines 13-16, based on the first rule which governs relevance between the shots, paragraph 305 line 1-3, rule of relevance of shot types between shots, paragraph 96, lines 2-4, each of the shots is classified into one of the shot types), wherein the shot types of two consecutive temporary video segments are different (paragraph 97, lines 2-5, in one example consecutive shots are picked up in chronological order showing different shot types). Wu does not explicitly teach combining video segments into a larger video segment on the condition that a segment is under a certain length.
Wu as modified by Wei and in view of Ancona and Toshikazu does not explicitly teach combining two adjacent video segments into one single larger segment when the quantity of image frames of either segment is below a certain threshold. However, in the same field of video processing, Li teaches, based on number of image frames in a target temporary video segment being less than a preset threshold (paragraph 50, lines 8-9, time duration is less than a predetermined time period, then perform the merging connection; a time duration inherently corresponds to a number of frames, therefore it can be understood as number of image frames under a certain threshold), combining the target temporary video segment with the temporary video segment previous to the target temporary video segment (paragraph 50, lines 9-10, the two plays should be connected as a single play when the time duration is less than the threshold); and using each remaining temporary video segment in the at least one temporary video segment as the at least one video segment (paragraph 50, lines 10-14, the time period between two plays may be included within the total play, meaning now they are considered as a single play with time durations combined as one).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to combine the teachings of Wu as modified by the teachings of Wei and ANCONA with the teachings that are provided by Toshikazu and Li to divide a moving image into shots based on shot types by Wu’s discrimination model, and then perform the smoothing method provided by Li to combine two plays/shots into one large play when their durations are less than a predetermined time period, which helps smooth the connection between adjacent plays in the video (Li’s paragraph 50 lines 21-23). Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to avoid frequent audio and visual disruptions in the video summary creation ([0050], Li).
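For illustration only, the division into temporary video segments and the merging of a short segment into the previous one, as discussed above, can be sketched as follows. This is a minimal sketch under stated assumptions and does not represent the implementation of Wu, Li, or the application: the function names, the (shot type, start, end) segment layout with an exclusive end index, and the decision to keep a short first segment unchanged are assumptions made for the example.

```python
from itertools import groupby

def split_into_runs(labels):
    """Divide per-frame shot types into temporary segments.

    Each segment is (shot_type, start, end) with end exclusive, and
    consecutive segments always have different shot types.
    """
    runs = []
    start = 0
    for shot_type, group in groupby(labels):
        length = len(list(group))
        runs.append((shot_type, start, start + length))
        start += length
    return runs

def merge_short_segments(runs, min_frames):
    """Combine any segment shorter than min_frames with the segment
    before it; the merged segment keeps the earlier shot type."""
    merged = []
    for shot_type, start, end in runs:
        if merged and end - start < min_frames:
            prev_type, prev_start, _ = merged[-1]
            merged[-1] = (prev_type, prev_start, end)
        else:
            # A short first segment has no predecessor and is kept
            # as-is (assumption for this sketch).
            merged.append((shot_type, start, end))
    return merged
```

Note that after merging, two neighboring segments may end up with the same shot type; this sketch leaves them separate rather than joining them again.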
Regarding claim 14 the apparatus according to claim 12, Wu as modified by Wei and in view of Ancona and Toshikazu, further Wu teaches processing code configured to perform dividing the video into the at least one video segment according to the shot type of the each image frame in the video comprises: dividing the video into at least one temporary video segment according to the shot type of the each image frame in the video such that each of the at least one temporary video segment includes image frames belong to the same shot type (paragraph 326 lines 1-6, the program is to carry out the steps described in the specification, paragraph 74 lines 4-6, moving images divided into a plurality of shots, paragraph 74 lines 13-16, based the first rule which governs relevance between the shots, paragraph 305 line 1-3, rule of relevance of shot types between shots, paragraph 96, lines 2-4, each of the shot is classified into one of the shot types), wherein the shot types of two consecutive temporary video segments are different (paragraph 97, lines 2-5, in one example consecutive shots are picked up in chronological order showing different shot types). 
Wu as modified by Wei and in view of Ancona and Toshikazu does not explicitly disclose processing code configured to combine two adjacent video segments into a single larger segment when the quantity of image frames of either segment is below a certain threshold. In the same field of video processing, Li teaches processing code that performs, based on a number of image frames in a target temporary video segment being less than a preset threshold (paragraph 49, lines 5-7, a system to compute …., paragraph 50, line 1 continuing to the aforementioned technique of the system in paragraph 49; paragraph 50, lines 8-9, when the time duration is less than a predetermined time period, the merging connection is performed), combining the target temporary video segment with the temporary video segment previous to the target temporary video segment (paragraph 50, lines 9-10, the two plays should be connected as a single play when the time duration is less than the threshold; a time duration inherently corresponds to a number of frames, and can therefore be understood as a number of image frames under a certain threshold); and using each remaining temporary video segment in the at least one temporary video segment as the at least one video segment (paragraph 49, lines 5-7, a system to compute …., paragraph 50, line 1 continuing to the aforementioned technique of the system in paragraph 49; by BRI, a system consists of both hardware and software; paragraph 50, lines 10-14, the time period between two plays may be included within the total play, meaning they are now considered a single play with their time durations combined as one).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to combine the teachings of Wu as modified by the teachings of Wei and Ancona with the teachings provided by Toshikazu and Li to obtain a program that can divide a moving image into shots based on shot types using Wu’s discrimination model, and then perform the smoothing method provided by Li to combine two plays/shots into one larger play when their durations are less than a predetermined time period, which helps smooth the connection between adjacent plays in the video (Li’s paragraph 50, lines 21-23). Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to avoid frequent audio and visual disruptions in the video summary creation ([0050], Li).
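For illustration only, the dividing and combining limitations discussed above can be sketched as follows. This is a minimal sketch and is not taken from Wu or Li; the function names `split_by_shot_type` and `merge_short_segments`, the `(shot_type, start, length)` tuple layout, and the `min_frames` parameter are assumptions introduced here. It assumes one shot-type label per image frame is already available.

```python
from itertools import groupby


def split_by_shot_type(shot_types):
    """Group consecutive frames that share a shot type into temporary segments.

    Returns a list of (shot_type, start_index, length) tuples, so two
    consecutive temporary segments always have different shot types.
    """
    segments = []
    start = 0
    for shot_type, run in groupby(shot_types):
        length = len(list(run))
        segments.append((shot_type, start, length))
        start += length
    return segments


def merge_short_segments(segments, min_frames):
    """Combine each segment shorter than min_frames with the segment before it.

    The first segment has no predecessor, so it is always kept as-is.
    """
    merged = []
    for shot_type, start, length in segments:
        if merged and length < min_frames:
            prev_type, prev_start, prev_length = merged[-1]
            # The short segment is absorbed; the previous segment keeps its type.
            merged[-1] = (prev_type, prev_start, prev_length + length)
        else:
            merged.append((shot_type, start, length))
    return merged


labels = ["far", "far", "far", "close", "far", "far",
          "close", "close", "close"]
temporary = split_by_shot_type(labels)
final = merge_short_segments(temporary, min_frames=3)
```

In this sketch the one-frame "close" run and the following two-frame "far" run are both absorbed into the first segment, leaving two segments. Whether adjacent segments of the same type should be re-merged afterwards is a design choice the sketch does not address.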
Claims 5-7, 15-17 are rejected under 35 U.S.C. 103 as being unpatentable over WU et al. (U.S. 2008/0118153), as modified by Wen-Li Wei (Deep-Net Fusion to Classify Shots in Concert Videos) and further in view of ANCONA et al. (US 2005/0074161 A1) and in view of Karitsuka Toshikazu (foreign reference JP4606278B2), and further in view of Ed Gronenschild (“the Accuracy and Reproducibility of a Global Method to Correct for Geometric Image Distortion in the X-ray Imaging Chain”).
Regarding claim 5, Wu as modified by Wei in view of Ancona and Toshikazu teaches the method according to claim 1. Ancona further teaches that the obtaining the first location of the first object and the second location of the second object in the image frame of the target video segment comprises: inputting each image frame in the target video segment into the image detection model to obtain a model detection result that indicates respective temporary locations of the first object and the second object in the image frame of the target video segment (paragraph 19, a system for measurement of relative position of an object with respect to a point of reference; paragraph 22, the system includes a step for recognizing the object in an image using a classifier; paragraph 1, an object can be a ball and the reference point can be a specific line of a field such as the goal plane; paragraph 11, even though the goal plane is a still point of reference within the image frames obtained by the camera, the reference point detection process is part of the system, including machine learning and reference point delimitation). Wu as modified by Wei in view of Ancona and Toshikazu does not explicitly disclose a smoothing correction on the locations of the two objects.
However, in the same field of location detection smoothing, Gronenschild teaches performing smoothing correction on the respective temporary locations of the first object and the second object in the image frame of the target video segment to obtain the respective locations of the first object and the second object in the image frame of the target video segment (page 1880, paragraph 4, lines 20-23, each image was processed to produce sets of detected grid points, lines 26-27, the adjustment method is applied to cancel systematic errors and random error in computed position of the grid points; page 1877, column 2, section 3 paragraph 2, line 5 grid points have certain coordinate positions in space can arrive at a very specific location in the image).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to combine the teachings of Wu as modified by Wei with the teachings of ANCONA and Toshikazu to obtain a location of a first object and a location of a second object detected from image frames of a video and perform smoothing on the locations as taught by Gronenschild to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to eliminate random offset errors (page 1880, paragraph 4, lines 24-27).
Regarding claim 6, Wu as modified by Wei in view of Ancona, Toshikazu and Gronenschild teaches the method according to claim 5. Gronenschild further teaches wherein the performing the smoothing correction on the respective temporary locations of the first object and the second object in the image frame of the target video segment comprises: obtaining temporary locations of a target object in image frames of a third image group and a fourth image group (page 1880, paragraph 4, lines 20-21, each image is processed to produce sets of grid points, and these grid point positions are used for averaging and adjustment), the target object being any one of the first object and the second object (page 1880, section 1, lines 8-10, grid points cover an area of interest that can vary as long as it is within a small area of interest containing the object), the third image group including w image frames previous to a third image frame, the fourth image group including w image frames subsequent to the third image frame, the third image frame being any one of a plurality of image frames in the target video segment other than the first w frames and the last w frames, where w is an integer greater than or equal to 1 (page 1880, paragraph 4, lines 20-21, each image is processed to produce sets of grid points, and the grid point positions of the other images are used for averaging to make the overall grid point position adjustment; this reference can cover the instance where one image within the imaging chain has an offset in its grid point positions that needs to be corrected using the average location of the adjacent image frames, meaning w is 1 for this instance; another instance is when the image is located in the middle of the image chain, where the group preceding the current image and the group following the current image have the same number of images; the reference can also cover the instances where the first w frames and the last w frames do not have an offset in their locations and do not need to be averaged); obtaining an average location that is indicated as an average value of the temporary locations of the target object in the image frames of the third image group and the fourth image group (page 1880, paragraph 4, lines 21-24, an average is obtained to make the adjustment in grid point positions); and correcting the temporary location of the target object in the third image frame according to the average location (page 1880, paragraph 4, lines 21-24, the adjustment is performed so that the average position of the grid points is equal in all images; this reference can include the instance of the current invention where the average position of all preceding and following images is calculated to correct the current image grid position to match the overall average).
Regarding claim 7, Wu as modified by Wei in view of Ancona, Toshikazu and Gronenschild teaches the method according to claim 6. Gronenschild further teaches wherein the correcting the temporary location of the target object in the third image frame according to the average location comprises: obtaining an offset of the temporary location of the target object in the third image frame relative to the average location (page 1880, paragraph 4, lines 14-18, when the image is processed to obtain the grid point positions, a random offset can appear between images); and correcting, in a case that the offset is greater than an offset threshold, the temporary location of the target object in the third image frame to be the average location (page 1880, paragraph 4, lines 21-24, the offset of the images is added to the grid positions so that all grid positions across images share the same average position; this reference covers the instance where an offset value is found between the point position and the preceding and following images, which can be corrected by adding the offset to that position to match the overall average).
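For illustration only, the smoothing correction recited in claims 5-7 (averaging over w frames on each side of a frame and replacing the location when its offset exceeds a threshold) can be sketched as follows. This is a minimal sketch and is not taken from Gronenschild; the function name `smooth_locations`, the use of (x, y) tuples, and the use of Euclidean distance as the offset measure are assumptions introduced here.

```python
import math


def smooth_locations(locations, w, offset_threshold):
    """Correct outlier locations using the w frames before and after each frame.

    locations: list of (x, y) temporary locations, one per image frame.
    The first w and last w frames are left unchanged because they lack a
    full window on one side.
    """
    corrected = list(locations)
    for i in range(w, len(locations) - w):
        # w frames before frame i plus w frames after it (frame i excluded).
        neighbours = locations[i - w:i] + locations[i + 1:i + w + 1]
        avg_x = sum(x for x, _ in neighbours) / len(neighbours)
        avg_y = sum(y for _, y in neighbours) / len(neighbours)
        # Offset of the temporary location relative to the average location.
        offset = math.hypot(locations[i][0] - avg_x, locations[i][1] - avg_y)
        if offset > offset_threshold:
            corrected[i] = (avg_x, avg_y)
    return corrected


track = [(0, 0), (1, 0), (50, 0), (3, 0), (4, 0)]
smoothed = smooth_locations(track, w=2, offset_threshold=5)
```

The averages are computed from the original temporary locations rather than from already corrected ones, so one correction does not influence its neighbours; that is a design choice of the sketch, not a requirement stated in the claims.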
Regarding claim 15, Wu as modified by Wei in view of Ancona and Toshikazu teaches the apparatus according to claim 11. Ancona further teaches wherein the second processing code is further configured to cause the at least one processor to: input each image frame in the target video segment into the image detection model to obtain a model detection result that indicates respective temporary locations of the first object and the second object in the image frame of the target video segment (paragraph 19, a system for measurement of relative position of an object with respect to a point of reference, and by BRI a system includes both hardware and software/code; paragraph 22, the system includes a step for recognizing the object in an image using a classifier; paragraph 1, an object can be a ball and the reference point can be a specific line of a field such as the goal plane; paragraph 11, even though the goal plane is a still point of reference within the image frames obtained by the camera, the reference point detection process is part of the system, including machine learning and reference point delimitation). Wu as modified by Wei in view of Ancona and Toshikazu does not explicitly disclose a smoothing correction on the locations of the two objects.
However, in the same field of location detection smoothing, Gronenschild teaches processing code to perform smoothing correction on the respective temporary locations of the first object and the second object in the image frame of the target video segment to obtain the respective locations of the first object and the second object in the image frame of the target video segment (page 1880, part 2 title indicates “digitization error …. detection process”, by BRI digitization requires processing through a computer by software/code; page 1880, paragraph 4, lines 20-23, each image was processed to produce sets of detected grid points, lines 26-27, the adjustment method is applied to cancel systematic errors and random error in computed position of the grid points; page 1877, column 2, section 3 paragraph 2, line 5 grid points have certain coordinate positions in space can arrive at a very specific location in the image).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to combine the teachings of Wu as modified by Wei with the teachings of Ancona and Toshikazu to obtain a location of a first object and a location of a second object detected from image frames of a video and perform smoothing on the locations as taught by Gronenschild to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to eliminate random offset errors (page 1880, paragraph 4, lines 24-27).
Regarding claim 16, Wu as modified by Wei in view of Ancona and Toshikazu teaches the apparatus according to claim 15. Gronenschild further teaches wherein the second processing code is further configured to cause the at least one processor to obtain temporary locations of a target object in image frames of a third image group and a fourth image group (page 1880, part 2 title indicates “digitization error …. detection process”, by BRI digitization requires processing through a computer by software/code; page 1880, paragraph 4, lines 20-21, each image is processed to produce sets of grid points, and the grid point positions of the other images are used for averaging to make the overall grid point position adjustment; this reference can cover the instance where one image within the imaging chain has an offset in its grid point positions that needs to be corrected using the average location of the adjacent image frames, meaning w is 1 for this instance; another instance is when the image is located in the middle of the image chain, where the group preceding the current image and the group following the current image have the same number of images; the reference can also cover the instances where the first w frames and the last w frames do not have an offset in their locations and do not need to be averaged), the target object being any one of the first object and the second object (page 1880, section 1, lines 8-10, grid points cover an area of interest that can vary as long as it is within a small area of interest containing the object), the third image group including w image frames previous to a third image frame, the fourth image group including w image frames subsequent to the third image frame, the third image frame being any one of a plurality of image frames in the target video segment other than the first w frames and the last w frames, where w is an integer greater than or equal to 1 (page 1880, paragraph 4, lines 20-21, each image is processed to produce sets of grid points, and these grid point positions are used for averaging and adjustment; this reference can cover the instance where one image within the imaging chain has an offset that needs to be corrected using the average location of the adjacent image frames; another instance is when the image is located in the middle of the image chain, where the group preceding the current image and the group following the current image have the same number of images; the reference can also cover the instances where the first w frames and the last w frames do not have an offset in their locations and do not need to be averaged); obtaining an average location that is indicated as an average value of the temporary locations of the target object in the image frames of the third image group and the fourth image group (page 1880, paragraph 4, lines 21-24, an average is calculated to make the adjustment); and correcting the temporary location of the target object in the third image frame according to the average location (page 1880, paragraph 4, lines 21-24, the adjustment is performed so that the average position of the grid points is equal in all images; this reference can include the instance of the current invention where the average position of all preceding and following images is calculated to correct the current image grid position to match the overall average).
Regarding claim 17, Wu as modified by Wei in view of Ancona, Toshikazu and Gronenschild teaches the apparatus according to claim 16. Gronenschild further teaches wherein the second processing code is further configured to cause the at least one processor to: obtain an offset of the temporary location of the target object in the third image frame relative to the average location (page 1880, part 2 title indicates “digitization error …. detection process”, by BRI digitization requires processing through a computer by software/code; page 1880, paragraph 4, lines 14-18, when the image is processed to obtain the grid point positions, a random offset can appear between images); and correcting, in a case that the offset is greater than an offset threshold, the temporary location of the target object in the third image frame to be the average location (page 1880, paragraph 4, lines 21-24, the offset of the images is added to the grid positions so that all grid positions across images share the same average position; this reference covers the instance where an offset value is found between the point position and the preceding and following images, which can be corrected by adding the offset to that position to match the overall average).
Claims 8, 18 are rejected under 35 U.S.C. 103 as being unpatentable over WU et al. (U.S. 2008/0118153), as modified by Wen-Li Wei (Deep-Net Fusion to Classify Shots in Concert Videos) and further in view of ANCONA et al. (US 2005/0074161 A1) and in view of Karitsuka Toshikazu (foreign reference JP4606278B2), and further in view of Ed Gronenschild (“the Accuracy and Reproducibility of a Global Method to Correct for Geometric Image Distortion in the X-ray Imaging Chain”) and Zhang Lei (US 9978149 B1).
Regarding claim 8, Wu as modified by Wei in view of Ancona, Toshikazu and Gronenschild teaches the method according to claim 5, but does not explicitly teach the template matching method. Lei teaches that the method further comprises: obtaining a template image corresponding to a fourth image frame which is an image frame including a target object and that is not detected by the image detection model (column 12, lines 6-8, when no vanishing point is detected in the current image, a template in the buffer is used to find a possible matched image patch), the target object being any one of the first object and the second object (column 12, lines 14-15, a door line is the object to be detected, assuming the claimed invention only template matches one of the two objects, which can be any object of interest), the template image being an image corresponding to a temporary location of the target object in a fifth image frame which is an image frame including the target object and that is detected by the image detection model previous or subsequent to the fourth image frame among the image frames in the target video segment (column 12, lines 14-16, a door line is detected using a previously detected door line through template matching on the current image); and performing a template matching in the fourth image frame through the template image, to obtain the temporary location of the target object in the fourth image frame (column 12, line 17, when the template matches the current image patch, the previously detected door line is shifted to the new location as the new door line).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to combine the teachings of Wu as modified by Wei in view of Ancona, Toshikazu and Gronenschild to obtain locations of objects in an image and perform template matching on the locations of objects in the image based on another image frame as taught by Lei to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to correctly detect the object (Lei’s column 11, last paragraph “Template Matching for Door Detection”).
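For illustration only, the template matching limitation discussed above (locating an object in a frame where detection failed by using a patch cut from a neighbouring frame where it succeeded) can be sketched as follows. This is a minimal sketch and is not taken from Lei; the function names `crop` and `match_template`, the representation of a grayscale image as a list of rows, and the use of sum of squared differences as the matching score are assumptions introduced here.

```python
def crop(image, top, left, height, width):
    """Cut a height x width patch out of a grayscale image (list of rows)."""
    return [row[left:left + width] for row in image[top:top + height]]


def match_template(image, template):
    """Return the (top, left) position where the template fits best.

    Every position is scored by the sum of squared pixel differences;
    the lowest score wins, and the first such position is kept on ties.
    """
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    best_pos, best_score = None, float("inf")
    for top in range(ih - th + 1):
        for left in range(iw - tw + 1):
            score = sum(
                (image[top + r][left + c] - template[r][c]) ** 2
                for r in range(th)
                for c in range(tw)
            )
            if score < best_score:
                best_pos, best_score = (top, left), score
    return best_pos


# Frame where the detector found the object at row 1, column 1 (2 x 2 box).
detected_frame = [
    [0, 0, 0, 0],
    [0, 9, 8, 0],
    [0, 7, 9, 0],
    [0, 0, 0, 0],
]
template = crop(detected_frame, top=1, left=1, height=2, width=2)

# Neighbouring frame where the detector returned nothing.
missed_frame = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 9, 8],
    [0, 0, 7, 9],
]
recovered_location = match_template(missed_frame, template)
```

A practical system would more likely restrict the search to a region near the previous location and use a normalized score; the exhaustive search here is only meant to make the step concrete.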
Regarding claim 18, Wu as modified by Wei in view of Ancona, Toshikazu and Gronenschild teaches the apparatus according to claim 15, but does not teach that the second processing code is further configured to perform template matching. Lei teaches wherein the second processing code is further configured to cause the at least one processor to: obtain a template image corresponding to a fourth image frame which is an image frame including a target object and that is not detected by the image detection model (column 12, lines 1-4, a buffer is used for storing template information for template matching, and the buffer is updated online; by BRI a buffer includes a program and online updating requires processing code; column 12, lines 6-8, when no vanishing point is detected in the current image, a template in the buffer is used to find a possible matched image patch), the target object being any one of the first object and the second object (column 12, lines 14-15, a door line is the object to be detected, assuming the claimed invention only template matches one of the two objects, which can be any object of interest), the template image being an image corresponding to a temporary location of the target object in a fifth image frame which is an image frame including the target object and that is detected by the image detection model previous or subsequent to the fourth image frame among the image frames in the target video segment (column 12, lines 14-16, a door line is detected using a previously detected door line through template matching on the current image); and performing a template matching in the fourth image frame through the template image, to obtain the temporary location of the target object in the fourth image frame (column 12, line 17, when the template matches the current image patch, the previously detected door line is shifted to the new location as the new door line).
Thus, it would have been obvious for a person of ordinary skill in the art before the effective filing date to combine the teachings of Wu as modified by Wei in view of Ancona, Toshikazu and Gronenschild to obtain locations of objects in an image and perform template matching on the locations of objects in the image based on another image frame as taught by Lei to arrive at the claimed invention discussed above. Such a modification is the result of combining prior art elements according to known methods to yield predictable results. The motivation for the proposed modification would have been to correctly detect the object (Lei’s column 11, last paragraph “Template Matching for Door Detection”).

Pertinent Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Sugano Masaru et al., JP4979029B2, priority date: June 02nd, 2009, “SCENE CLASSIFICATION APPARATUS FOR MOVING IMAGE DATA”: this reference teaches shot type correction based on the images preceding and following the current frame; when those images share the same type, the correction updates the current frame’s shot type to match the preceding and following images (page 4, third paragraph from the bottom of the page).
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PHUONG HAU CAI whose telephone number is (571)272-9424. The examiner can normally be reached M-F 8:30 am - 5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Claire X. Wang, can be reached on (571) 270-1051, or the examiner’s primary examiner reviewer, Sean Conner, can be reached on (571) 272-1486. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/PHUONG HAU CAI/Examiner, Art Unit 2663
/SEAN M CONNER/Primary Examiner, Art Unit 2663