Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED OFFICE ACTION

Status of Claims

Claims 1-20 are pending in this Office Action.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

1.	Claims 1, 8 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Smith et al. (USPUB 20190042900) in view of Dong Xu (NPL DOC: "Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment," 30 May 2008, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 11, NOVEMBER 2008, Pages 1985-1995).


As per claim 1, Smith et al. teaches a computer-implemented method (Paragraphs [0144-0145]), the method comprising: receiving, by one or more processors (Paragraph [0611]: “…processors 6613-6615”), feature vectors (Paragraphs [0230-0231]) corresponding to audio and video components of a video (neural network training of visual data taught within Paragraphs [0219-0220]); providing, by one or more processors (Paragraph [0611]: “…processors 6613-6615”), the feature vectors as input to a trained neural network (Paragraphs [0284] and [0292-0293]); receiving, by one or more processors (Paragraph [0611]: “…processors 6613-6615”),
Smith et al. does not explicitly teach from the trained neural network, a plurality of output feature vectors that correspond to shots of the video; applying, by one or more processors, optimal sequence grouping to the output feature vectors; and further training, by one or more processors, the trained neural network based, at least in part, on the applied optimal sequence grouping.
However, within analogous art, Dong Xu teaches from the trained neural network, a plurality of output feature vectors that correspond to shots of the video (video clips, which one of ordinary skill in the art would regard as similar to video shots, are taught at Page 1987, Col. 2: “…utilize the information from multiple frames of a video clip for event recognition. We will extend this method to multiple levels in Section 4. One video clip P can be represented as a signature:…, where m is the total number of frames, pi is the feature extracted from the ith frame, and wpi is the weight of the ith frame. The weight wpi is used as the total supply of suppliers or the total capacity of consumers in the EMD method, with the default value of 1/m. pi can be any feature, such as Grid Color Moment [27] or CS feature [33]….”); applying, by one or more processors, optimal sequence grouping to the output feature vectors (Page 1995, Col. 1: “5.7 Concept Score Feature versus Low-Level Features – we have used 90 video programs from the TRECVID 2005 data set for training the CS feature. For fair comparison, the test videos clips for event recognition are only from another 47 video programs in the TRECVID 2005 data set such that there is no overlap between the test data used in event recognition and the training data for training the CS feature. Note that the test video clips for event recognition…three low-level features using (7) with equal weights, which was also used in [33]. In contrast to early fusion techniques [20], [45], [46], which either concatenate input feature vectors or average multiple distances or kernels from different features to generate a single kernel,…”); and further training, by one or more processors, the trained neural network based, at least in part, on the applied optimal sequence grouping (optimal flow determination from video clips for optimal matching among frames of the video is taught at Page 1989, Col. 1, lines 1-10; see also the neural network shown at Page 1986, Fig. 1, and Page 1989, Col. 1: “…In the training stage, we set hyperparameter A to _A0,where the normalization factor A0 is the mean of the EMD distances between all training video clips and the optimal scaling…”).
	One of ordinary skill in the art would have been motivated to combine the teaching of Dong Xu with the teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. because the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu provides a system and method for implementing video event recognition through video sub-clip alignment.
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu within the teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. in order to implement a system and method for video event recognition through video sub-clip alignment.

As per claim 8, Smith et al. teaches a computer program product (Paragraphs [0144-0145]) comprising: one or more computer-readable storage media and program instructions stored (Paragraphs [0128] and [0447]) on the one or more computer-readable storage media (Paragraphs [0164] and [0447]), the stored program instructions comprising: program instructions to receive feature vectors (Paragraphs [0230-0231]) corresponding to audio and video components of a video (neural network training of visual data taught within Paragraphs [0219-0220]); program instructions to provide the feature vectors as input to a trained neural network (Paragraphs [0284] and [0292-0293]); program instructions to receive from the trained neural network (Paragraphs [0234-0235]),
Smith et al. does not explicitly teach a plurality of output feature vectors that correspond to shots of the video; program instructions to apply optimal sequence grouping to the output feature vectors; and program instructions to further train the trained neural network based, at least in part, on the applied optimal sequence grouping.
However, within analogous art, Dong Xu teaches a plurality of output feature vectors that correspond to shots of the video (video clips, which one of ordinary skill in the art would regard as similar to video shots, are taught at Page 1987, Col. 2: “…utilize the information from multiple frames of a video clip for event recognition. We will extend this method to multiple levels in Section 4. One video clip P can be represented as a signature: P = {(p1, wp1), . . . , (pm, wpm)}, where m is the total number of frames, pi is the feature extracted from the ith frame, and wpi is the weight of the ith frame. The weight wpi is used as the total supply of suppliers or the total capacity of consumers in the EMD method, with the default value of 1/m. pi can be any feature, such as Grid Color Moment [27] or CS feature [33]….”); program instructions to apply optimal sequence grouping to the output feature vectors (Page 1995, Col. 1: “5.7 Concept Score Feature versus Low-Level Features – we have used 90 video programs from the TRECVID 2005 data set for training the CS feature. For fair comparison, the test videos clips for event recognition are only from another 47 video programs in the TRECVID 2005 data set such that there is no overlap between the test data used in event recognition and the training data for training the CS feature. Note that the test video clips for event recognition…three low-level features using (7) with equal weights, which was also used in [33]. In contrast to early fusion techniques [20], [45], [46], which either concatenate input feature vectors or average multiple distances or kernels from different features to generate a single kernel,…”); and program instructions to further train the trained neural network based, at least in part, on the applied optimal sequence grouping (optimal flow determination from video clips for optimal matching among frames of the video is taught at Page 1989, Col. 1, lines 1-10; see also the neural network shown at Page 1986, Fig. 1, and Page 1989, Col. 1: “…In the training stage, we set hyperparameter A to _A0,where the normalization factor A0 is the mean of the EMD distances between all training video clips and the optimal scaling…”).
	One of ordinary skill in the art would have been motivated to combine the teaching of Dong Xu with the teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. because the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu provides a system and method for implementing video event recognition through video sub-clip alignment.
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu within the teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. in order to implement a system and method for video event recognition through video sub-clip alignment.

As per claim 15, Smith et al. teaches a computer system (Paragraphs [0144-0145]), the computer system comprising: one or more computer processors (Paragraph [0611]: “…processors 6613-6615”); one or more computer readable storage medium (Paragraph [0148]); and program instructions stored on the computer readable storage medium (Paragraph [0148]) for execution by at least one of the one or more processors (Paragraphs [0112-0114]), the stored program instructions comprising: program instructions to receive feature vectors (Paragraphs [0230-0231]) corresponding to audio and video components of a video (neural network training of visual data taught within Paragraphs [0219-0220]); program instructions to provide (Paragraph [0611]: “…processors 6613-6615”) the feature vectors as input to a trained neural network (Paragraphs [0284] and [0292-0293]); receiving, by one or more processors (Paragraph [0611]: “…processors 6613-6615”),
Smith et al. does not explicitly teach program instructions to receive from the trained neural network, a plurality of output feature vectors that correspond to shots of the video; program instructions to apply optimal sequence grouping to the output feature vectors; and program instructions to further train the trained neural network based, at least in part, on the applied optimal sequence grouping.
However, within analogous art, Dong Xu teaches program instructions to receive from the trained neural network (Fig. 1 and Fig. 3), a plurality of output feature vectors that correspond to shots of the video (video clips, which one of ordinary skill in the art would regard as similar to video shots, are taught at Page 1987, Col. 2: “…utilize the information from multiple frames of a video clip for event recognition. We will extend this method to multiple levels in Section 4. One video clip P can be represented as a signature: P = {(p1, wp1), . . . , (pm, wpm)}, where m is the total number of frames, pi is the feature extracted from the ith frame, and wpi is the weight of the ith frame. The weight wpi is used as the total supply of suppliers or the total capacity of consumers in the EMD method, with the default value of 1/m. pi can be any feature, such as Grid Color Moment [27] or CS feature [33]….”); program instructions to apply optimal sequence grouping to the output feature vectors (Page 1995, Col. 1: “5.7 Concept Score Feature versus Low-Level Features – we have used 90 video programs from the TRECVID 2005 data set for training the CS feature. For fair comparison, the test videos clips for event recognition are only from another 47 video programs in the TRECVID 2005 data set such that there is no overlap between the test data used in event recognition and the training data for training the CS feature. Note that the test video clips for event recognition…three low-level features using (7) with equal weights, which was also used in [33]. In contrast to early fusion techniques [20], [45], [46], which either concatenate input feature vectors or average multiple distances or kernels from different features to generate a single kernel,…”); and program instructions to further train the trained neural network based, at least in part, on the applied optimal sequence grouping (optimal flow determination from video clips for optimal matching among frames of the video is taught at Page 1989, Col. 1, lines 1-10; see also the neural network shown at Page 1986, Fig. 1, and Page 1989, Col. 1: “…In the training stage, we set hyperparameter A to _A0,where the normalization factor A0 is the mean of the EMD distances between all training video clips and the optimal scaling…”).
	One of ordinary skill in the art would have been motivated to combine the teaching of Dong Xu with the teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. because the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu provides a system and method for implementing video event recognition through video sub-clip alignment.
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu within the teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. in order to implement a system and method for video event recognition through video sub-clip alignment.


2.	Claims 2, 9 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Smith et al. (USPUB 20190042900) in view of Dong Xu (NPL DOC: "Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment," 30 May 2008, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 11, NOVEMBER 2008, Pages 1985-1995) in further view of ZAHEER et al. (USPUB 20160050465).

As per claim 2, the combination of Smith et al. and Dong Xu teaches claim 1.
Within analogous art, ZAHEER et al. teaches the method further comprising: determining, by one or more processors, scene boundaries for the video (Paragraph [0089]: “…the shot segmentation interface 1100 as shown in FIG. 11, the user can correct any of the incorrectly identified boundaries. The interface allows increasing/decreasing scene boundaries, deleting a scene, and adding a scene on frames not being assigned to any scene…”), wherein the scene boundaries are determined based, at least in part, on a second plurality of output feature vectors received from the further trained neural network (Paragraph [0092]: “…A feature vector is then formed using these scores on which a classifier (such as SVM) is trained to classify an interval of frames as containing shot boundary or shot transition….” and Paragraph [0100]).
One of ordinary skill in the art would have been motivated to combine the teaching of ZAHEER et al. with the combined teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. and the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu because the Dynamically targeted ad augmentation in video mentioned by ZAHEER et al. provides a system and method for implementing automated augmentation of videos with content.
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the Dynamically targeted ad augmentation in video mentioned by ZAHEER et al. within the combined teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. and the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu in order to implement a system and method for automated augmentation of videos with content.

As per claim 9, the combination of Smith et al. and Dong Xu teaches claim 8.
Within analogous art, ZAHEER et al. teaches the stored program instructions further comprising: program instructions to determine scene boundaries for the video (Paragraph [0089]: “…the shot segmentation interface 1100 as shown in FIG. 11, the user can correct any of the incorrectly identified boundaries. The interface allows increasing/decreasing scene boundaries, deleting a scene, and adding a scene on frames not being assigned to any scene…”), wherein the scene boundaries are determined based, at least in part, on a second plurality of output feature vectors received from the further trained neural network (Paragraph [0092]: “…A feature vector is then formed using these scores on which a classifier (such as SVM) is trained to classify an interval of frames as containing shot boundary or shot transition….” and Paragraph [0100]).
One of ordinary skill in the art would have been motivated to combine the teaching of ZAHEER et al. with the combined teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. and the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu because the Dynamically targeted ad augmentation in video mentioned by ZAHEER et al. provides a system and method for implementing automated augmentation of videos with content.
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the Dynamically targeted ad augmentation in video mentioned by ZAHEER et al. within the combined teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. and the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu in order to implement a system and method for automated augmentation of videos with content.

As per claim 16, the combination of Smith et al. and Dong Xu teaches claim 15.
Within analogous art, ZAHEER et al. teaches the stored program instructions further comprising: program instructions to determine scene boundaries for the video (Paragraph [0089]: “…the shot segmentation interface 1100 as shown in FIG. 11, the user can correct any of the incorrectly identified boundaries. The interface allows increasing/decreasing scene boundaries, deleting a scene, and adding a scene on frames not being assigned to any scene…”), wherein the scene boundaries are determined based, at least in part, on a second plurality of output feature vectors received from the further trained neural network (Paragraph [0092]: “…A feature vector is then formed using these scores on which a classifier (such as SVM) is trained to classify an interval of frames as containing shot boundary or shot transition….” and Paragraph [0100]).
One of ordinary skill in the art would have been motivated to combine the teaching of ZAHEER et al. with the combined teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. and the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu because the Dynamically targeted ad augmentation in video mentioned by ZAHEER et al. provides a system and method for implementing automated augmentation of videos with content.
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the Dynamically targeted ad augmentation in video mentioned by ZAHEER et al. within the combined teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. and the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu in order to implement a system and method for automated augmentation of videos with content.

3.	Claims 3, 4, 10, 11, 17 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Smith et al. (USPUB 20190042900) in view of Dong Xu (NPL DOC: "Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment," 30 May 2008, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 11, NOVEMBER 2008, Pages 1985-1995) in further view of Foote et al. (USPUB 20040221237).

As per claim 3, the combination of Smith et al. and Dong Xu teaches claim 1.
Within analogous art, Foote et al. teaches the method further comprising:
generating, by one or more processors (Paragraph [0068]), a distance matrix using output feature vectors (FIGS. 31 and 35 and Paragraphs [0143] and [0153]), wherein the distance matrix is represented by a block-diagonal structure (FIG. 10 and Paragraphs [0088-0089]), and
wherein the distance matrix defines distances between the output features vectors (Paragraph [0104]: “…, the magnitudes of the difference vector and the standard deviation are computed as Euclidean distances. The magnitude of the difference vector is computed by the square root of the sum of the squares of its d entries. The standard deviation of the image class is computed as the square root of the sum of the diagonal elements of the diagonal covariance matrix….”); and identifying, by one or more processors, from the distance matrix, diagonal blocks that represent sequences of shots having respective the output feature vectors with respective distances within a certain proximity (Paragraphs [0153-0154] teach the distance matrices and the diagonal line indicating frames within the video, and Paragraphs [0155-0156] teach the feature vectors).
One of ordinary skill in the art would have been motivated to combine the teaching of Foote et al. with the combined teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. and the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu because the Methods And Apparatuses For Interactive Similarity Searching, Retrieval And Browsing Of Video mentioned by Foote et al. provides a system and method for implementing automatic similarity-based retrieval of video.
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the Methods And Apparatuses For Interactive Similarity Searching, Retrieval And Browsing Of Video mentioned by Foote et al. within the combined teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. and the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu in order to implement a system and method for automatic similarity-based retrieval of video.
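
For illustration only, the claim 3 limitations mapped above (a distance matrix over output feature vectors, and diagonal blocks of mutually close consecutive vectors) can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not taken from Foote et al., Smith et al., Dong Xu, or the application: the function names, the sample vectors, the Euclidean metric, and the fixed threshold are all hypothetical, and the greedy thresholding shown here is a stand-in, not the claimed optimal sequence grouping.

```python
import math

def distance_matrix(vectors):
    """Pairwise Euclidean distances between feature vectors (hypothetical helper)."""
    n = len(vectors)
    return [[math.dist(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

def diagonal_blocks(dist, threshold):
    """Greedily split indices 0..n-1 into runs of consecutive items whose
    pairwise distances all stay within `threshold`.
    Returns a list of (start, end) index pairs, both inclusive."""
    blocks = []
    start = 0
    for k in range(1, len(dist)):
        # Close the current block when item k is too far from any member of it.
        if any(dist[i][k] > threshold for i in range(start, k)):
            blocks.append((start, k - 1))
            start = k
    if dist:
        blocks.append((start, len(dist) - 1))
    return blocks

# Assumed toy data: five per-shot feature vectors forming two tight groups.
shots = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (5.0, 5.1), (5.1, 5.0)]
D = distance_matrix(shots)
blocks = diagonal_blocks(D, threshold=1.0)
print(blocks)
```

With the assumed data, shots 0-2 and shots 3-4 each form one low-distance block along the diagonal of D, which is the kind of block-diagonal structure the limitation recites.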

As per claim 4, the combination of Smith et al., Dong Xu and Foote et al. teaches claim 3.
Within analogous art, Foote et al. teaches the method further comprising: determining, by one or more processors, a division of scenes in the video based (Paragraph [0067]: “…The present invention includes methods for segmenting and classifying video sequences into a pre-defined set of classes. Examples of video classes include close-ups of people, crowd scenes, and shots of presentation material such as power point slides. …”), at least in part, on the diagonal blocks identified from the distance matrix (FIG. 41 and Paragraphs [0179-0180]: “FIG. 41 illustrate an inter-segment acoustic distance matrix according to the present invention. Diagonal entries 4101 through 4105 are black indicating that each segment is similar to itself. Grey regions 4106 and 4107 represent the partial similarity of the audio intervals at the beginning and end of the source audio. The white regions represent non-similarity of audio segments….”).

As per claim 10, the combination of Smith et al. and Dong Xu teaches claim 8.
Within analogous art, Foote et al. teaches the stored program instructions (Paragraph [0068]) further comprising: program instructions to generate a distance matrix using output feature vectors (FIGS. 31 and 35 and Paragraphs [0143] and [0153]), wherein the distance matrix is represented by a block-diagonal structure (FIG. 10 and Paragraphs [0088-0089]), and wherein the distance matrix defines distances between the output features vectors (Paragraph [0104]: “… the magnitudes of the difference vector and the standard deviation are computed as Euclidean distances. The magnitude of the difference vector is computed by the square root of the sum of the squares of its d entries. The standard deviation of the image class is computed as the square root of the sum of the diagonal elements of the diagonal covariance matrix.”); and program instructions to identify from the distance matrix, diagonal blocks that represent sequences of shots having respective the output feature vectors with respective distances within a certain proximity (Paragraphs [0153-0154] teach the distance matrices and the diagonal line indicating frames within the video, and Paragraphs [0155-0156] teach the feature vectors).
One of ordinary skill in the art would have been motivated to combine the teaching of Foote et al. with the combined teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. and the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu because the Methods And Apparatuses For Interactive Similarity Searching, Retrieval And Browsing Of Video mentioned by Foote et al. provides a system and method for implementing automatic similarity-based retrieval of video.
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the Methods And Apparatuses For Interactive Similarity Searching, Retrieval And Browsing Of Video mentioned by Foote et al. within the combined teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. and the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu in order to implement a system and method for automatic similarity-based retrieval of video.

As per claim 11, the combination of Smith et al., Dong Xu and Foote et al. teaches claim 10.
Within analogous art, Foote et al. teaches the stored program instructions (Paragraph [0068]) further comprising: program instructions to determine a division of scenes in the video based (Paragraph [0067]: “…The present invention includes methods for segmenting and classifying video sequences into a pre-defined set of classes. Examples of video classes include close-ups of people, crowd scenes, and shots of presentation material such as power point slides. …”), at least in part, on the diagonal blocks identified from the distance matrix (FIG. 41 and Paragraphs [0179-0180]: “FIG. 41 illustrate an inter-segment acoustic distance matrix according to the present invention. Diagonal entries 4101 through 4105 are black indicating that each segment is similar to itself. Grey regions 4106 and 4107 represent the partial similarity of the audio intervals at the beginning and end of the source audio. The white regions represent non-similarity of audio segments….”).

As per claim 17, the combination of Smith et al. and Dong Xu teaches claim 15.
Within analogous art, Foote et al. teaches the stored program instructions (Paragraph [0068]) further comprising: program instructions to generate a distance matrix using output feature vectors (FIGS. 31 and 35 and Paragraphs [0143] and [0153]), wherein the distance matrix is represented by a block-diagonal structure (FIG. 10 and Paragraphs [0088-0089]), and wherein the distance matrix defines distances between the output features vectors (Paragraph [0104]: “…, the magnitudes of the difference vector and the standard deviation are computed as Euclidean distances. The magnitude of the difference vector is computed by the square root of the sum of the squares of its d entries. The standard deviation of the image class is computed as the square root of the sum of the diagonal elements of the diagonal covariance matrix….”); and program instructions to identify from the distance matrix, diagonal blocks that represent sequences of shots having respective the output feature vectors with respective distances within a certain proximity (Paragraphs [0153-0154] teach the distance matrices and the diagonal line indicating frames within the video, and Paragraphs [0155-0156] teach the feature vectors).
One of ordinary skill in the art would have been motivated to combine the teaching of Foote et al. with the combined teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. and the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu because the Methods And Apparatuses For Interactive Similarity Searching, Retrieval And Browsing Of Video mentioned by Foote et al. provides a system and method for implementing automatic similarity-based retrieval of video.
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the Methods And Apparatuses For Interactive Similarity Searching, Retrieval And Browsing Of Video mentioned by Foote et al. within the combined teaching of the Automated semantic inference of visual features and scenes mentioned by Smith et al. and the Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment mentioned by Dong Xu in order to implement a system and method for automatic similarity-based retrieval of video.

As per claim 18, the combination of Smith et al., Dong Xu and Foote et al. teaches claim 17.
Within analogous art, Foote et al. teaches the stored program instructions further comprising: program instructions to determine a division of scenes in the video based (Paragraph [0067]: “…The present invention includes methods for segmenting and classifying video sequences into a pre-defined set of classes. Examples of video classes include close-ups of people, crowd scenes, and shots of presentation material such as power point slides. …”), at least in part, on the diagonal blocks identified from the distance matrix (FIG. 41 and Paragraphs [0179-0180]: “FIG. 41 illustrate an inter-segment acoustic distance matrix according to the present invention. Diagonal entries 4101 through 4105 are black indicating that each segment is similar to itself. Grey regions 4106 and 4107 represent the partial similarity of the audio intervals at the beginning and end of the source audio. The white regions represent non-similarity of audio segments….”).
It is noted that any citations to specific pages, columns, lines, or figures in the prior art references and any interpretation of the references should not be considered to be limiting in any way. A reference is relevant for all it contains and may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art. See MPEP 2123.
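
For illustration only, the notion of identifying diagonal blocks in an inter-segment distance matrix, as in the passage of Foote et al. quoted above, can be sketched as follows. This is a minimal sketch under stated assumptions and is neither the method of Foote et al. nor that of the claimed invention: the function names, the greedy grouping rule, the `threshold` parameter, and the toy feature values are all hypothetical.

```python
import math

def distance_matrix(features):
    """Pairwise Euclidean distances between per-segment feature vectors."""
    return [[math.dist(a, b) for b in features] for a in features]

def diagonal_blocks(dist, threshold):
    """Greedily split indices 0..n-1 into contiguous blocks along the diagonal.

    A segment joins the current block only if its distance to every segment
    already in the block is at most `threshold` (a hypothetical parameter).
    Returns a list of (start, end) index pairs, both inclusive.
    """
    blocks = []
    n = len(dist)
    if n == 0:
        return blocks
    start = 0
    for j in range(1, n):
        # Close the current block when segment j is far from any member of it.
        if any(dist[i][j] > threshold for i in range(start, j)):
            blocks.append((start, j - 1))
            start = j
    blocks.append((start, n - 1))
    return blocks

# Toy 2-D features for six segments (made-up values forming three groups).
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (9.0, 0.0)]
dist = distance_matrix(features)
scenes = diagonal_blocks(dist, threshold=1.0)
print(scenes)  # [(0, 1), (2, 4), (5, 5)]
```

Each returned pair marks a run of consecutive segments whose mutual distances are small, which corresponds to a low-distance block on the diagonal of the matrix; the boundaries between pairs are the candidate scene divisions.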

Allowable Subject Matter

4. Claims 5-7, 12-14, 19, and 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

5. The following is an examiner’s statement of reasons for the indication of allowable subject matter:

As to claim 5, prior art of record does not teach or suggest the limitation mentioned within claim 5: “…the applying of optimal sequence grouping to the output feature vectors includes: applying, by one or more processors, an optimal sequence grouping probability function to the division of the scenes in the video, resulting in a modified division of scenes; calculating, by one or more processors, an error in the modified division of scenes based, at least in part, on a division probability loss; and modifying, by one or more processors, the output feature vectors based, at least in part, on the modified division of scenes and the calculated error in the modified division of scenes.”
 
As to claim 6, claim 6 depends on objected-to, allowable claim 5; therefore, claim 6 is likewise considered objected to and allowable over the prior art of record.

As to claim 7, claim 7 depends on objected-to, allowable claim 6; therefore, claim 7 is likewise considered objected to and allowable over the prior art of record.

As to claim 12, prior art of record does not teach or suggest the limitation mentioned within claim 12: “…apply optimal sequence grouping to the output feature vectors include: program instructions to apply an optimal sequence grouping probability function to the division of the scenes in the video, resulting in a modified division of scenes; program instructions to calculate an error in the modified division of scenes based, at least in part, on a division probability loss; and program instructions to modify the output feature vectors based, at least in part, on the modified division of scenes and the calculated error in the modified division of scenes.”

As to claim 13, claim 13 depends on objected-to, allowable claim 12; therefore, claim 13 is likewise considered objected to and allowable over the prior art of record.

As to claim 14, claim 14 depends on objected-to, allowable claim 13; therefore, claim 14 is likewise considered objected to and allowable over the prior art of record.

As to claim 19, prior art of record does not teach or suggest the limitation mentioned within claim 19: “…apply optimal sequence grouping to the output feature vectors include: program instructions to apply an optimal sequence grouping probability function to the division of the scenes in the video, resulting in a modified division of scenes; program instructions to calculate an error in the modified division of scenes based, at least in part, on a division probability loss; and program instructions to modify the output feature vectors based, at least in part, on the modified division of scenes and the calculated error in the modified division of scenes.”

As to claim 20, claim 20 depends on objected-to, allowable claim 19; therefore, claim 20 is likewise considered objected to and allowable over the prior art of record.




Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
6. 	Any inquiry concerning this communication or earlier communications from the examiner should be directed to OMAR S. ISMAIL whose telephone number is (571)272-9799 and Fax # (571)273-9799. The examiner can normally be reached on M-F: 9:00 AM - 6:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http:/ If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, David C. Payne, can be reached at (571)272-3024. The fax phone number for the organization where this application or proceeding is assigned is (571)273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/OMAR S ISMAIL/
Primary Examiner, Art Unit 2637