DETAILED ACTION

Allowable Subject Matter
Claims 2, 4-16, 26 and 27 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. Claim 33 is allowed.
The prior art of record, alone or in combination, fails to fairly teach or suggest these limitations, including the concept of a computer-implemented method for temporally localizing a target action in a video, comprising: inputting a video into a machine-learned model comprising one or more weakly supervised temporal action localization models; analyzing the video by the one or more weakly-supervised temporal action localization models to determine one or more weighted temporal class activation maps; each temporal class activation map comprising a one dimensional class-specific activation map in a temporal domain; and determining a temporal location of a target action in the video based at least in part on the one or more weighted temporal class activation maps, and wherein the machine-learned model comprises a sparse temporal pooling network comprising a first weakly supervised temporal action localization model and a second weakly supervised temporal action localization model.
For example, Wang teaches an algorithm called UntrimmedNet which couples a classification module and a selection module to learn the action models and reason about the temporal localization of action instances. The goal is to temporally localize the target action using its convolutional neural network architecture under weakly supervised training. The claims, however, require an architecture which uses a technique called 'sparse temporal pooling' to improve the temporal aspect of the localization. This feature is not taught or suggested by Wang or any of the other closest art. The claim language goes beyond the similarities between these references and Applicant’s invention, and a combination could not reasonably be made without impermissible hindsight. The differences here are viewed as allowable over the prior art.
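For illustration only, the following minimal sketch shows one way a one dimensional, class-specific activation map in the temporal domain could be weighted and then used to determine a temporal location. It is not taken from the claims or from Wang: the function names, the element-wise attention weighting, and the fixed threshold of 0.5 are all assumptions made for this example.

```python
def weighted_tcam(class_scores, attention):
    """Weight per-time-step class scores by per-time-step attention (assumed scheme)."""
    return [s * a for s, a in zip(class_scores, attention)]


def localize_segments(tcam, threshold=0.5):
    """Return (start, end) index pairs where the 1-D map exceeds the threshold.

    `tcam` holds one score per time step for a single action class.
    `end` is exclusive. The threshold value is a hypothetical choice.
    """
    segments = []
    start = None
    for t, score in enumerate(tcam):
        if score > threshold and start is None:
            start = t  # a segment opens at the first step above threshold
        elif score <= threshold and start is not None:
            segments.append((start, t))  # the segment closes at the first step below
            start = None
    if start is not None:
        segments.append((start, len(tcam)))  # segment runs to the end of the video
    return segments
```

Under these assumptions, each returned pair marks a contiguous run of time steps whose weighted activation for the class stays above the threshold.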

Examiner’s Note
The text of cancelled claims should be removed from the claim listing. See cancelled claims 3, 28, 31, 32, and 34-37.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 29 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claim 29 is dependent upon cancelled claim 28. For the purposes of expedited prosecution, Examiner is interpreting this claim as dependent upon parent claim 22.
Claim 29 recites the limitation "the groundtruth video-level action classification." There is insufficient antecedent basis for this limitation in the claim since the antecedent basis was previously included in now cancelled claim 28.


Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1, 17 and 18 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Wang (“UntrimmedNets for Weakly Supervised Action Recognition and Detection”; provided by Applicant).
Regarding claim 1, Wang teaches a computer-implemented method for temporally localizing a target action in a video, comprising: (Wang, abstract, “Our UntrimmedNet couples two important components, the classification module and the selection module, to learn the action models and reason about the temporal duration of action instances, respectively.” Also see pg. 2, left column, ¶ 1 which teaches that the goal is to temporally localize the target action.)
inputting a video into a machine-learned model comprising one or more weakly supervised temporal action localization models; (See pg. 3, Fig. 2, inputting the video into the weakly supervised network for classification and temporal selection.)
analyzing the video by the one or more weakly-supervised temporal action localization models to determine one or more weighted temporal class activation maps; (Pg. 2, left column, ¶ 3 teaches that the network first determines clip proposals. The temporal selection is then performed by the soft selection step which determines an attention weight for the proposals to rank the importance of different clips. Also see pg. 4, right column, ¶ 1. This attention weight map is used for temporal localization, see pg. 5, left column, last paragraph.)
each temporal class activation map comprising a one dimensional class-specific activation map in a temporal domain; and (Pg. 4, right column, ¶ 1 teaches determining an attention weight for the proposals as the selection score $\bar{x}^{s}(c)$. This is a one dimensional class-specific selection score that serves as an activation map in a temporal domain for evaluating the temporally different clips. Pg. 1, right column, last paragraph teaches using this one dimensional class-specific selection score.)
determining a temporal location of a target action in the video based at least in part on the one or more weighted temporal class activation maps. (As above, pg. 1, right column, last paragraph teaches using the class-specific selection score $\bar{x}^{s}(c)$ as a score signifying a temporal location of a target action in the video. Also see pg. 2, left column, ¶ 1 which teaches that the goal is to temporally localize the target action.)
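As a non-authoritative illustration of the soft selection step relied on above, the sketch below turns per-clip selection scores into attention weights that rank the clips. It assumes that the weights are obtained by a softmax over the clip proposals; the function name and the example scores are hypothetical and are not quoted from Wang.

```python
import math


def soft_selection_weights(selection_scores):
    """Softmax over per-clip selection scores (assumed form of soft selection).

    Returns one non-negative weight per clip; the weights sum to 1.
    """
    m = max(selection_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in selection_scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three clip proposals of one action class.
weights = soft_selection_weights([2.0, 0.5, -1.0])
# The clip with the highest selection score receives the largest weight.
best_clip = max(range(len(weights)), key=lambda i: weights[i])
```

Under this assumption, the resulting weight sequence is one dimensional over time and specific to the class whose selection scores were supplied.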
Regarding claim 17, Wang teaches the computer-implemented method of claim 1, further comprising: determining one or more relevant target action class labels for the video based at least in part on a video-level classification score. (See pg. 4, left column, ¶ 2, “Classification module.”)
Regarding claim 18, Wang teaches the computer-implemented method of claim 1, wherein the one or more weakly supervised temporal action localization models have been trained using a training dataset comprising untrimmed videos labelled with video-level class labels of target actions. (See pg. 1, right column, ¶ 2 training the untrimmed videos with video-level class labels using a weakly supervised classifier.)

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 19-24, 29 and 30 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang (“UntrimmedNets for Weakly Supervised Action Recognition and Detection”; provided by Applicant) in view of Zhou (“An End-to-End Sparse Coding”).
Regarding claim 19, Wang teaches the computer-implemented method of claim 1, wherein the one or more weakly supervised temporal action localization models have been trained using a loss function comprising a classification loss. (See loss function at pg. 5, left column, ¶ 2 which teaches computing a loss function from the classification model loss and an ℓ1 norm of the label vector.)
In the field of machine learning, Zhou teaches computing a loss function with a sparsity loss term. (See pg. 3, Fig. 1 which teaches computing a sparsity loss term in the loss function and pg. 3, left column, ¶ 1-2 which teaches that the ℓ1 norm term in the loss function is the sparsity loss term used to control sparsity.)
It would have been obvious to one of ordinary skill in the art to have combined Wang’s machine learning loss function with Zhou’s machine learning loss function (which explicitly teaches a sparsity loss term). As noted above, Wang teaches a loss function at pg. 5, left column, ¶ 2 which teaches computing a loss term with an ℓ1 norm of the label vector. Zhou teaches that the ℓ1 norm term in the loss function is the sparsity loss term used to control sparsity. The combination constitutes the repeatable and predictable result of simply applying Zhou’s teaching here of using an ℓ1 norm term in the loss function called sparsity loss. This cannot be considered a non-obvious improvement in view of the relevant prior art here. Using known engineering design, no “fundamental” operating principle of the teachings is changed; they continue to perform the same functions as originally taught prior to being combined.
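For illustration only, a loss function comprising a classification loss and a sparsity loss of the kind discussed above can be sketched as follows. This is a minimal sketch under stated assumptions: the additive combination, the weighting factor `beta`, and the function names are hypothetical and are not taken from Wang or Zhou.

```python
def l1_norm(values):
    """L1 norm: the sum of absolute values."""
    return sum(abs(v) for v in values)


def total_loss(classification_loss, attention_weights, beta=0.1):
    """Classification loss plus a weighted L1 sparsity term (assumed combination).

    `beta` is a hypothetical trade-off factor; a larger value penalizes
    dense attention more strongly and so encourages sparser weights.
    """
    sparsity_loss = l1_norm(attention_weights)
    return classification_loss + beta * sparsity_loss
```

In this sketch, the L1 term grows with every non-zero attention weight, which is why such a term is commonly described as controlling sparsity.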
Regarding claim 20, the above combination teaches the computer-implemented method of claim 19, wherein the classification loss is determined based at least in part on a comparison of a video-level classification score and a groundtruth classification. (Wang, pg. 5, left column, “3.3 Training” teaches training based on a comparison of the output classification with the classification labels/groundtruth.)
Regarding claim 21, the above combination teaches the computer-implemented method of claim 19, wherein the sparsity loss is determined based at least in part on a L1 norm of an attention weight parameter. (Wang, pg. 5, left column, “3.3 Training”, the attention weight is used to compute $\bar{x}_{k}^{r}$ as per pg. 4, right column, equation (4), and $\bar{x}_{k}^{r}$ is used in the sparsity/L1 norm term of the loss function.)
Regarding claim 22, the above combination teaches a computer-implemented method of training a weakly supervised temporal action localization model, comprising: (See rejection of claim 1.)
inputting an untrimmed video into the weakly supervised temporal action localization model; (See pg. 3, Fig. 2, inputting the video into the weakly supervised network for classification and temporal selection.)
analyzing the untrimmed video by the weakly supervised temporal action localization model to determine a predicted score for an action classification; (As above, pg. 2, left column, ¶ 3 teaches that the network first determines clip proposals. The temporal selection is then performed by the soft selection step which determines an attention weight for the proposals to rank the importance of different clips. Also see pg. 4, right column, ¶ 1. This attention weight is used to determine a predicted score for an action classification, see pg. 5, left column, last paragraph.)
determining a loss function based at least in part on the predicted score, the loss function comprising a sparsity loss and a classification loss; and (See rejection of claim 19.)
training the weakly supervised temporal action localization model based at least in part on the loss function. (See rejection of claim 19 and Wang, pg. 5, left column, “3.3 Training”)
Regarding claim 23, the above combination teaches the computer-implemented method of claim 22, wherein analyzing the untrimmed video by the weakly supervised temporal action localization model to determine a predicted score for an action classification comprises:
sampling a plurality of segments from the untrimmed video; and (See pg. 3, Fig. 2, “clip sampling”)
analyzing each of the plurality of segments with one or more pretrained convolutional neural networks to determine a respective feature representation. (See pg. 3, Fig. 2, which teaches analyzing the segments via pretrained classification and selection convolutional neural networks.)
Regarding claim 24, the above combination teaches the computer-implemented method of claim 23, wherein analyzing the untrimmed video by the weakly supervised temporal action localization model to determine a predicted score for an action classification comprises: inputting each respective feature representation into an attention module to determine a respective attention weight. (As above, Wang pg. 2, left column, ¶ 3 teaches that the network first determines clip proposals. The temporal selection is then performed by the soft selection step which determines an attention weight for the proposals to rank the importance of different clips.)
Regarding claim 29, the above combination teaches the computer-implemented method of claim 28, wherein determining the classification loss comprises determining a multi-label cross-entropy loss between the groundtruth video-level action classification and the predicted score for the action classification. (Wang, pg. 5, left column, “3.3 Training” teaches training based on a comparison of the output classification with the classification labels/groundtruth of the various label classes.) Note the 112(b) rejection above; for purposes of this rejection, the claim is interpreted as being dependent upon claim 22.
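For illustration only, a multi-label cross-entropy loss between a groundtruth video-level label vector and predicted class scores can be sketched as follows. The sketch assumes that the multi-hot label vector is first normalized by its L1 norm and that the predicted scores are already probabilities; the function name and the `eps` guard are hypothetical.

```python
import math


def multilabel_cross_entropy(pred_probs, labels, eps=1e-12):
    """Cross-entropy between an L1-normalized label vector and predicted probabilities.

    `labels` is a multi-hot vector (1 for each action class present in the video).
    `eps` is a small constant that avoids taking log(0).
    """
    total = sum(labels)
    if total == 0:
        raise ValueError("at least one class label must be present")
    normalized = [y / total for y in labels]  # L1 normalization (assumed step)
    return -sum(y * math.log(p + eps) for y, p in zip(normalized, pred_probs))
```

Under these assumptions, a video labelled with two classes contributes half of its label weight to each class, so the loss is lowest when the predicted probability mass is split between those two classes.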
Regarding claim 30, the above combination teaches the computer-implemented method of claim 22, wherein determining the sparsity loss comprises determining the sparsity loss based at least in part on a L1 norm of one or more attention weights received from an attention module of the temporal action localization model. (See rejection of claim 19. Wang pg. 5, left column, ¶ 2 teaches computing a loss function from the classification model loss and an ℓ1 norm of a label vector in the attention weight term. Zhou teaches that the ℓ1 norm term in the loss function is the sparsity loss term used to control sparsity.)

Claim(s) 25 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang (“UntrimmedNets for Weakly Supervised Action Recognition and Detection”; provided by Applicant) in view of Zhou (“An End-to-End Sparse Coding”) and Xinggang Wang (“Revisiting Multiple Instance Neural Networks”).
Regarding claim 25, the above combination teaches the computer-implemented method of claim 24, wherein the attention module comprises a fully connected layer and a sigmoid layer. (Wang, pg. 3, Fig. 2 teaches a fully connected layer and a softmax/sigmoid layer.)
In the field of machine learning Xinggang Wang teaches a first fully connected layer, a rectified linear unit layer, a second fully connected layer, and a sigmoid layer. (Pg. 4, left column, ¶ 2 teaches a neural network including a first fully connected layer, a rectified linear unit layer, a second fully connected layer, and a sigmoid layer.)
It would have been obvious to one of ordinary skill in the art to have combined Wang’s machine learning function with Xinggang Wang’s machine learning function. Both references teach using convolutional neural network-based machine learning for the purposes of weakly-supervised learning. Wang does not provide much detail on its network’s layer architecture. Xinggang Wang teaches a specific architecture tailored for weakly-supervised learning which includes a first fully connected layer, a rectified linear unit layer, a second fully connected layer, and a sigmoid layer. The combination constitutes the repeatable and predictable result of simply applying Xinggang Wang’s teaching here. This cannot be considered a non-obvious improvement in view of the relevant prior art here. Using known engineering design, no “fundamental” operating principle of the teachings is changed; they continue to perform the same functions as originally taught prior to being combined.
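For illustration only, the layer sequence discussed above (a first fully connected layer, a rectified linear unit layer, a second fully connected layer, and a sigmoid layer) can be sketched in plain Python as follows. The function names, layer sizes, and parameter values are hypothetical and are not taken from either reference.

```python
import math


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in x]


def sigmoid(v):
    """Logistic sigmoid, mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def attention_weight(feature, w1, b1, w2, b2):
    """FC -> ReLU -> FC -> sigmoid, returning a single attention weight."""
    hidden = relu(linear(feature, w1, b1))
    (logit,) = linear(hidden, w2, b2)  # second layer has exactly one output
    return sigmoid(logit)
```

Because the final layer is a sigmoid, each segment's feature representation maps to a weight strictly between 0 and 1 in this sketch.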

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Raphael Schwartz whose telephone number is (571)270-3822. The examiner can normally be reached Monday to Friday 9am-5pm CT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent Rudolph can be reached on (571) 272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/RAPHAEL SCHWARTZ/           Examiner, Art Unit 2661      

/AMANDEEP SAINI/           Primary Examiner, Art Unit 2661