DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

This action is responsive to the original application filed on 5/17/2018.  

Claim Interpretation

The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.


This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: “a motion component” and “an action detection component” in claim 1 and its dependents and “a loss component” in claim 6 and its dependents (see footnote 1).
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
Applicant may:
(a)        Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph; or
(b)        Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)).

If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: 
(a)        Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(b)        Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  

The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-6, 11-14, 17, 18, and 20 are rejected under 35 U.S.C. § 103 as being obvious over Yang et al. (US 20170220854 A1, hereinafter “Yang”) in view of Bertasius et al. (Bertasius et al., “Object Detection in Video with Spatiotemporal Sampling Networks”, Mar. 15, 2018, arXiv:1803.05549v1, pp. 1-16, hereinafter “Bertasius”).

Regarding claim 1, Yang discloses [a] system, comprising: a memory that stores computer executable components; and a processor that executes the computer executable components stored in the memory, wherein the computer executable components comprise: ([0004]; “A multimodal sensing system”; and [0020])
a motion component ([0020] and Figure 6, 605) that extracts a motion vector from a plurality of adaptive receptive fields in a  . . . convolution layer of a neural network model (Figure 3; the figure discloses the motion component that extracts a motion vector (128xN) from a plurality of adaptive receptive fields in a convolution layer (fully connected layer) of a neural network model; and [0024]; “A computing device 14 such as a server will analyze the transferred data to extract features and identify actions embodied in the data”; and [0026]; “Examples of hand-engineered features that can be extracted from digital video signals include 3D HOG, dense trajectories, histograms of motion, optical flow vector fields, as well as temporal sequences of features that can be extracted from still images”, which discloses the adaptive receptive fields; and [0028];  and [0043])
an action detection component ([0020] and Figure 6, 605; and Figure 2, 222) that generates a spatio-temporal feature (Figure 3, “Temporal Fusion (LSTM)”; the figure discloses, under a broadest reasonable interpretation of the claim language, generating the spatio-temporal feature from a video and motion feature extracted at the same time period; and [0004]; “The system will then fuse a group of the extracted video features corresponding to a time period and a group of the extracted other features corresponding to the time period to create a fused feature representation”; and [0008]) by concatenating ([0029]; “In early fusion processes, features are extracted from each data modality and a fused representation is achieved by combining or concatenating the extracted features”; and [0043]; and [0045]) the motion vector (Figure 3, “Compressed motion feature 128xN”) with a spatial feature (Figure 3, “Compressed video feature 128xN”) extracted from the  . . .  convolution layer (Figure 3, “Fully connected layer”).
	Yang fails to explicitly disclose but Bertasius discloses the deformable convolution layer (Page 2, ¶2; “we introduce a simple, yet effective Spatiotemporal Sampling Network (STSN) that uses deformable convolutions [25] across space and time to leverage temporal information for object detection in video”; and Page 4, Figure 2; the figure discloses the deformable layers at “def. conv.”; and Page 5, ¶5; “our backbone network employs 6 deformable convolutional layers”).
Yang and Bertasius are analogous art because both are concerned with spatiotemporal feature fusion and convolutional neural networks.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in spatiotemporal feature fusion and convolutional neural networks to combine the deformable convolution layers of Bertasius with the system of Yang to yield the predictable result of a motion component that extracts a motion vector from a plurality of adaptive receptive fields in a deformable convolution layer of a neural network model; and an action detection component that generates a spatio-temporal feature by concatenating the motion vector with a spatial feature extracted from the deformable convolution layer. The motivation for doing so would be to perform object detection in a video frame by learning to spatially sample features from the adjacent frames (Bertasius; Abstract).
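
For illustration only, the concatenation recited in claim 1 can be sketched as below. This is a minimal sketch under stated assumptions, not a characterization of the implementation of Yang, Bertasius, or the application: the function name `concat_features`, the use of plain Python lists, the feature values, and the ordering (spatial feature first) are all hypothetical.

```python
from typing import List


def concat_features(motion_vector: List[float], spatial_feature: List[float]) -> List[float]:
    """Form a spatio-temporal feature by joining the two inputs end to end.

    The ordering (spatial first, motion second) is an assumption made for
    this sketch; the claim language does not fix an order.
    """
    return list(spatial_feature) + list(motion_vector)


# Hypothetical values chosen only to show the shapes involved.
spatial = [0.2, 0.7, 0.1]   # stands in for a spatial feature
motion = [0.05, -0.10]      # stands in for a motion vector
spatio_temporal = concat_features(motion, spatial)
print(len(spatio_temporal))  # 5
```

The resulting feature has a length equal to the sum of the lengths of its two inputs, which is the only property this sketch is meant to show.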

	Regarding claim 11, it is a method claim corresponding to the steps of claim 1, and is rejected for the same reasons as claim 1.

	Regarding claim 17, it is a computer program product claim corresponding to the steps of claim 1, and is rejected for the same reasons as claim 1.

	Regarding claims 2, 12, and 18, the rejections of claims 1, 11, and 17 are incorporated and Yang further discloses wherein the spatio-temporal feature is a vector that characterizes a fine-grained action associated with the spatial feature ([0033]; “The output of the last layer of the LSTM is a set of cross-modality or fused features that can be used in support of automated decision-making processes”, which discloses that the output of the temporal fusion of features is a vector that is used to characterize or classify a fine-grained action associated with the spatial feature; and Figure 2, Elements 202, 212, 221, and 222; and [0036]; “The processor will implement instructions to analyze the fused feature representation to perform a classification process 222 that includes identifying a class that applies to both the extracted video features and the extracted other data features, and also identifying an action that is associated with the class”; and Figure 3; the temporal fusion element results in a vector, under a broadest reasonable interpretation of the claim language).

Regarding claims 3 and 20, the rejections of claims 1 and 17 are incorporated and Yang further discloses wherein the neural network model is trained end-to-end ([0036]; “In another embodiment, the system performs the classification by applying a previously trained statistical classifier which was trained to learn the correspondences between extracted features and actions”; and [0040]; and [0042]; “they may be trained, either in a supervised or an unsupervised manner”; and Claim 4; “end-to-end deep neural network”).

Regarding claim 4, the rejection of claim 1 is incorporated and Yang further discloses the motion vector (Figure 3, “Compressed motion feature 128xN”).
Yang fails to explicitly disclose but Bertasius discloses wherein the neural network model comprises a plurality of deformable convolution layers, wherein the deformable convolution layer is comprised within the plurality of deformable convolution layers, and wherein the motion vector is extracted from the plurality of deformable convolution layers (Page 2, ¶2; “we introduce a simple, yet effective Spatiotemporal Sampling Network (STSN) that uses deformable convolutions [25] across space and time to leverage temporal information for object detection in video”; and Page 4, Figure 2; the figure discloses the deformable layers at “def. conv.”; and Page 5, ¶5; “our backbone network employs 6 deformable convolutional layers”).
The motivation to combine Yang and Bertasius is the same as discussed above with respect to claim 1.

Regarding claims 5 and 13, the rejections of claims 1 and 11 are incorporated and Yang further discloses wherein the motion vector is extracted by computing a difference between a first adaptive receptive field from the plurality of adaptive receptive fields at a first time frame and a second adaptive receptive field from the plurality of adaptive receptive fields at a second time frame ([0026]; “Examples of hand-engineered features that can be extracted from digital video signals include 3D HOG, dense trajectories, histograms of motion, optical flow vector fields, as well as temporal sequences of features that can be extracted from still images”; and [0032]; “The system will extract other features from the other sensor data 212 so that each extracted feature is associated with a time period (single point or time span) corresponding to the time stamp(s) associated with the motion data from which the feature is extracted.”; and [0045]; “Consequently, a 7-dimensional motion vector may be available for each time stamp”).
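
For illustration only, the difference computation recited in claims 5 and 13 can be sketched as below. This is a minimal sketch under stated assumptions and does not describe the implementation of any cited reference or of the application: each adaptive receptive field is assumed to be representable as a list of (row, column) sampling offsets, and the names `Offset` and `motion_vector` and all values are hypothetical.

```python
from typing import List, Tuple

# Assumption for this sketch: a receptive field is a list of (row, col) offsets.
Offset = Tuple[float, float]


def motion_vector(field_t0: List[Offset], field_t1: List[Offset]) -> List[float]:
    """Element-wise difference between the fields at two time frames.

    Returns a flat list [d_row_0, d_col_0, d_row_1, d_col_1, ...].
    """
    if len(field_t0) != len(field_t1):
        raise ValueError("receptive fields must have the same number of offsets")
    diff: List[float] = []
    for (r0, c0), (r1, c1) in zip(field_t0, field_t1):
        diff.extend([r1 - r0, c1 - c0])
    return diff


# Hypothetical offsets at a first and a second time frame.
first_frame = [(0.0, 0.0), (1.0, 0.5)]
second_frame = [(0.5, 0.0), (1.0, 1.5)]
print(motion_vector(first_frame, second_frame))  # [0.5, 0.0, 0.0, 1.0]
```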

Regarding claims 6 and 14, the rejections of claims 1 and 11 are incorporated and Yang further discloses a loss component that computes a motion loss from an aggregation of a plurality of motion vectors extracted by the motion component, wherein the motion loss is a regularization that enforces a consistency of learned motion characterized by the plurality of motion vectors over a period of time ([0028]; “optimal feature representation may be obtained by finding the set of weights that minimize a loss function between an output elicited by a given input and the label of the input”; and [0034]; “In one embodiment, where the features used for the representation of each modality are hand-engineered, the system may learn the parameters of the LSTM by minimizing a loss function between the output produced by incoming features associated with data of a given class and the desired output which corresponds to the class label”; and [0035]).




Claims 7 and 15 are rejected under 35 U.S.C. § 103 as being obvious over Yang in view of Bertasius and further in view of Haeusser et al. (US 20200057936 A1, hereinafter “Haeusser”).

Regarding claims 7 and 15, the rejections of claims 1, 6, 11, and 14 are incorporated and Yang further discloses wherein the loss component further computes a class loss from a second aggregation of a plurality of spatial features extracted by the action detection component ([0028]; “optimal feature representation may be obtained by finding the set of weights that minimize a loss function between an output elicited by a given input and the label of the input”; and [0034]; “In one embodiment, where the features used for the representation of each modality are hand-engineered, the system may learn the parameters of the LSTM by minimizing a loss function between the output produced by incoming features associated with data of a given class and the desired output which corresponds to the class label”; and [0035]).
Yang fails to explicitly disclose but Haeusser discloses wherein the class loss is cross-entropy loss that enforces a correctness of predicted labels generated by the neural network model ([0052]; “In these implementations, the loss function can also include a classification loss term that is a cross-entropy loss between, for each labeled training item, the label for the training item and the classification for the training item. Thus, by performing the iteration, the system minimizes this cross-entropy loss to determine a third value update to the current values of the network parameters that increases the accuracy of the classification outputs generated by the neural network” (emphasis added)).
Yang, Bertasius, and Haeusser are analogous art because all are concerned with neural networks.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in neural networks to combine the cross-entropy loss of Haeusser with the class loss of Yang to yield the predictable result of wherein the loss component further computes a class loss from a second aggregation of a plurality of spatial features extracted by the action detection component, wherein the class loss is cross-entropy loss that enforces a correctness of predicted labels generated by the neural network model. The motivation for doing so would be to increase the accuracy of the classification outputs generated by the neural network (Haeusser; [0052]).
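
For illustration only, a cross-entropy class loss of the kind discussed above can be sketched as below. This is a minimal sketch under stated assumptions, not the implementation of Haeusser, Yang, or the application: the function names, the use of a mean as the aggregation, and the probability values are hypothetical.

```python
import math
from typing import List


def cross_entropy(probs: List[float], label: int) -> float:
    """Negative log of the probability the model assigns to the true label."""
    return -math.log(probs[label])


def class_loss(batch_probs: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy over a batch (the mean is an assumption of this sketch)."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, labels)]
    return sum(losses) / len(losses)


# The same hypothetical prediction scored against a matching and a non-matching label.
confident_correct = class_loss([[0.9, 0.1]], [0])
confident_wrong = class_loss([[0.9, 0.1]], [1])
print(confident_correct < confident_wrong)  # True
```

The loss is smaller when the predicted distribution places more probability on the true label, which is the sense in which minimizing it rewards correct predicted labels.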

Claims 8 and 16 are rejected under 35 U.S.C. § 103 as being obvious over Yang in view of Bertasius and further in view of Srivatsa et al. (US 20190291723 A1, hereinafter “Srivatsa”).

Regarding claims 8 and 16, the rejections of claims 1 and 11 are incorporated and Yang fails to explicitly disclose but Srivatsa discloses wherein the neural network model is a single stream model ([0006]; “In an embodiment, the convolutional neural network is a single stream convolutional neural network” (emphasis added); and Claim 11).
Yang, Bertasius, and Srivatsa are analogous art because all are concerned with neural networks.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in neural networks to combine the single stream model of Srivatsa with the system of Yang and Bertasius to yield the predictable result of wherein the neural network model is a single stream model. The motivation for doing so would be to provide for object localization for obstacle avoidance (Srivatsa; [0001]).

Claims 9 and 19 are rejected under 35 U.S.C. § 103 as being obvious over Yang in view of Bertasius and further in view of Buckler et al. (US 20200410352 A1, hereinafter “Buckler”).

Regarding claims 9 and 19, the rejections of claims 1 and 17 are incorporated and Yang fails to explicitly disclose but Buckler discloses wherein the motion component extracts the motion vector by computing a difference in the plurality of adaptive receptive fields on a plurality of feature spaces of the neural network model ([0012]; “Estimating the motion of each pixel block of the receptive fields in the plurality of receptive fields relative to a corresponding prior location in the first input frame may include calculating absolute pixel difference sums for each pixel block in the plurality of receptive fields”; and [0013]; “Estimating the motion of the pixel block of each receptive field in the plurality of receptive fields relative to a prior location in the first input frame can include computing a difference of each S×S pixel tile in the plurality of receptive fields relative to a number of potential prior locations of each S×S pixel tile in the first input frame, wherein the number of potential prior locations is greater than 1”).
Yang, Bertasius, and Buckler are analogous art because all are concerned with neural networks.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in neural networks to combine the receptive field difference computation of Buckler with the system of Yang and Bertasius to yield the predictable result of wherein the motion component extracts the motion vector by computing a difference in the plurality of adaptive receptive fields on a plurality of feature spaces of the neural network model. The motivation for doing so would be to facilitate the performance of computer vision tasks by processing spatial data (Buckler; [0002-0003]).

Claim 10 is rejected under 35 U.S.C. § 103 as being obvious over Yang in view of Bertasius and further in view of Varadarajan et al. (US 20190042851 A1, hereinafter “Varadarajan”).

Regarding claim 10, the rejection of claim 1 is incorporated and Yang fails to explicitly disclose but Varadarajan discloses wherein the action detection component generates the spatio-temporal feature in a cloud computing environment ([0056]; the paragraph discloses the use of a cloud server 100 that is used for action detection).
Yang, Bertasius, and Varadarajan are analogous art because all are concerned with detection systems.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in detection systems to combine the cloud computing environment of Varadarajan with the system of Yang and Bertasius to yield the predictable result of wherein the action detection component generates the spatio-temporal feature in a cloud computing environment. The motivation for doing so would be to recognize an abnormal activity and one or more persons associated with the abnormal activity in a video frame of the video stream (Varadarajan; Abstract).

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Brent Hoover whose telephone number is (303)297-4403. The examiner can normally be reached Monday - Friday 9-5 MST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Kawsar, can be reached on 571-270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/BRENT JOHNSTON HOOVER/Examiner, Art Unit 2127                                                                                                                                                                                                        

1 Note that the Specification appears to provide sufficient structural support for “a motion component”, “an action detection component”, and “a loss component” in at least paragraphs [0030] and [0105] and Figures 1 and 10 of the originally filed specification, and all of the components appear to be generic processing elements.