Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-10 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Lee (US 2020/0117906).
As per claim 1, Lee teaches a method for determining action recognition in frames of a video through spatio-temporal object tracking (Lee, ¶[0025] “Certain aspects involve using a space-time memory network to locate one or more target objects in video content for segmentation or other object classification.”), the method comprising: detecting visual objects in frames of the video (Lee, ¶[007] “a neural network model includes an object-specific detector that independently processes each video frame to segment out the target object.”); linking visual objects that are the same through time to form object tracks (Lee, Fig. 3, showing the frames linked through time to track the object); organizing and combining the object tracks with embeddings (Lee, Fig. 3, where encoding operation 302a represents embeddings); applying the organized and combined object tracks to a neural network model (Lee, ¶[007] “a neural network model includes an object-specific detector that independently processes each video frame to segment out the target object.” and ¶[0025], object classification), said model trained to generate representative embeddings and discriminative video features through high-order interaction formulated as a matrix operation without iterative processing delay (Lee, ¶[0034] “one example of a data structure for a probability map is a two-dimensional matrix with entries corresponding to pixels in a digital image, wherein each entry reflects the likelihood that the corresponding pixel is part of a target object.” and ¶[0097] “In some aspects, the video processing engine 102 can use a subset of previous frames, rather than an entire set of previous frames, when applying the space-time memory network 103, which can thereby address one or more of these issues.” Because frames are not repeated, no frame is processed a second time, so there is no iterative processing delay).

As per claim 2, Lee teaches the method of claim 1, wherein the neural network model is a transformer (Lee, ¶[0031] “The space-time memory network 103 can be a neural network model having external memory storage (e.g., the video data store 104) to which information can be written and from which information can be read.” The network is a transformer because it outputs new frame combinations).

As per claim 3, Lee teaches the method of claim 1, wherein the neural network model includes redesigning input token embeddings for relationship modeling employing a transformer encoder for embedding sequence of image features per frame (Lee, Fig. 3, where the encoding operation represents a transformer encoder and is applied per frame).

As per claim 4, Lee teaches the method of claim 1, wherein the neural network model includes redesigning input token embeddings for relationship modeling employing a transformer encoder for embedding sequence of top-K object features per frame (Lee, ¶[0025] “leveraging the guidance provided by this stored classification data can avoid the inefficient utilization of computing resources that is present in online learning methods. In additional or alternative aspects, the space-time memory network can provide greater flexibility than existing memory networks that could be used for object

As per claim 5, Lee teaches the method of claim 1, wherein the neural network model includes redesigning input token embeddings for relationship modeling using a transformer encoder for embedding sequence of image + object features per frame (Lee, Fig. 3, “key-value embedding” using encoder 302a).

As per claim 6, Lee teaches the method of claim 1, wherein the applying and organizing includes top 15 objects per frame in transformer based interaction modelling unit with position embeddings (Lee, ¶[0025]: the classification would yield the top objects, and after the system runs, 15 objects would be identified).

As per claim 7, Lee teaches the method of claim 6 further comprising with 2 layers of transformer encoder having 2 parallel heads each (Lee, elements 302a and 304b represent 2 layers of the encoder, and, as shown, there are 2 parallel heads, for example 304c).

As per claim 8, Lee teaches the method of claim 1, wherein the neural network includes the object tracks are then further organized and input to a model that is trained to generate representative embeddings and discriminative video features through high-order interaction which is formulated as an efficient matrix operation without iterative processing delay (Lee, ¶[0081] “The video processing engine 102 also applies a concatenation 626 that concatenates the memory value map, as modified via the transpose and reshaping operation 618 and the matrix product operation 627, with the query value map 608. The concatenation 626 outputs an output value map 628 (i.e., value map y). The output value

As per claim 9, Lee teaches the method of claim 1, wherein the neural network includes a tracking enabled action recognition process for intra-tracklet and inter-tracklet attention (Lee, Fig. 3, where elements 108 to 106 represent intra-tracklet and inter-tracklet attention, since the arrangement is right to left).

As per claim 10, Lee teaches the method of claim 1, wherein the neural network includes a video representation input to a tracklet transformer which operationally produces a classification (Lee, ¶[0032] “The semi-supervised video object segmentation involves identifying feature classification data (e.g., a segmentation mask) for a first frame based on one or more user inputs (e.g., boundary clicks). The semi-supervised video feature classification also involves estimating the feature-classification data (e.g., segmentation masks) of other frames in the video that include the target feature of object.” This represents operationally producing a classification, and the tracklet transformer is shown in Fig. 3).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SANTIAGO GARCIA whose telephone number is (571) 270-5182. The examiner can normally be reached Monday-Friday 9:30am-5:30pm.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Le Vu, can be reached at (571) 272-7332. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SANTIAGO GARCIA/
Primary Examiner, Art Unit 2668



/SG/