DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of the Claims
Original claims 1-20, filed December 18, 2019, are pending in the instant application.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 

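As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 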
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 

Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
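Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material, or acts to entirely perform the recited function.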
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1-9 and 12-15 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by ‘Li’ (“Multiple Object Tracking with Motion and Appearance Cues,” 1 September 2019).
Regarding claim 1, Li discloses a method for object detection or tracking (e.g. Figure 2), comprising:
receiving a first frame comprising a candidate object (e.g. Figure 2, input frame at time t-1, which includes people candidate objects); 
detecting, via a cascade neural network (Figure 2, object detector; Fifth page, sentence spanning columns, three object detection algorithms are used, including Cascade R-CNN and improved Cascade R-CNN, both of which are cascade neural networks), first object recognition information (Fifth page, right column, Evaluation metrics, item 1, “Each algorithm outputs a list of bounding boxes with confidence scores and the corresponding identities”; Also see Figure 2, top-right image, bounding boxes from object detector are superimposed [note that they are small and hard to see without zooming in]) based at least in part on one or more of the first frame (e.g. Figure 2, object detection is performed on first frame) or a portion of the first frame, the first object recognition information comprising one or more of the candidate object (Fifth page, right column, Evaluation metrics, item 1, corresponding identity of bounding box) or a first candidate bounding box associated with the candidate object (Fifth page, right column, Evaluation metrics, item 1, bounding box)
detecting, via the cascade neural network, second object recognition information based at least in part on one or more of the first object recognition information, a second frame (Figure 2, object detector is also applied to second frame at time t), or a portion of the second frame, the second object recognition information comprising one or more of the candidate object in the second frame (Fifth page, right column, Evaluation metrics, item 1, corresponding identity of bounding box), a second candidate bounding box associated with the candidate object (Fifth page, right column, Evaluation metrics, item 1, bounding box; also note the bounding boxes superimposed on the lower-right image of Figure 2), or one or more features of the candidate object (see e.g. Figure 4 and Section 3.4, features may be extracted from candidate object detected in the second frame); 
estimating, via the cascade neural network, motion information associated with the candidate object in the first frame (Figure 2, motion estimation; Section 3.2, especially second paragraph, optical flow is used to update bounding box position from where it was detected by the cascade neural network in the first frame to where it is estimated to have moved to in the second frame); and 
tracking the candidate object in the second frame (Figure 2, First Level Matching through Tracking Results; also see Section 3.4 and Figure 5) based at least in part on the motion information (Section 3.4, second paragraph, first level matching includes determining whether intersection over union (i.e. IoU) between motion estimation – i.e. flow-based motion prediction from first to second frame – and detection in second frame satisfies a threshold; Also see Figure 3, Section 3.3, auxiliary tracker uses motion estimation for tracking if detection from first frame isn’t matched to a corresponding detection in second frame).

Regarding claim 2, Li discloses the method of claim 1, and Li further discloses:
determining, via the cascade neural network, third object recognition information based at least in part on the motion information, the third object recognition information comprising one or more of the candidate object, the first candidate bounding box associated with the candidate object (Section 3.2, second paragraph, first candidate bounding box is projected onto second frame using flow), one or more object features of the candidate object, or a combination thereof, 
wherein tracking the candidate object in the second frame is based at least in part on the third object recognition information (e.g. Figure 5, Section 3.4, first level matching).

Regarding claim 3, Li discloses the method of claim 2, and Li further discloses:
detecting one or more additional candidate objects in one or more of the first frame or the portion of the first frame (e.g. Figure 2, top-right image, several additional candidate people objects are detected in the first frame, as shown by their superimposed bounding boxes),
wherein the third object recognition information comprises one or more of the one or more additional candidate objects or additional candidate bounding boxes associated with the one or more additional candidate objects (Section 3.2, second paragraph, each of the candidate bounding boxes from first frame – including any additional candidate bounding boxes – are projected onto second frame using flow).

Regarding claim 4, Li discloses the method of claim 1, and Li further discloses:
determining an absence of the candidate object over a quantity of frames, wherein the quantity of frames comprises at least the first frame and the second frame (Section 3.3, when object is missing detection for up to $t_{max}$ frames); and 
pausing the tracking based at least in part on the absence of the candidate object over the quantity of frames (Section 3.3, position of object is predicted for up to $t_{max}$, but tracking will be terminated unless a matching detection is found; This is within the scope of “pausing the tracking” at least because the tracking has not yet been stopped/terminated, but it also has not continued with an additional detection either).

Regarding claim 5, Li discloses the method of claim 4, and Li further discloses:
comparing the absence of the candidate object over the quantity of frames to a threshold, wherein pausing the tracking is based at least in part on the absence of the candidate object over the quantity of frames satisfying the threshold (Section 3.3, the pausing is performed if the candidate object has been absent for less than $t_{max}$ frames).

Regarding claim 6, Li discloses the method of claim 1, and Li further discloses:
determining an absence of the candidate object over a quantity of frames, wherein the quantity of frames comprises at least the first frame and the second frame (Section 3.3, when object disappears due to missing detection for greater than $t_{max}$ frames); and 
terminating the tracking based at least in part on the absence of the candidate object over the quantity of frames (Section 3.3, object is assumed to have disappeared and track is terminated if missing for greater than $t_{max}$ frames).

Regarding claim 7, Li discloses the method of claim 6, and Li further discloses:
comparing the absence of the candidate object over the quantity of frames to a threshold, wherein terminating the tracking is based at least in part on the absence of the candidate object over the quantity of frames satisfying the threshold (Section 3.3, object is assumed to have disappeared and track is terminated if missing for greater than $t_{max}$ frames).
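
For illustration only, the Section 3.3 behavior mapped to claims 4-7 can be sketched in Python as follows. This sketch is not taken from Li. The names TrackState, track_state, and missed_frames are hypothetical, and the sketch assumes that a track is terminated only after the object has gone undetected for more than t_max frames and is otherwise held (i.e. paused) while its position is predicted.

```python
from enum import Enum


class TrackState(Enum):
    ACTIVE = "active"          # object was detected in the current frame
    PAUSED = "paused"          # detection missing, position still predicted
    TERMINATED = "terminated"  # object assumed to have disappeared


def track_state(missed_frames: int, t_max: int) -> TrackState:
    """Classify a track by how many consecutive frames lacked a detection.

    Hypothetical helper: t_max plays the role of the threshold discussed
    above; the exact boundary handling is an assumption of this sketch.
    """
    if missed_frames == 0:
        return TrackState.ACTIVE
    if missed_frames > t_max:
        return TrackState.TERMINATED
    return TrackState.PAUSED
```

Under these assumptions, a single comparison against t_max distinguishes the paused condition of claims 4-5 from the terminated condition of claims 6-7.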

Regarding claim 8, Li discloses the method of claim 1, and Li further discloses:
determining, based at least in part on the second object recognition information, a first confidence score of one or more of the candidate object in the second (Examiner notes that two alternative mappings each read on the claimed invention; Mapping A: Figure 5 and Section 3.4, IoU is determined based at least on second candidate bounding box; Figure 5 and Section 3.4, appearance distance is calculated based on features of the candidate object; Both of these matching metrics fall within the scope of a confidence score, and either can be a first or second confidence score; Mapping B: Section 4.1, Evaluation metrics, item 1, “Each algorithm outputs a list of bounding boxes with confidence scores”; The first confidence score may be a confidence score associated with the second candidate bounding box); and 
determining, based at least in part on third object recognition information (Mapping A: flow-based projection of bounding box from first frame to second frame – see mapping for claim 2; Mapping B: detection results for a third frame), a second confidence score of one or more of the candidate object, the first candidate bounding box associated with the candidate object, one or more object features of the candidate object, or a combination thereof (Mapping A: Figure 5 and Section 3.4, IoU is determined based at least on flow-based projection of first candidate bounding box onto the second frame; Figure 5 and Section 3.4, appearance distance is calculated based on features of the candidate object in the first frame; Both of these matching metrics fall within the scope of a confidence score, and either can be a first or second confidence score; Mapping B: Section 4.1, Evaluation metrics, item 1, “Each algorithm outputs a list of bounding boxes with confidence scores”; The second confidence score may be, for example, a confidence score associated with a bounding box of the candidate object found in a third frame – i.e. third object recognition information), 
wherein tracking the candidate object in the second frame is based at least in part on one or more of the first confidence score or the second confidence score (Mapping A: Figure 5, Section 3.4, tracking is based on both the IoU and appearance distance confidence scores; Mapping B: Section 3.1, second-to-last paragraph, tracking – including tracking of the second frame – is based on length of a track meeting a threshold $t_{min}$, and “detection score” – i.e. confidence score – also meeting a threshold $\sigma_{h}$).
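
For illustration only, the Mapping B criterion (track length meeting t_min and detection score meeting σ_h) can be sketched in Python as follows. This sketch is not taken from Li. The function name keep_track is hypothetical, and the sketch assumes that the highest detection score within a track is the score compared against σ_h.

```python
from typing import Sequence


def keep_track(scores: Sequence[float], t_min: int, sigma_h: float) -> bool:
    """Decide whether a track is retained.

    scores holds one detection confidence score per frame of the track.
    Assumption of this sketch: the track is kept when it spans at least
    t_min frames and its best detection score is at least sigma_h.
    """
    if not scores:
        return False
    return len(scores) >= t_min and max(scores) >= sigma_h
```

For example, under these assumptions a three-frame track with scores 0.4, 0.9, and 0.5 is kept for t_min = 3 and sigma_h = 0.8, whereas a one-frame track is not.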

Regarding claim 9, Li discloses the method of claim 8, and Li further discloses:
determining a union between the second object recognition information and the third object recognition information by comparing the second object recognition information and the third object recognition information (Section 3.4, IoU between object recognition information – i.e. bounding boxes – in a current frame and projected from prior frame is calculated; Section 3.1 provides more detail about IoU, including that it involves determining a union; Mapping A: Section 3.4, IoU between detection bounding box in second frame and flow-based projection of detection bounding box from first frame to second frame – i.e. between the detected object and the tracked object; Mapping B: Section 3.4, IoU between detection bounding box in third frame and flow-based projection of detection bounding box from second frame to third frame – i.e. between the detected object and the tracked object); and 
determining that the union satisfies a threshold (e.g. Section 3.4, $\sigma_{IoU1}$), wherein tracking the candidate object in the second frame is based at least in part on the union satisfying the threshold (Section 3.4, if threshold is satisfied, then the candidate object in the second frame is considered to match the candidate object in the other frame – i.e. the first frame in Mapping A or the third frame in Mapping B – and the tracking is continued).
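
For illustration only, the IoU comparison relied upon for claim 9 can be sketched in Python as follows. This sketch is not taken from Li. The box format (x1, y1, x2, y2), the single per-box displacement standing in for the optical flow, the function names, and the default threshold value are all hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter  # the "union" term of the IoU
    return inter / union if union > 0 else 0.0


def project_box(box: Box, flow: Tuple[float, float]) -> Box:
    """Shift a box by one (dx, dy) displacement (simplified flow model)."""
    dx, dy = flow
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


def first_level_match(prev_box: Box, flow: Tuple[float, float],
                      detection: Box, iou_threshold: float = 0.5) -> bool:
    """True when the projected box and the new detection overlap enough."""
    return iou(project_box(prev_box, flow), detection) >= iou_threshold
```

Under these assumptions, a detection in the current frame is matched to the tracked object when the IoU between the projected box and the detection meets the threshold.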

Regarding claim 12, Li discloses the method of claim 1, and Li further discloses that detecting the first object recognition information further comprises: 
detecting the first object recognition information based at least in part on a frame count associated with the first frame (e.g. Sections 3.1 and 3.4, any detection not matched to previous frame – i.e. any detection for which the frame count is zero – is created as a new track including a new candidate object); and 
detecting the second object recognition information further comprises detecting the second object recognition information based at least in part on one or more of the frame count associated with the first frame (e.g. Sections 3.1 and 3.4, second object recognition information is matched to track that was started or not in the first frame based on the first frame count) or a frame count associated with the second frame (e.g. Section 3.1, second-to-last paragraph, Section 3.3, track may be deleted, paused, or terminated depending on frame count at second frame and whether a particular object has been detected as part of the second object recognition information).

Regarding claim 13, Li discloses the method of claim 1, and Li further discloses:
capturing one or more of the first frame, the second frame, or a third frame (e.g. Figure 3 illustrates operating on not only the first and second frames of Figure 2, but also third, fourth, etc. frames); 
estimating second motion information associated with the candidate object in the second frame (Section 3.2, second paragraph, flow-based projection of candidate object position in previous frame to current frame, where previous frame is second frame and current frame is third frame); and 
tracking the candidate object in the third frame based at least in part on the second motion information (Section 3.4, tracked object – i.e. the second motion information – is used to calculate IoU, which is used to control the tracking).

Regarding claim 14, Li discloses the method of claim 13, and Li further discloses that one or more of the first frame, the second frame, or the third frame are contiguous (According to the mapping applied to claim 13, the first and second frames are contiguous – i.e. they are adjacent in time without any intervening frames).

Regarding claim 15, Li discloses the method of claim 13, and Li further discloses that one or more of the first frame, the second frame, or the third frame are noncontiguous (According to the mapping applied to claim 13, the first and third frames are noncontiguous – i.e. they are not adjacent in time and are separated by an intervening frame, which is the second frame).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim(s) 10-11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Li in view of ‘Girshick’ (“Fast R-CNN,” 2015).
Regarding claim 10, Li teaches the method of claim 1.
Li cites multiple object detection algorithms that can be used in its method, such as Cascade R-CNN (e.g. Section 4.1, sentence spanning columns), but does not itself teach details of these algorithms, such as whether they include scaling frames based on a parameter.

However, Girshick does teach, as part of an object detection algorithm (Fast R-CNN), scaling an input frame based at least in part on a parameter (Section 5.2, single-scale implementation includes scaling input images based on $s$ and a maximum side length, both of which are parameters).
Girshick teaches that scaling is advantageous for object detectors for multiple reasons (Section 5.2).  In one example, scaling advantageously ensures that input images and the object detection neural networks that process them can fit within a GPU’s memory (Section 5.2, second and third paragraphs).  In another example, scaling so that input images have a single scale provides good object detection with lower performance cost than multi-scale object detection, and thereby advantageously “offers the best tradeoff between speed and accuracy” (Section 5.2, last two paragraphs).
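
For illustration only, single-scale rescaling of the kind discussed above can be sketched in Python as follows. This sketch is not code from Girshick. The function names are hypothetical, and the default values (shorter side s = 600 pixels, longest side capped at 1000 pixels) are assumed here as example parameters.

```python
from typing import Tuple


def scale_factor(width: int, height: int,
                 s: int = 600, max_side: int = 1000) -> float:
    """Factor that brings the shorter side to s, capped by max_side."""
    factor = s / min(width, height)
    # If the longer side would exceed the cap, scale by the cap instead.
    if factor * max(width, height) > max_side:
        factor = max_side / max(width, height)
    return factor


def scaled_size(width: int, height: int,
                s: int = 600, max_side: int = 1000) -> Tuple[int, int]:
    """Output (width, height) of a frame after single-scale rescaling."""
    f = scale_factor(width, height, s, max_side)
    return (round(width * f), round(height * f))
```

Both s and max_side act as parameters on which the scaling is based, so any detection performed on the rescaled frame depends at least in part on them.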
Were Li modified to scale the frames input to its object detectors as taught by Girshick, then one or more of the first frame or the portion of the first frame would be scaled based at least in part on a parameter, and detecting the first object recognition information comprising one or more of the candidate object or the first candidate bounding box associated with the candidate object (i.e. the output of the object detectors) would be based at least in part on the scaling (because the scaling operates on the inputs to the object detectors).
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to modify the method of Li with the frame scaling of Girshick in order to improve the method with the reasonable expectation that this would result in a method that advantageously ensured that its object detection did not require too much memory and offered the best tradeoff between speed and accuracy.  This technique for improving the method of Li was within the ordinary ability of one of ordinary skill in the art based on the teachings of Girshick.
Therefore, it would have been obvious to one of ordinary skill in the art to combine the teachings of Li and Girshick in order to obtain the invention as specified in claim 10.

Regarding claim 11, Li teaches the method of claim 1.
Li cites multiple object detection algorithms that can be used in its method, such as Cascade R-CNN (e.g. Section 4.1, sentence spanning columns), but does not itself teach details of these algorithms, such as whether they include scaling frames based on a parameter.
In particular, Li does not explicitly teach scaling one or more of the second frame or the portion of the second frame based at least in part on a parameter, wherein detecting the second object recognition information comprising one or more of the candidate object in the second frame, the second candidate bounding box associated with the candidate object, or the one or more features of the candidate object is based at least in part on the scaling.
However, Girshick does teach, as part of an object detection algorithm (Fast R-CNN), scaling an input frame based at least in part on a parameter (Section 5.2, single-scale implementation includes scaling input images based on $s$ and a maximum side length, both of which are parameters).
Girshick teaches that scaling is advantageous for object detectors for multiple reasons (Section 5.2).  In one example, scaling advantageously ensures that input images and the object detection neural networks that process them can fit within a GPU’s memory (Section 5.2, second and third paragraphs).  In another example, scaling so that input images have a single scale provides good object detection with lower performance cost than multi-scale object detection, and thereby advantageously “offers the best tradeoff between speed and accuracy” (Section 5.2, last two paragraphs).
Were Li modified to scale the frames input to its object detectors as taught by Girshick, then one or more of the second frame or the portion of the second frame would be scaled based at least in part on a parameter, and detecting the second object recognition information comprising one or more of the candidate object in the second frame, the second candidate bounding box associated with the candidate object, or the one or more features of the candidate object (i.e. the output, and/or further processing of the output, of the object detectors) would be based at least in part on the scaling (because the scaling operates on the inputs to the object detectors).

Therefore, it would have been obvious to one of ordinary skill in the art to combine the teachings of Li and Girshick in order to obtain the invention as specified in claim 11.


Claim(s) 16-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Li.
Regarding claim 16, Examiner notes that the claim recites an apparatus comprising a processor, memory coupled with the processor, and instructions stored in the memory and executable by the processor to cause the apparatus to: perform a method that is substantially the same as the method of claim 1.
Li teaches the method of claim 1 (see above).
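While the teachings of Li certainly imply the use of a computer (e.g. Section 4.1, Implementation details, GPU), Li is more focused on describing its method and accordingly does not explicitly teach details of the computer implementation of its method.  In particular, Li does not explicitly teach implementing its method as an apparatus comprising a processor, memory coupled with the processor, and instructions stored in the memory and executable by the processor to cause the apparatus to perform the method.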
However, Examiner takes Official Notice that it is old and well known in the art of image analysis to implement a method as an apparatus comprising a processor, memory coupled with the processor, and instructions stored in the memory and executable by the processor to cause the apparatus to: perform the method.  Such computer implementation advantageously allows the method to be performed quickly and efficiently.
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to implement the method of Li as an apparatus comprising a processor, memory coupled with the processor, and instructions stored in the memory and executable by the processor to cause the apparatus to: perform the method, in order to advantageously allow the method to be performed quickly and efficiently.
Therefore, it would have been obvious to one of ordinary skill in the art to modify the teachings of Li as set forth above to obtain the invention as specified in claim 16.

Regarding claim 17, Examiner notes that the claim recites limitations that are substantially the same as limitations included in claim 2.  Li teaches the method of claim 2.  Accordingly, claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Li for substantially the same reasons as claim 2.

Regarding claim 18, Examiner notes that the claim recites limitations that are substantially the same as limitations included in claim 3.  Li teaches the method of claim 3.  Accordingly, claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Li for substantially the same reasons as claim 3.

Regarding claim 19, Examiner notes that the claim recites limitations that are substantially the same as limitations included in claim 4.  Li teaches the method of claim 4.  Accordingly, claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Li for substantially the same reasons as claim 4.

Regarding claim 20, Examiner notes that the claim recites an apparatus comprising a set of means, each means being for performing a function that is substantially the same as a step of the method of claim 1, and a corresponding step performed by the apparatus of claim 16.
Each of the means of claim 20 invokes 35 U.S.C. 112(f) (see Claim Interpretation above).  The specification discloses that the corresponding structures of the means include multimedia manager 810 and/or one or more of its components ([0122] of the published application, US 2021/0192756 A1).  Multimedia manager 810 is a computer that includes components such as a memory (e.g. memory 830) and a processor (processor 840) that executes instructions stored in the memory in order to perform a detection or tracking method (e.g. [0128]).  Therefore, the scope of claim 20, as interpreted under 35 U.S.C. 112(f), includes an apparatus that is substantially the same as the apparatus of claim 16.  


Conclusion
The following prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
‘Cai’ (“Cascade R-CNN: Delving into High Quality Object Detection,” 2018)
Teaches details of the Cascade R-CNN object detector used by Li
Teaches that its object detection “Inference was performed on a single image scale” – Section 5.1
‘Ren’ (“Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” 2015)
Teaches details of the Faster R-CNN object detector that is alternatively used by Li
Citing to Girshick, teaches re-scaling images to a single scale before object detection – Page 5, Implementation Details
‘Sun’ (“PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume,” 2018)
Teaches details of the optical flow network used by Li
‘Wang’ (“Online Multiple Object Tracking Via Flow and Convolutional Features,” 2017)
Similar to Li, combines optical flow with detections to perform tracking
‘Azimi’ (“Towards Multi-class Object Detection in Unconstrained Remote Sensing Imagery,” 29 May 2019)
Teaches an example of a cascade neural network that enables object detection/recognition at various orientations – see e.g. Section 2.2
‘Lin’ (US 2020/0160060 A1)
Example of a tracking method that skips some frames, such that the frames that are processed may be noncontiguous

Any inquiry concerning this communication or earlier communications from the examiner should be directed to GEOFFREY E SUMMERS whose telephone number is (571)272-9915. The examiner can normally be reached Monday-Friday, 7:00 AM to 3:30 PM ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Chan Park, can be reached at (571) 272-7409. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format.





/GEOFFREY E SUMMERS/Examiner, Art Unit 2669