Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments, see pp. 11-12, filed 09/03/2021, with respect to 35 U.S.C. 101 have been fully considered and are persuasive. The amended portion of the claim modifies the machine learning model and is deemed a practical application. The rejections under 35 U.S.C. 101 have been withdrawn.
Applicant’s arguments with respect to 35 U.S.C. 103 have been considered but are moot because the arguments are directed to amended limitations that have not been previously examined.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(d):
(d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

The following is a quotation of pre-AIA  35 U.S.C. 112, fourth paragraph:
Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA  35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

Claims 6 and 8 are rejected under 35 U.S.C. 112(d) or pre-AIA  35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which it depends, or for failing to include all the limitations of the claim upon which it depends. Claims 6 and 8 are the dependent claims of claim 1 but they do not constitute .  Applicant may cancel the claim(s), amend the claim(s) to place the claim(s) in proper dependent form, rewrite the claim(s) in independent form, or present a sufficient showing that the dependent claim(s) complies with the statutory requirements.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


1.	Claims 1, 3-4, 6, 8, 15 and 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Butt et al. (US 2018/0157939 A1) in view of Mehrseresht (US 2019/0188866 A1) in view of Karpathy et al. (“Large-scale Video Classification with Convolutional Neural Networks”) further in view of Cao et al. (US 9471851 B1).

Regarding Claim 1
Butt teaches:
(Butt [Abstract] “The system comprises one or more processors and memory comprising computer program code stored on the memory and configured when executed by the one or more processors to cause the one or more processors to perform a method.”; [0114] “The video is captured by the camera 108 over a period of time.” [0072] “Each video capture device 108 includes at least one image sensor 116 for capturing a plurality of images.” [0092] “The video analytics module 224 receives image data and analyzes the image data to determine properties or characteristics of the captured image or video and/or of objects found in the scene represented by the image or video.”; “processors and memory” reads on “a computing system”, “a period of time” reads on “time interval”; “objects found in the scene” reads on “the environment comprises one or more objects”)
- generating, by the computing system, based at least in part on the sensor data, an input representation of the one or more objects, wherein the input representation comprises a temporal dimension and one or more spatial dimensions; (Butt [0138] “The Data 710 in Object Profile 702 and Object Profile 704 has, for example, content including time stamp, frame number, resolution in pixels by width and height of the scene, segmentation mask of this frame by width and height in pixels and stride by row width in bytes, classification (person, vehicle, other), confidence by percent of the classification, box (bounding box surrounding the profiled object) by width and height in normalized sensor coordinates, image width and height in pixels as well as image stride (row width in bytes), segmentation mask of image, orientation, and x & y coordinates of the image box.”; “time stamp” reads on “temporal dimension” and “width and height” reads on “spatial dimension”)
- determining, by the computing system, based at least in part on the input representation and a machine-learned model, at least one of one or more detected object classes of the one or more objects, one or more locations of the one or more objects over the one or more time intervals, or one or more predicted paths of the one or more objects; and (Butt [0009] “The implemented learning machine may be a second learning machine, and the identifying may be performed by a first learning machine implemented by the one or more processors.” [0145] “The temporal object classification module 912 may also maintains class (such as, for example, human, vehicle, or animal) information of an object over a period of time. … the temporal object classification module 912 determines the objects type based on its appearance in multiple frames. For example, gait analysis of the way a person walks can be useful to classify a person, or analysis of a person's legs can be useful to classify a cyclist. The temporal object classification module 912 may combine information regarding the trajectory of an object”; “object type” reads on “object classes” and “trajectory” reads on “the predicted paths”)
- generating, by the computing system, based at least in part on the input representation and the machine-learned model, output data comprising one or more bounding shapes corresponding to the one or more objects. (Butt [0106] “For example, the location metadata may be further used to generate a bounding box (such as, for example, when encoding video or playing back video) outlining the detected foreground visual object.”)
	Butt does not distinctly disclose
- wherein the machine-learned model aggregates the temporal information associated with the temporal dimension at the first convolution layer
	However, Mehrseresht teaches
- wherein the machine-learned model aggregates the temporal information associated with the temporal dimension at the first convolution layer (Mehrseresht [Fig. 9] [0108] “In one arrangement, an input 905 to the convolutional neural network is a segment of the sequence of spatial representations 125 containing spatial representation from 16 consecutive timestamps (16 frames). In the convolutional neural network 900 illustrated in FIG. 9, the convolution filters are c×3×3×3 tensors, where c is the number of channels in the previous layer. The convolutional neural network 900 has stride of (v.sub.1, v.sub.2, v.sub.3) indicating convolution with the stride of v.sub.3 over temporal dimension, and stride of v.sub.1 and v.sub.2 on the width and height of each frame of spatial representation (in a first convolution layer 910), or feature map of previous layers (in the subsequent convolution layers 911).”; Fig. 9 discloses the input of the convolutional layer and “timestamp” reads on “temporal dimension”)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the appearance search system of Butt with the tensor data structure of Mehrseresht in order to achieve fine-grain detection of motions and efficient and accurate computation. (Mehrseresht, [0131] “The arrangements described accordingly 
	The combination of Butt and Mehrseresht does not appear to distinctly disclose
- aggregates the temporal information associated with the temporal dimension over two or more convolution layers of the machine-learned model
	However, Karpathy teaches
- aggregates the temporal information associated with the temporal dimension over two or more convolution layers of the machine-learned model (Karpathy, [Figure 1] discloses late fusion and shows the video frames are aggregated over two convolutional layers. Video frames include temporal information.)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the appearance search system of Butt as modified by Mehrseresht with the tensor aggregation of Karpathy in order to achieve significant performance improvement in CNN training. (Karpathy, [Abstract] “We further study the generalization performance of our best model by retraining the top layers on the UCF101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).”)
The combination of Butt, Mehrseresht and Karpathy does not appear to distinctly disclose
- determining, by the computing system based at least in part on one or more fusion criteria, whether to aggregate temporal information associated with the temporal dimension at a first convolution layer of a plurality of convolution layers of a machine-learned model or to 
	However, Cao teaches
- determining, by the computing system based at least in part on one or more fusion criteria, whether to aggregate temporal information associated with the temporal dimension at a first convolution layer of a plurality of convolution layers of a machine-learned model or to aggregate the temporal information associated with the temporal dimension over two or more convolution layers of the plurality of convolution layers of the machine-learned model; (Cao [FIG. 2] [6:22-24] “The multimodal information fusion device 230 can selectively perform any of early fusion; late fusion; and filtered fusion.”; “early fusion” reads on “to aggregate temporal information associated with the temporal dimension at a first convolution layer of a plurality of convolution layers of a machine-learned model” and “late fusion” reads on “to aggregate the temporal information associated with the temporal dimension over two or more convolution layers of the plurality of convolution layers of the machine-learned model”; [FIG. 2] discloses how the fusion method is determined by the fusion criteria.)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the appearance search system of Butt, Mehrseresht and Karpathy with the fusion method decision of Cao in order to achieve effective combination of multimodal data. (Cao, [Abstract] “Thus, there is a need for a system that derives user gender using an effective multimodal combination of visual and non-visual cues.”)

Regarding Claim 3
Butt, Mehrseresht, Karpathy and Cao teach all of the limitations of claim 1 as cited above and Mehrseresht further teaches
- The computer-implemented method of claim 1, wherein the input representation comprises a tensor associated with a set of dimensions comprising the temporal dimension and the one or more spatial dimensions, the temporal dimension of the tensor associated with the one or more time intervals, and the one or more spatial dimensions of the tensor comprising a width dimension, a depth dimension, or a height dimension that is used as an input channel for the machine-learned model. (Mehrseresht [0108] “In one arrangement, an input 905 to the convolutional neural network is a segment of the sequence of spatial representations 125 containing spatial representation from 16 consecutive timestamps (16 frames). In the convolutional neural network 900 illustrated in FIG. 9, the convolution filters are c×3×3×3 tensors, where c is the number of channels in the previous layer. The convolutional neural network 900 has stride of (v.sub.1, v.sub.2, v.sub.3) indicating convolution with the stride of v.sub.3 over temporal dimension, and stride of v.sub.1 and v.sub.2 on the width and height of each frame of spatial representation (in a first convolution layer 910), or feature map of previous layers (in the subsequent convolution layers 911).”)
	Same motivation as claim 1.

Regarding Claim 4
Butt, Mehrseresht, Karpathy and Cao teach all of the limitations of claim 3 as cited above and Mehrseresht further teaches
(Mehrseresht [0107] “Convolutional layers apply a convolution operation to the input, passing the result to the next layer. The receptive field of convolution units are often small e.g., 3 by 3, and convolution units in the same layer have the same weights. Convolution units in the same layer having the same weights is commonly referred to as weight sharing. In other words, convolutional nodes in the same layer share weights. Units in fully connected layers have connections to all units in the previous layer.”)
	Same motivation as claim 1.

Regarding Claim 6
Butt as modified by Mehrseresht, Karpathy and Cao teaches all of the limitations of claim 1 as cited above and Mehrseresht further teaches:
- The computer-implemented method of claim 1, wherein the temporal information associated with the temporal dimension is aggregated at the first convolution layer of the plurality of convolution layers.
(Mehrseresht [Fig. 9] [0108] “In one arrangement, an input 905 to the convolutional neural network is a segment of the sequence of spatial representations 125 containing spatial representation from 16 consecutive timestamps (16 frames). In the convolutional neural network 900 illustrated in FIG. 9, the convolution filters are c×3×3×3 tensors, where c is the number of channels in the previous layer. The convolutional neural network 900 has stride of (v.sub.1, v.sub.2, v.sub.3) indicating convolution with the stride of v.sub.3 over temporal dimension, and stride of v.sub.1 and v.sub.2 on the width and height of each frame of spatial representation (in a first convolution layer 910), or feature map of previous layers (in the subsequent convolution layers 911).”; Fig. 9 discloses the input of the convolutional layer and “timestamp” reads on “temporal dimension”)
	Same motivation as claim 1.

Regarding Claim 8
Butt, Mehrseresht, Karpathy and Cao teach all of the limitations of claim 1 as cited above and Karpathy further teaches
- The computer-implemented method of claim 1, wherein the temporal information associated with the temporal dimension of the tensor is aggregated over two or more convolution layers of the plurality of convolution layers. (Karpathy, [Figure 1] discloses late fusion and shows the video frames are aggregated over two convolutional layers. Video frames include temporal information.) 
	Same motivation as claim 1.

Regarding Claim 15
Claim 15 is a tangible non-transitory computer-readable media claim corresponding to the method of claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 1.

Regarding Claim 18
Claim 18 is a computing device claim corresponding to the method of claim 1, and is directed to largely the same subject matter. Thus, it is rejected for the same reasons as given in the rejection of claim 1. Note that Butt teaches a computing device ([0007]).

Regarding Claim 19
Butt, Mehrseresht, Karpathy and Cao teach all of the limitations of claim 18 as cited above and Mehrseresht further teaches
- The computing device of claim 18, wherein the machine-learned model is based at least in part on one or more classification techniques comprising a convolutional neural network. (Mehrseresht [0114] “Training of the convolutional neural network adjusts the weights by minimizing the loss. Sigmoid cross-entropy (also called Softmax) loss is commonly used for classification.”)
	Same motivation as claim 18.


2.	Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Butt in view of Mehrseresht in view of Karpathy in view of Cao and further in view of Adams et al. (US 2019/0049242 A1 hereinafter Adams).
Regarding Claim 2
Butt, Mehrseresht, Karpathy and Cao teach all of the limitations of claim 1 as cited above but do not appear to distinctly disclose:
- The computer-implemented method of claim 1, further comprising: generating, by the computing system, based at least in part on the sensor data, a plurality of voxels corresponding to the environment comprising the one or more objects, wherein a height dimension of the plurality of voxels is used as an input channel of the input representation, and wherein the input representation is based at least in part on the plurality of voxels corresponding to one or more portions of the environment occupied by the one or more objects.
However, Adams teaches

- The computer-implemented method of claim 1, further comprising: generating, by the computing system, based at least in part on the sensor data, a plurality of voxels corresponding to the environment comprising the one or more objects, wherein a height dimension of the plurality of voxels is used as an input channel of the input representation, and wherein the input representation is based at least in part on the plurality of voxels corresponding to one or more portions of the environment occupied by the one or more objects. (Adams [0019-0020] “In some examples, various sensor data may be accumulated into a voxel space or array. Such a voxel space may be a three-dimensional representation comprising a plurality of voxels. As a non-limiting example, a voxel space may be a rectangular cuboid having a length, a width, and a height, comprising a plurality voxels, each having a similar shape. In some examples, the voxel space is representative of an environment such that an origin of the voxel space may be described by a position and/or orientation in an environment. Similarly, each voxel may be described by one or more of a position and orientation in an environment, or a coordinate relative to an origin of the voxel space”; “environment” reads on “object”)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the appearance search system of Butt, Mehrseresht, Karpathy and Cao with the voxel creation of Adams in order to achieve better representations of objects and efficient determination of overlaps between objects. (Adams, [0019])

3.	Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Butt in view of Mehrseresht in view of Karpathy in view of Cao and further in view of Banka et al. (US 2012/0197856 A1 hereinafter Banka) 
Regarding Claim 5
	Butt as modified by Mehrseresht, Karpathy and Cao teaches all of the limitations of claim 4 as cited above and Mehrseresht further teaches: 
- the plurality of convolution layers (Mehrseresht [Fig. 9] discloses the multiple convolutional layers)
Butt as modified by Mehrseresht, Karpathy and Cao does not appear to distinctly disclose:
- The computer-implemented method of claim 4, further comprising: aggregating, by the computing system, temporal information to the tensor subsequent to aggregating spatial information associated with the one or more spatial dimensions to the tensor, wherein the temporal information is aggregated as the input representation is processed by [the plurality of 
	However, Banka teaches:
- The computer-implemented method of claim 4, further comprising: aggregating, by the computing system, temporal information to the tensor subsequent to aggregating spatial information associated with the one or more spatial dimensions to the tensor, wherein the temporal information is aggregated as the input representation is processed by [the plurality of convolution layers], and wherein the temporal information is associated with the temporal dimension of the tensor. (Banka [0028] “In particular embodiments, an aggregator nodes 16 may aggregate sensor data using both spatial and temporal factors. An aggregator node 16 may collect data from one or more sensor nodes 12 based both the spacial proximity of sensor nodes 12 and on the time-series of the sensor data. In particular embodiments, complex sensor data with multidimensional and temporal characteristics may be aggregated using multilinear algebraic techniques (such as, for example, tensor decomposition) and aggregator node 16 may only transmit key coefficients to indexer nodes 26.”)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the appearance search system of Butt, Mehrseresht, Karpathy and Cao with the data aggregation of Banka in order to achieve efficient data processing in the CNN layer. (Banka, [0063] “At step 306, aggregator nodes 16 append metadata to the received sensor data for the purpose of providing efficient searches.”)

4.	Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Butt in view of Mehrseresht in view of Karpathy in view of Cao and further in view of Chen et al. (CN 105910827 A hereinafter Chen)

Regarding Claim 7
Butt as modified by Mehrseresht, Karpathy and Cao teaches all of the limitations of claim 6 as cited above but does not appear to distinctly disclose:
- The computer-implemented method of claim 6, wherein aggregating the temporal information comprises: reducing, by the computing system, the one or more time intervals of the temporal dimension to one time interval by performing a one-dimensional convolution on the temporal information associated with the temporal dimension.
	However, Chen teaches
- The computer-implemented method of claim 6, wherein aggregating the temporal information comprises: reducing, by the computing system, the one or more time intervals of the temporal dimension to one time interval by performing a one-dimensional convolution on the temporal information associated with the temporal dimension. (Chen [0015-0016] “Step 21. Convolution feature learning: Construct a convolution-pooling model, use a filter to perform convolution operations on the one-dimensional motor vibration signal, reduce the dimension of the feature map while ensuring the feature position is unchanged, and pull the pooled feature map into a one-dimensional vector as the final learning Fault characteristics”; Chen discloses one-dimensional convolution reduction)

	
5.	Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Butt in view of Mehrseresht in view of Karpathy in view of Cao and further in view of Liu et al. (US 2019/0130569 A1 hereinafter Liu)
Regarding Claim 9
Butt as modified by Mehrseresht, Karpathy and Cao teaches all of the limitations of claim 8 as cited above but does not distinctly disclose:
- The computer-implemented method of claim 8, wherein aggregating the temporal information comprises:  reducing, by the computing system, the one or more time intervals of the temporal dimension to one time interval by performing a two-dimensional convolution on the temporal information associated with the temporal dimension.
	However, Liu teaches
- The computer-implemented method of claim 8, wherein aggregating the temporal information comprises:  reducing, by the computing system, the one or more time intervals of the temporal dimension to one time interval by performing a two-dimensional convolution on the temporal information associated with the temporal dimension. (Liu [0063] “Each unit layer of the encoder network 604 may include a 2D convolution layer 614 with a set of 2D filters, batch normalization (BN) layer 616, rectified-linear unit (ReLU) activation layer 618, followed by a max-pooling layer (the pooling layer 620) for reduction of data dimensions. The unit layer may be repeated multiple times to achieve sufficient data compression.”; Liu discloses two-dimensional convolution layer for reduction of data dimensions.)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the appearance search system of Butt, Mehrseresht, Karpathy and Cao with the two-dimensional convolution of Liu in order to achieve data attenuation correction thereby achieving accurate qualitative and quantitative results. (Liu, [0052] “Accordingly, attenuation correction of data is generally necessary for accurate qualitative and quantitative measurements of radiolabeled molecule activity.”)

6.	Claims 10-11, 13-14 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Butt in view of Mehrseresht in view of Karpathy in view of Cao and further in view of Vallespi-Gonzalez et al. (US 2018/0348346 A1 hereinafter Vallespi)
Regarding Claim 10
The combination of Butt, Mehrseresht, Karpathy and Cao teaches all of the limitations of claim 1 as cited above but does not appear to distinctly disclose:
- The computing device of claim 18, wherein the machine-learned model is based at least in part on one or more classification techniques comprising a convolutional neural network.
	However, Vallespi teaches
- The computing device of claim 18, wherein the machine-learned model is based at least in part on one or more classification techniques comprising a convolutional neural network. (Vallespi [0118] “In FIG. 13, cell classification and segmentation graph 880 depicts a first group of cells 882 of LIDAR data, a second group of cells 884 of LIDAR data, and a third group of cells 886 of LIDAR data determined in an environment surrounding autonomous vehicle 888.”; cell classification and segmentation)
	Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the appearance search system of Butt, Mehrseresht, Karpathy and Cao with the vehicle controller of Vallespi in order to achieve an accurate classification and generate the appropriate motion plan. (Vallespi, [0004] “The ability to accurately and precisely detect and characterize objects of interest is fundamental to enabling the autonomous vehicle to generate an appropriate motion plan through its surrounding environment.”)

Regarding Claim 11
The combination of Butt, Mehrseresht, Karpathy and Cao teaches all of the limitations of claim 1 as cited above but does not appear to distinctly disclose:
- The computer-implemented method of claim 1, further comprising: determining, by the computing system, one or more travelled paths of the one or more objects based at least in part on one or more locations of the one or more objects over a sequence of the one or more time intervals comprising a last time interval associated with a current time and the one or more time intervals prior to the current time, wherein the one or more predicted paths of the one or more objects is based at least in part on the one or more travelled paths.
	However, Vallespi teaches
(Vallespi [0051-0053] “The sensor data can include information that describes the location (e.g., in three-dimensional space relative to the autonomous vehicle) of points that correspond to objects within the surrounding environment of the autonomous vehicle (e.g., at one or more times). In particular, in some implementations, the perception system can determine, for each object, state data that describes a current state of such object. As examples, the state data for each object can describe an estimate of the object's current location (also referred to as position); current speed; current heading (which may also be referred to together as velocity); current acceleration; current orientation; size/footprint (e.g., as represented by a bounding shape such as a bounding polygon or polyhedron); class of characterization (e.g., vehicle versus pedestrian versus bicycle versus other); yaw rate; and/or other state information. In some implementations, the perception system can determine state data for each object over a number of iterations. In particular, the perception system can update the state data for each object at each iteration. Thus, the perception system can detect and track objects (e.g., vehicles, bicycles, pedestrians, etc.) that are proximate to the autonomous vehicle over time, and thereby produce a presentation of the world around an autonomous vehicle along with its state (e.g., a presentation of the objects of interest within a scene at the current time along with the states of the objects). The prediction system can receive the state data from the perception system and predict one or more future locations and/or moving paths for each object based on such state data.”; “one or more times” implies the current time and prior times. Vallespi discloses the state data of objects, which includes location, time data and a predicted path based on the travelled path.)
Same motivation to combine Butt and Vallespi as claim 10.

Regarding Claim 13
The combination of Butt, Mehrseresht, Karpathy and Cao teaches all of the limitations of claim 1 as cited above but does not appear to distinctly disclose: 
- The computer-implemented method of claim 1, wherein the one or more sensor outputs comprise one or more three-dimensional points corresponding to a plurality of surfaces of the one or more objects detected by the one or more sensors.
However, Vallespi teaches
- The computer-implemented method of claim 1, wherein the one or more sensor outputs comprise one or more three-dimensional points corresponding to a plurality of surfaces of the one or more objects detected by the one or more sensors. (Vallespi [0032] “In some embodiments, LIDAR data includes a three-dimensional point cloud of LIDAR data points received from around the periphery of an autonomous vehicle.”)
Same motivation to combine Butt and Vallespi as claim 10.

Regarding Claim 14
Butt, Mehrseresht, Karpathy and Cao teach all of the limitations of claim 1 as cited above but do not appear to distinctly disclose:
- The computer-implemented method of claim 1, wherein the sensor data is associated with a birds eye view vantage point, the one or more sensors comprising one or more light detection and ranging devices (LIDAR), one or more cameras, one or more radar devices, one or more sonar devices, or one or more thermal sensors.
However, Vallespi teaches
- The computer-implemented method of claim 1, wherein the sensor data is associated with a birds eye view vantage point, the one or more sensors comprising one or more light detection and ranging devices (LIDAR), one or more cameras, one or more radar devices, one or more sonar devices, or one or more thermal sensors. (Vallespi [0035] “For example, a second representation of LIDAR data can correspond to a top-view representation. In contrast to the range-view representation of LIDAR data described above, a top-view representation can correspond to a representation of LIDAR data as viewed from a bird's eye or plan view relative to an autonomous vehicle and/or ground surface. A top-view representation of LIDAR data is generally from a vantage point that is substantially perpendicular to the vantage point of a range-view representation of the same data.”)
Same motivation to combine Butt and Vallespi as claim 10.

Regarding Claim 16
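The combination of Butt, Mehrseresht, Karpathy and Cao teaches all of the limitations of claim 15 as cited above but does not appear to distinctly disclose:
- The one or more tangible non-transitory computer-readable media of claim 15, further comprising: generating the machine-learned model based at least in part on training data comprising a plurality of training objects associated with a plurality of classified features and a plurality of classified object labels, the plurality of classified features based at least in part on point cloud data comprising a plurality of three-dimensional points associated with one or more physical characteristics of the plurality of training objects.
However, Vallespi teaches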
- The one or more tangible non-transitory computer-readable media of claim 15, further comprising: generating the machine-learned model based at least in part on training data comprising a plurality of training objects associated with a plurality of classified features and a plurality of classified object labels, the plurality of classified features based at least in part on point cloud data comprising a plurality of three-dimensional points associated with one or more physical characteristics of the plurality of training objects. (Vallespi [0150] “The detector training dataset 992 can further include a second portion of data corresponding to labels identifying corresponding objects detected within each portion of input sensor data. In some implementations, the labels can include at least a bounding shape corresponding to each detected object of interest. In some implementations, the labels can additionally include a classification for each object of interest from a predetermined set of objects including one or more of a pedestrian, a vehicle, or a bicycle.” [0063] “In some embodiments, LIDAR data 12 can include a three-dimensional point cloud of LIDAR data points received from around the periphery of an autonomous vehicle.”)


7.	Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Butt in view of Mehrseresht in view of Karpathy in view of Cao in view of Vallespi further in view of Li et al. (WO 2010042068 A1 hereinafter Li).
Regarding Claim 12
The combination of Butt, Mehrseresht, Karpathy, Cao and Vallespi teaches all of the limitations of claim 11 as cited above but does not appear to distinctly disclose:
- The computer-implemented method of claim 11, further comprising: detecting, by the computing system, an object of the one or more objects that is at least partly occluded; and determining, by the computing system, based at least in part on the one or more travelled paths of the one or more objects, when the object of the one or more objects that is at least partly occluded was previously detected.
	However, Li teaches
- The computer-implemented method of claim 11, further comprising: detecting, by the computing system, an object of the one or more objects that is at least partly occluded; and determining, by the computing system, based at least in part on the one or more travelled paths of the one or more objects, when the object of the one or more objects that is at least partly occluded was previously detected. (Li [pp 19:14-25] “In step 108 (Figure 1), the position of the object in the current frame is estimated based on motion history in the preceding frame, particularly in instances where the person is occluded. Let X-Z plane be the ground plane, where X-direction is aligned with the image plane and Z-direction is aligned with optical axis of a camera. The probability of being occluded can be estimated according to the object's position on the ground plane. Figures 7a-7c show possible events for two human objects HA and HB. In Figure 7a, human objects HA and HB are completely visible to a camera 730(?). In Figure 7b, human object HB is partially occluded by human object HA who is nearer to the camera 730 (i.e. ZA < ZB), while in Figure 7c, human object HB is completed occluded by human object HA. ... Further, if a person is occluded by more than one person, the persons occluding him are classified into two groups, i.e. on the left and right sides. The maximum probabilities for both groups are selected and the final probability value is the sum of the two maximum values.”; Li discloses determining occlusion based on an estimated probability, and the object labeling implies that the object was previously detected.)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the appearance search system of Butt, Mehrseresht, Karpathy, Cao and Vallespi with the occlusion detection of Li in order to achieve accurate detection of objects. (Li, [pp 15:2-4] “On the other hand, the stereo-based detection according to the example embodiment accurately detects the objects as shown in 308, 318 and 328.”)
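For illustration only, the occlusion estimate described in the quoted Li passage (occluders nearer to the camera, grouped into left and right sides, with the two group maxima summed) can be sketched as follows. This is a minimal sketch under stated assumptions, not Li's method: the per-occluder probability is assumed here to be the fraction of the target's horizontal extent covered by a nearer object, the sum is clamped to 1.0 as an added assumption, and the names `GroundObject`, `overlap_fraction`, and `occlusion_probability` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GroundObject:
    name: str
    x_min: float  # left edge along the X-direction (image plane)
    x_max: float  # right edge along the X-direction
    z: float      # depth along the Z-direction (optical axis)


def overlap_fraction(target, other):
    """Assumed per-occluder probability: share of the target's width hidden by a nearer object."""
    if other.z >= target.z:
        return 0.0  # only objects nearer to the camera can occlude the target
    width = target.x_max - target.x_min
    if width <= 0:
        return 0.0
    covered = min(target.x_max, other.x_max) - max(target.x_min, other.x_min)
    return max(0.0, covered) / width


def occlusion_probability(target, others):
    """Sum of the maximum left-side and maximum right-side occluder probabilities."""
    centre = (target.x_min + target.x_max) / 2
    left, right = [], []
    for other in others:
        other_centre = (other.x_min + other.x_max) / 2
        (left if other_centre < centre else right).append(overlap_fraction(target, other))
    # Clamping to 1.0 is an assumption so the result stays a valid probability.
    return min(1.0, max(left, default=0.0) + max(right, default=0.0))


ha = GroundObject("HA", 0.5, 1.5, z=3.0)
hb = GroundObject("HB", 1.0, 2.0, z=5.0)
hc = GroundObject("HC", 1.75, 2.5, z=4.0)
p_hb = occlusion_probability(hb, [ha, hc])
```

In this example HA is nearer than HB (ZA < ZB) and covers the left half of HB, while HC covers a quarter of HB from the right, so the two group maxima are added.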

8.	Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Butt in view of Mehrseresht in view of Karpathy in view of Cao in view of Vallespi further in view of Weston et al. (US 2017/0193390 A1 hereinafter Weston).
Regarding Claim 17
The combination of Butt, Mehrseresht, Karpathy, Cao and Vallespi teaches all of the limitations of claim 16 as cited above and Vallespi further teaches:
- The one or more tangible non-transitory computer-readable media of claim 16, further comprising:
- determining, for each of the plurality of predefined portions of the training environment, a score associated with a probability of the predefined portion of the plurality of predefined portions being associated with one of the plurality of classified object labels; and (Vallespi [0041] “In some examples, the classification for each cell can include a probability score associated with each classification indicating the likelihood that such cell includes one or more particular classes of objects of interest.”)
The combination of Butt, Mehrseresht, Karpathy, Cao and Vallespi does not appear to distinctly disclose:
- training the machine-learned model using the training data comprising a plurality of predefined portions of a training environment, wherein each of the plurality of predefined portions of the training environment is associated with at least one of a plurality of negative training samples or at least one of a plurality of positive training samples associated with a corresponding ground truth sample; and  
- ranking the plurality of negative training samples based at least in part on the score for the respective one of the plurality of predefined portions of the training environment, wherein a weighting of a filter of the machine-learned model is based at least in part on a predetermined portion of the plurality of the negative samples associated with the lowest scores.
	However, Weston teaches
- training the machine-learned model using the training data comprising a plurality of predefined portions of a training environment, wherein each of the plurality of predefined portions of the training environment is associated with at least one of a plurality of negative training samples or at least one of a plurality of positive training samples associated with a corresponding ground truth sample; and (Weston [0058] “Thus, the deep-learning model, may be trained so that each of the entities of the second set of entities (i.e., negative samples) has a lower similarity score than the target entity (i.e., a positive sample). In other words, the deep-learning model may be trained so that each of the entities of the second set of entities is ranked lower than the target entity.”)
- ranking the plurality of negative training samples based at least in part on the score for the respective one of the plurality of predefined portions of the training environment, wherein a weighting of a filter of the machine-learned model is based at least in part on a predetermined portion of the plurality of the negative samples associated with the lowest scores. (Weston [0058] “In particular embodiments, social-networking system 160 may assign rankings to each of the target entity and the second set of entities, and the one or more weights of the deep-learning model may be updated further based on the rankings. … In other words, the deep-learning model may be trained so that each of the entities of the second set of entities is ranked lower than the target entity. Social-networking system 160 may determine that one or more of the entities of the second set of entities are ranked above or have higher similarity scores than the target entity (i.e., have corresponding embeddings that are more proximate to the user embedding in the embedding space), and social-networking system 160 may update vector representations of one or more entities of the second set of entities.”)
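For illustration only, the ranking step recited in the limitation above (ordering negative training samples by score and taking a predetermined portion with the lowest scores) can be sketched as follows. This is a minimal sketch and not the method of Weston or of the application; the function name `lowest_scoring_negatives`, the `fraction` parameter, and the cell identifiers are hypothetical, and the sketch does not model how the selected samples would affect any filter weighting.

```python
def lowest_scoring_negatives(scores, fraction=0.25):
    """Rank negative samples by score (ascending) and keep the lowest-scoring portion.

    `scores` maps a hypothetical portion identifier to its score.
    `fraction` is the assumed predetermined portion to keep.
    """
    ranked = sorted(scores, key=scores.get)          # lowest score first
    keep = max(1, int(len(ranked) * fraction))       # keep at least one sample
    return ranked[:keep]


negative_scores = {"cell_a": 0.9, "cell_b": 0.1, "cell_c": 0.4, "cell_d": 0.05}
selected = lowest_scoring_negatives(negative_scores, fraction=0.5)
```

With four negative samples and a portion of one half, the two lowest-scoring cells are returned in ascending order of score.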


9.	Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over Butt in view of Mehrseresht in view of Karpathy in view of Cao and further in view of Boettger et al. (“Measuring the Accuracy of Object Detectors and Trackers” hereinafter Boettger).
Regarding Claim 20
The combination of Butt, Mehrseresht, Karpathy and Cao teaches all of the limitations of claim 18 as cited above but does not distinctly disclose:
The computing device of claim 18, further comprising: 
- determining, based at least in part on the input representation and the machine-learned model, an amount of overlap between the one or more bounding shapes; and 
- responsive to the amount of overlap between the one or more bounding shapes satisfying one or more overlap criteria, determining that the object of the one or more objects associated with the one or more bounding shapes that satisfies the one or more overlap criteria is the same object over the one or more time intervals.
However, Boettger teaches
- determining, based at least in part on the input representation and the machine-learned model, an amount of overlap between the one or more bounding shapes; and (Boettger [Abstract] “we introduce the relative Intersection over Union (rIoU) accuracy measure. The measure normalizes the IoU with the optimal box for the segmentation to generate an accuracy measure that ranges between 0 and 1 and allows a more precise measurement of accuracies.”; “measure that ranges between 0 and 1” reads on “amount of overlap”)
- responsive to the amount of overlap between the one or more bounding shapes satisfying one or more overlap criteria, determining that the object of the one or more objects associated with the one or more bounding shapes that satisfies the one or more overlap criteria is the same object over the one or more time intervals. (Boettger [Figure 4]; Boettger discloses the change in IoU value over time. It would have been obvious that overlap criteria can be set and that the determination can be made based on those criteria.)
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art to combine the appearance search system of Butt, Mehrseresht, Karpathy and Cao with the relative intersection over union of Boettger in order to achieve more accurate detection and tracking of objects. (Boettger, [Abstract] “The measure normalizes the IoU with the optimal box for the segmentation to generate an accuracy measure that ranges between 0 and 1 and allows a more precise measurement of accuracies”)
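For illustration only, an overlap measure between two bounding shapes and an overlap criterion of the kind discussed above can be sketched as follows. This is a minimal sketch using plain Intersection over Union on axis-aligned boxes, not the relative IoU (rIoU) of Boettger; the function names `iou` and `is_same_object`, the `(x1, y1, x2, y2)` box format, and the `threshold` value are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_same_object(box_prev, box_curr, threshold=0.5):
    """Assumed overlap criterion: boxes from consecutive time intervals match if IoU >= threshold."""
    return iou(box_prev, box_curr) >= threshold


box_t0 = (0.0, 0.0, 2.0, 2.0)
box_t1 = (1.0, 0.0, 3.0, 2.0)
overlap = iou(box_t0, box_t1)  # intersection 2, union 6
```

The resulting value lies between 0 and 1, so a single threshold can serve as the overlap criterion for deciding whether boxes at successive time intervals belong to the same object.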

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SUNG WON LEE whose telephone number is 571-272-8508.  The examiner can normally be reached on Mon-Fri 0730-1730.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ALEXEY SHMATOV, can be reached on 571-270-3428.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access 

	

/SUNG W LEE/
Examiner, Art Unit 2129
/MICHAEL J HUNTLEY/
Supervisory Patent Examiner, Art Unit 2129