DETAILED ACTION
Response to Amendment
Claims 1-18 and 20-21 are pending. Claims 1-18 and 20 are amended directly or by dependency on an amended claim. Claim 19 is canceled. Claim 21 is new.
Response to Arguments
Applicant’s arguments with respect to claims 1-3 and 6-19 have been considered but are moot because the new ground of rejection does not rely on the current combination of references applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. In particular, Huang et al., previously applied to reject claims 11 and 12, is now incorporated into the rejection of the independent claims.
Claim 19 is now canceled, so the rejection under 35 U.S.C. 101 is withdrawn.
Claim 21 is allowed because it incorporates the previously indicated allowable subject matter into independent claim form.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 6, 8, 9, 10, 11, 12, 13, 16, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Roh et al. (US 20180165551 A1) in view of Yao et al. (US 20190073553 A1) and further in view of Huang et al. (US 20190004533 A1).

Regarding claims 1 and 20, Roh et al. disclose a method for encoding objects in a camera-captured image with a deep neural network pipeline (figure: media_image1.png), the method comprising, and non-transitory computer readable medium including instructions that when executed by a process are configured to: identifying at least a portion of the camera-captured image (camera, [0021], image data 204 may be embodied as RGB image data that may include multiple objects, [0028], image data 204 is input to the convolution layer 208a, [0037]); applying a first convolutional neural network to the at least the portion of the camera-captured image at a first stage (convolution layers 208a through 208m, [0039], 208a in Fig. 8); pooling, at a second stage, a plurality of subregion representations from an output of the first convolutional neural network for the first stage (the convolution map data output from the convolution layer 208g is processed by a pooling layer 802, [0039], 802 in Fig. 8); performing, at a fourth stage, at least one deconvolution from the output of the first stage or the output of the second stage (the convolution data output from the convolution layer 208m is processed by a deconvolution layer 804, [0039], 804 in Fig. 8); concatenating, at a fifth stage, the output of the fourth stage and the output of the third stage (concatenated data generated by the concatenation layer 210, [0039], 210 in Fig. 8); applying a second convolutional neural network to the output of the fifth stage (concatenated data generated by the concatenation layer 210 is input to a 1×1 convolution layer 806 for dimension reduction and another rectified linear activation, [0039], 806 in Fig. 8); and classifying the at least the portion of the camera-captured image as an object category in response to an output of the second convolutional neural network (multi-scale object classifier, [0039]).
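
For illustration only, the resizing and concatenation that Roh et al. describe at [0039] (pooling layer 802, deconvolution layer 804, concatenation layer 210, and 1×1 convolution layer 806) can be traced as shape bookkeeping. The sketch below is not taken from the reference: the feature-map sizes, the channel counts, the factor-of-two resampling, the inclusion of the layer 208j output among the concatenated maps, and all function names are assumptions chosen only so that the maps align at the size of the layer 208j output.

```python
from typing import Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def pool(shape: Shape, factor: int) -> Shape:
    """Pooling shrinks height and width by `factor`; channels are unchanged."""
    h, w, c = shape
    return (h // factor, w // factor, c)


def deconv(shape: Shape, factor: int) -> Shape:
    """Deconvolution (upsampling) grows height and width by `factor`."""
    h, w, c = shape
    return (h * factor, w * factor, c)


def concat(*shapes: Shape) -> Shape:
    """Channel-wise concatenation; every map must share height and width."""
    if len({s[:2] for s in shapes}) != 1:
        raise ValueError("feature maps must have the same height and width")
    h, w = shapes[0][:2]
    return (h, w, sum(s[2] for s in shapes))


def conv1x1(shape: Shape, out_channels: int) -> Shape:
    """A 1x1 convolution keeps the spatial size and changes the channel count."""
    h, w, _ = shape
    return (h, w, out_channels)


# Hypothetical output shapes for layers 208g, 208j, and 208m (assumed values).
out_208g = (128, 128, 256)
out_208j = (64, 64, 512)
out_208m = (32, 32, 512)

pooled = pool(out_208g, 2)        # stands in for pooling layer 802
upsampled = deconv(out_208m, 2)   # stands in for deconvolution layer 804
hyper = concat(pooled, out_208j, upsampled)  # stands in for concatenation layer 210
reduced = conv1x1(hyper, 512)     # stands in for 1x1 convolution layer 806
```

Under these assumed sizes, the pooled and upsampled maps both land at 64×64, which is what allows the channel-wise concatenation to succeed; mismatched sizes raise an error in `concat`.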

Roh et al. do not disclose performing, at a third stage, at least one convolution of an output of the second stage (i.e., there is no additional convolution after the pooling shown in 802). Roh et al. also do not explicitly disclose an image corresponding to a surrounding of a vehicle or an object category from a predetermined list of road objects.

Yao et al. teach identifying at least a portion of the camera-captured image (extracts features using at least four parts, input image 112, [0021]); applying a first convolutional neural network to the at least the portion of the camera-captured image at a first stage (Conv 1, “120” Fig. 1, first convolutional layer 120, [0021]); pooling, at a second stage, a plurality of subregion representations from an output of the first convolutional neural network for the first stage (Max pooling, “126” Fig. 1, pooling 126 is used to pool the maximum value pixels and thereby reduce the resolution, [0024]); performing, at a third stage, at least one convolution of an output of the second stage (“130” follows “126” in Fig. 1, convolutional kernels 130, [0025], after the pooling, the subsequent layers are a convolution 162, [0032]); performing, at a fourth stage, at least one deconvolution from the output of the first stage or the output of the second stage (“128” in Fig. 1, Deconvolution 128 is applied to the high-level feature map, [0024]); concatenating, at a fifth stage, the output of the fourth stage and the output of the third stage (“sequential concatenation” “104” in Fig. 1, feature maps are concatenated, [0024]); applying a second convolutional neural network to the output of the fifth stage (“142” in Fig. 1, subsequent layers are a 3×3×4 convolution 142, [0029]); and classifying the at least the portion of the camera-captured image as an object category in response to an output of the second convolutional neural network (classification, rider, horse, person, Fig. 1, “The final image 114 has bounding boxes for each of three detected objects. A first bounding box 172 is proposed around a first detected object which in this case is a horse. The object may be classified as a horse, a quadruped, an animal or in some other suitable class, depending on the nature of the system and its intended use. The size of the bounding box and the location offset around the object are determined as appropriate for a horse or other class if the object is in another class. The second bounding box 174 is around a second detected object which in this case is a horse rider (i.e. a person). The third bounding box 176 is around a third object which in this case is a standing person”, [0035]). See Fig. 1:
[Figure: Yao et al., Fig. 1 (media_image2.png)]

Roh et al. and Yao et al. are in the same art of using convolutional neural networks for object detection (Roh et al., abstract; Yao et al., abstract). The combination of Yao et al. with Roh et al. enables the use of a convolution after the pooling. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the order of operations of Yao et al. with the invention of Roh et al. because this order of operations was known at the time of filing, because the combination would have predictable results, and because Yao et al. indicate “High accuracy and high speed generic object detection is described using a novel HyperNet technology… the technique provides impressively good feature discrimination” ([0014]) and “With this significant improvement in precision and speed object detection may be performed on small devices in real time as video frames are received. This allows for many new applications” ([0017]). These passages indicate the improvement in object discrimination that can be expected when this technology is combined with the object detection neural network architecture described by Roh et al.

Roh et al. and Yao et al. do not explicitly disclose an image corresponding to a surrounding of a vehicle or an object category from a predetermined list of road objects.

Huang et al. teach in the same art of CNNs with deconvolutional layers and concatenation and pooling ([0081], [0083], [0087]), an image corresponding to a surrounding of a vehicle (the system receives a first image captured by a first camera, the first image capturing a portion of a driving environment of the ADV, [0028], Cameras 211 may include one or more devices to capture images of the environment surrounding the autonomous vehicle. Cameras 211 may be still cameras and/or video cameras. A camera may be mechanically movable, for example, by mounting the camera on a rotating and/or tilting a platform, [0035]) and an object category from a predetermined list of road objects (perception can include the lane configuration (e.g., straight or curve lanes), traffic light signals, a relative position of another vehicle, a pedestrian, a building, crosswalk, or other traffic related signs (e.g., stop signs, yield signs), etc., for example, in a form of an object, [0051], Perception module 302 may include a computer vision system or functionalities of a computer vision system to process and analyze images captured by one or more cameras in order to identify objects and/or features in the environment of autonomous vehicle. The objects can include traffic signals, road way boundaries, other vehicles, pedestrians, and/or obstacles, etc., [0052]).

Roh et al., Yao et al., and Huang et al. are in the same art of using convolutional neural networks for object detection (Roh et al., abstract; Yao et al., abstract; Huang et al., abstract). The combination of Huang et al. with Roh et al. and Yao et al. enables the application to vehicle navigation. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the technology application of Huang et al. with the invention of Roh et al. and Yao et al. because this application was known at the time of filing, because the combination would have predictable results, and because autonomous driving is one of the most relevant applications for neural networks given the commercial interest in this area. Collision avoidance through neural network obstacle detection would greatly improve the safety, and therefore the commercial viability, of the system described by Roh et al. and Yao et al.

Regarding claim 6, Roh et al. and Yao et al. and Huang et al. disclose the method of claim 1. Roh et al. and Yao et al. further teach pooling the plurality of subregion representations comprises: calculating a large image block at a first level of coarseness; and calculating a small image block at a second level of coarseness (Roh et al., multi-scale, Each ROI pooling layer has a different output size, and each FC layer may be trained for an object scale based on the output size of the associated ROI pooling layer, abstract, The computing device 100 may input the convolution data and the region proposals into a multi-scale object classifier, which includes classifiers trained for multiple proposed region sizes. The computing device 100 may analyze data generated by different levels of abstraction in the multi-layer convolution network. By generating region proposals and classifying objects at multiple scales, the computing device 100 may boost object detection accuracy for the same computational cost required for previous approaches. 
Additionally, by using multi-scale classifiers, the computing device 100 may prevent duplication of pooled features for small objects as compared to scale-dependent pooling, [0015], Each ROI pooling layer 218 has a different output size, [0027],  Each of the ROI pooling layers 218 has different output dimensions, [0033], Since each layer must have the same feature dimension to concatenate, sizes (e.g., width and height) of features may be adjusted before constructing a hyper-feature layer, [0034] Example 29 includes the subject matter of any of Examples 18-28, and wherein executing the multi-scale object classifier further comprises selecting the first region of interest pooling layer based on a proposed object size of the first region proposal and the output size of the first region of interest pooling layer; and executing the first region of interest pooling layer comprises executing the region of interest pooling layer in response to selecting the first region of interest pooling layer, [0070]; There are a variety of different techniques to conform the sizes of the feature maps. As shown in this example, pooling 126 is used to pool the maximum value pixels and thereby reduce the resolution. This may be a 2×2 max pooling, for example. Deconvolution 128 is applied to the high-level feature map. In this case a 4× deconvolution is run on the feature maps from the fifth layer. This increases the resolution to that of the middle level or third layer feature map. The feature maps may be upscaled or downscaled in any of a variety of different ways. This approach allows for higher efficiency and discrimination after the layers are concatenated 130, [0024], At 210, the object detection HyperNet is fine-tuned, [0043], At 408 the feature maps are reshaped to a single size. The size includes the width and height. In the same or another process, the depth may also be modified to be the same for all of the feature maps. 
For larger feature maps, max pooling or another approach may be used to reduce the width and height of the feature map. For smaller feature maps, deconvolution may be used to increase the width and height of the feature map. The depth may be modified using convolution among other approaches, [0065]).
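
As a minimal sketch of pooling at two levels of coarseness (the limitation addressed for claim 6), the following pools one feature map once with a single large block and once with smaller blocks. The 4×4 values, the block sizes, and the function name `block_max_pool` are hypothetical and do not come from Roh et al. or Yao et al.; max pooling is used here only because Yao et al. mention it at [0024].

```python
from typing import List

Grid = List[List[float]]


def block_max_pool(grid: Grid, block: int) -> Grid:
    """Max-pool a 2-D grid over non-overlapping block x block subregions."""
    rows, cols = len(grid), len(grid[0])
    if rows % block or cols % block:
        raise ValueError("grid dimensions must be multiples of the block size")
    return [
        [
            max(grid[r + dr][c + dc] for dr in range(block) for dc in range(block))
            for c in range(0, cols, block)
        ]
        for r in range(0, rows, block)
    ]


# Hypothetical 4x4 feature map (values chosen only for illustration).
feature_map = [
    [1, 2, 0, 1],
    [3, 4, 1, 0],
    [0, 1, 5, 6],
    [2, 0, 7, 8],
]

coarse = block_max_pool(feature_map, 4)  # one large block covering the whole map
fine = block_max_pool(feature_map, 2)    # four smaller blocks
```

Here `coarse` is `[[8]]` and `fine` is `[[4, 1], [2, 8]]`: the same map summarized once by a single large block and once by four small blocks.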

Regarding claim 8, Roh et al. and Yao et al. and Huang et al. disclose the method of claim 1. Roh et al. and Yao et al. further teach training the second convolutional neural network using the output of the fifth stage and a ground truth data set (Roh et al., Results on PASCAL VOC 2007 test set, [0016], multi-scale object classifier 216 may further include a trainable selection network to select the ROI pooling layer, [0027], each pipe of the multi-scale RPN 212 may be trained independently—at different receptive field sizes—to produce more accurate and relevant region proposals according to visual characteristics such as size and textureness of target objects, [0031]; Yao et al., layers of the complete region proposal HyperNet are then fine-tuned, [0042], training and testing the region proposal HyperNet, ground truth class label, [0046], [0048], embodiments use a Hyper Feature which combines the feature maps from multiple convolutional layers of a pre-trained CNN model to represent image or region content, [0050]) [known to use the output of the 5th stage as the concatenated output is used to create the final labeled map as seen in Fig. 1]

Regarding claim 9, Roh et al. and Yao et al. and Huang et al. disclose the method of claim 8. Roh et al. and Yao et al. further teach the ground truth data set includes a plurality of predetermined object categories (Roh et al., low-level features from an earlier part of the convolution network 206 may include structures or boundaries of objects, while high-level features from a later part of the convolution network 206 may be capable of generating robust and abstracted object categories, [0034]; Yao et al., 20 different object categories, [0033], object may be classified as a horse, a quadruped, an animal or in some other suitable class, [0035]).

Regarding claim 10, Roh et al. and Yao et al. and Huang et al. disclose the method of claim 1. Yao et al. further teach sending the object category to a vehicle system (autonomous vehicles, [0054]).

Regarding claim 11, Roh et al. and Yao et al. and Huang et al. disclose the method of claim 10, including an autonomous vehicle (Yao et al., [0054]). Huang et al. further teach the vehicle system provides navigation in response to the object category (“Note that decision module 304 and planning module 305 may be integrated as an integrated module. Decision module 304/planning module 305 may include a navigation system or functionalities of a navigation system to determine a driving path for the autonomous vehicle. For example, the navigation system may determine a series of speeds and directional headings to effect movement of the autonomous vehicle along a path that substantially avoids perceived obstacles while generally advancing the autonomous vehicle along a roadway-based path leading to an ultimate destination. The destination may be set according to user inputs via user interface system 113. The navigation system may update the driving path dynamically while the autonomous vehicle is in operation. The navigation system can incorporate data from a GPS system and one or more maps so as to determine the driving path for the autonomous vehicle”, [0058]).

Regarding claim 12, Roh et al. and Yao et al. and Huang et al. disclose the method of claim 10, including an autonomous vehicle (Yao et al., [0054]). Huang et al. further teach the vehicle system provides assisted or autonomous driving in response to the object category (“Decision module 304/planning module 305 may further include a collision avoidance system or functionalities of a collision avoidance system to identify, evaluate, and avoid or otherwise negotiate potential obstacles in the environment of the autonomous vehicle. For example, the collision avoidance system may effect changes in the navigation of the autonomous vehicle by operating one or more subsystems in control system 111 to undertake swerving maneuvers, turning maneuvers, braking maneuvers, etc. The collision avoidance system may automatically determine feasible obstacle avoidance maneuvers on the basis of surrounding traffic patterns, road conditions, etc. The collision avoidance system may be configured such that a swerving maneuver is not undertaken when other sensor systems detect vehicles, construction barriers, etc. in the region adjacent the autonomous vehicle that would be swerved into. The collision avoidance system may automatically select the maneuver that is both available and maximizes safety of occupants of the autonomous vehicle. The collision avoidance system may select an avoidance maneuver predicted to cause the least amount of acceleration in a passenger cabin of the autonomous vehicle”, [0059]).

Regarding claim 13, Roh et al. and Yao et al. and Huang et al. disclose the method of claim 1. Roh et al. and Yao et al. further teach upsampling the output of the fifth stage to match a resolution of the camera-captured image (Roh et al., deconvolution (or upsampling) layer inserted after the 13th convolution layer 208, so that all of the features have the same dimensions, [0034], the convolution map data output from the convolution layer 208g is processed by a pooling layer 802 and the convolution data output from the convolution layer 208m is processed by a deconvolution layer 804 to generate data with the same size as the convolution map data generated by the convolution layer 208j, [0039]; Yao et al., Deconvolution 128 is applied to the high-level feature map. In this case a 4× deconvolution is run on the feature maps from the fifth layer. This increases the resolution to that of the middle level or third layer feature map. The feature maps may be upscaled or downscaled in any of a variety of different ways. This approach allows for higher efficiency and discrimination after the layers are concatenated 130, [0024]).

Regarding claim 16, Roh et al. and Yao et al. and Huang et al. disclose the method of claim 1. Roh et al. and Yao et al. further teach expanding dimensions of a filter from the fifth stage; and performing a final stage convolution on the output of the fifth stage using the expanded dimensions of the filter from the fifth stage (Roh et al., Each ROI pooling layer 218 has a different output size and each ROI pooling layer 218 is associated with a corresponding FC layer 220. Each FC layer 220 may be trained for an object scale based on the output size of the corresponding ROI pooling layer 218, [0027], [0033], may include a pooling layer inserted after the seventh convolution layer 208 and a deconvolution (or upsampling) layer inserted after the 13th convolution layer 208, so that all of the features have the same dimensions, [0034], Each RPN layer 214 has a different receptive field size (i.e., kernel size). For example, the RPN layers 214a, 214b, 214c may have receptive field sizes of one pixel square, three pixels square, and five pixels square, respectively. The output of each RPN layer 214 is input to a 1×1 convolution layer 502 to generate a classification layer and a 1×1 convolution layer 504 to generate a regression layer, [0036], a first fully connected layer trained for image dimensions smaller than 128 pixels and a second fully connected layer trained for image dimensions greater than or equal to 128 pixels, [0051], to concatenate the plurality of convolution maps further comprises to resize the plurality of convolution maps, [0057], [0074]; Yao et al., each proposed image region is resized to some fixed size in order to run the related CNN features and object classifiers, [0016], An input image 112 is first resized to some standard scale. In this example, the image is downsized to 1000×600 pixels. 
However, the image may instead be upsized or downsized to any other suitable size and aspect ratio, depending on the camera, the system, and the intended use of the object detection. The resized image is then provided to the network in which the first operation is directly initialized as the first N convolutional layers of a pre-trained CNN model., [0021] In this example, the sizes of the feature maps from the 1.sup.st, 3.sup.rd and 5.sup.th convolutional layers have different dimensions, or different numbers of pixels. They have a high, middle, and low pixel scale. In this example, the first, low level convolutional map has a high pixel density or scale at 1000×600×64. This is the same as the input image at 1000×600 with an additional 64 channels for the z-coordinate from the convolution. The middle level convolutional map has a lower resolution at 250×150×256. This represents a downscaled image and an increased number of channels. The high level convolution has a still lower resolution and still more channels at 62×37×512, [0023], feature maps may be upscaled or downscaled, [0024], 13×13 RoI Pooling layer operates Max Pooling over each RoI with an adaptive cell size, [0029], At 408 the feature maps are reshaped to a single size. The size includes the width and height. In the same or another process, the depth may also be modified to be the same for all of the feature maps. For larger feature maps, max pooling or another approach may be used to reduce the width and height of the feature map. For smaller feature maps, deconvolution may be used to increase the width and height of the feature map [0065]).

Regarding claim 17, Roh et al. and Yao et al. and Huang et al. disclose the method of claim 1. Roh et al. and Yao et al. further teach the deep neural network pipeline includes a plurality of paths including: a first prong from the first stage through the fourth stage for low level features and shallow layers; and a second prong from the first stage through the second stage, the third stage, and the fourth stage for upsampling (see different pathways in Roh et al., Fig. 8, Yao et al., Fig. 1, also Roh et al., “In some embodiments, in block 420 the computing device 100 may concatenate multiple levels of abstraction in the convolution map that is input to the ROI pooling layers 218. Including multiple levels of abstraction in the convolution map may improve accuracy (i.e., semantics) and/or localization. For example, low-level features from an earlier part of the convolution network 206 may include structures or boundaries of objects, while high-level features from a later part of the convolution network 206 may be capable of generating robust and abstracted object categories”, [0034]; Yao et al., These layers have a fine, medium, and coarse resolution as shown. This corresponds to a low, middle, and high level semantical meaning. In the feature maps, fine resolution (from a shallow layer) corresponds to low level semantical meaning, but coarse resolution (from a deep layer) has a more high level semantical meaning, [0021], combined fine-tuned HyperNet 106, 108 trained at 208, 210 is able to jointly handle region proposal generation and object detection, [0044]).

Claims 2, 3, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Roh et al. (US 20180165551 A1) and Yao et al. (US 20190073553 A1) and Huang et al. (US 20190004533 A1) as applied to claim 1, further in view of Li et al. (US 20170243053 A1).

Regarding claim 2, Roh et al. and Yao et al. and Huang et al. disclose the method of claim 1. Roh et al. and Yao et al. further indicate a first deconvolution from the output of the first stage (Roh et al., “804” Fig. 8; Yao et al., “128”, Fig. 1), but do not disclose the at least one deconvolution from the output of the first stage or the output of the second stage further comprises: a first deconvolution from the output of the first stage and a second deconvolution from the output of the second stage.

Li et al. teach a first deconvolution from the output of the first stage and a second deconvolution from the output of the second stage (The convolutional network VGG-16 242 may be applied to perform feature extraction (e.g. to identify probable facial and non-facial regions). As the VGG-16 242 convolutional network operates, it generates intermediate data including a series of pooling layers. The intermedia data may be processed by the associated deconvolutional networks FCN-8s 243 and DeconvNet 244 (discussed below) to enable the creation of a much more accurate and finely grained probability map, [0038], “Zero padding may be used for each deconvolution so that the size of each activation layer is aligned with the output of the previous pooling layer of the VGG16 242 convolution. Also, the FCN-8s 243 relies upon the last pooling layer (e.g. the one preceding the immediate deconvolution during the convolution process) as the coarsest prediction to preserve spatial information in the resulting image. The process is repeated and fused with the output of pooling layers 4 and 3 from the VGG-16 242 convolutional network. Finally, the fused prediction is upsampled to the same resolution as the RGB camera input image”, [0040], final output of DeconvNet 244 and FCN-8s 243 are concatenated, [0041]).

Roh et al. and Yao et al. and Li et al. are in the same art of using convolutional neural networks for object detection (Roh et al., abstract; Yao et al., abstract; Li et al., abstract). The combination of Li et al. with Roh et al. and Yao et al. and Huang et al. enables the use of two separate deconvolutions. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the deconvolutions of Li et al. with the invention of Roh et al. and Yao et al. and Huang et al. as this was known at the time of filing, the combination would have predictable results, and as Li et al. indicate “The present system extends the state of the art technology to apply well-trained convolutional neural networks to provide real-time facial tracking, segmentation, and performance capture with incredible accuracy, while dealing effectively with difficult occlusions” ([0020]), indicating that when the two deconvolutions of Li et al. are combined with the invention of Roh et al. and Yao et al. and Huang et al., this will increase the accuracy of the network described by Roh et al. and Yao et al. and Huang et al. when dealing with occlusions. Occlusions are a regular feature when dealing with normal input pictures and videos from real scenes, and therefore a desirable target for improvement when dealing with real data.

Regarding claim 3, Roh et al. and Yao et al. and Huang et al. and Li et al. disclose the method of claim 2. Li et al. further indicate concatenating an output of the second deconvolution with the output of the first stage to provide a concatenated input for the first deconvolution (final output of DeconvNet 244 and FCN-8s 243 are concatenated, [0041]).

Regarding claim 14, Roh et al. and Yao et al. and Huang et al. disclose the method of claim 1. Roh et al. and Yao et al. do not disclose inserting padding values in between at least one row or at least one column in the output of the first stage or the output of the second stage, wherein the padding values and the output of the first stage or the output of the second stage are applied to the at least one deconvolution.
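
For orientation only, one common way that padding values end up between rows and columns ahead of a deconvolution is the zero-insertion step of a strided transposed convolution, in which zeros are placed between neighboring input elements before an ordinary convolution is applied. The sketch below shows just that insertion step. It is an illustrative assumption, not the applicant's implementation and not a teaching of any cited reference; the function name `insert_padding`, the stride of 2, and the 2×2 input are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def insert_padding(grid: Grid, stride: int, pad_value: float = 0) -> Grid:
    """Place (stride - 1) padding rows and columns between neighboring elements."""
    rows, cols = len(grid), len(grid[0])
    out_rows = (rows - 1) * stride + 1
    out_cols = (cols - 1) * stride + 1
    out = [[pad_value] * out_cols for _ in range(out_rows)]
    for r in range(rows):
        for c in range(cols):
            # Original values keep their relative order but are spread apart.
            out[r * stride][c * stride] = grid[r][c]
    return out


# Hypothetical 2x2 stage output (values chosen only for illustration).
stage_output = [[1, 2], [3, 4]]
padded = insert_padding(stage_output, stride=2)
```

With these inputs, `padded` is `[[1, 0, 2], [0, 0, 0], [3, 0, 4]]`; a convolution applied to this padded grid would then produce the upsampled result.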

Li et al. teach, “The FCN-8s 243 operates substantially simultaneously on the same 128×128 probability map, but its default output size is incorrect for concatenation with the results of the DeconvNet 244 operations. Zero padding may be used for each deconvolution so that the size of each activation layer is aligned with the output of the previous pooling layer of the VGG16 242 convolution. Also, the FCN-8s 243 relies upon the last pooling layer (e.g. the one preceding the immediate deconvolution during the convolution process) as the coarsest prediction to preserve spatial information in the resulting image. The process is repeated and fused with the output of pooling layers 4 and 3 from the VGG-16 242 convolutional network. Finally, the fused prediction is upsampled to the same resolution as the RGB camera input image” ([0040]).

Roh et al. and Yao et al. and Li et al. are in the same art of using convolutional neural networks for object detection (Roh et al., abstract; Yao et al., abstract; Li et al., abstract). The combination of Li et al. with Roh et al. and Yao et al. and Huang et al. enables the use of padding. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the padding of Li et al. with the invention of Roh et al. and Yao et al. and Huang et al. as this was known at the time of filing, the combination would have predictable results, as zero padding is the most common way to ensure the sizes of the layers can align, and as Li et al. indicate “The present system extends the state of the art technology to apply well-trained convolutional neural networks to provide real-time facial tracking, segmentation, and performance capture with incredible accuracy, while dealing effectively with difficult occlusions” ([0020]), indicating when the padding of Li et al. is combined with the invention of Roh et al. and Yao et al. and Huang et al., this will increase the accuracy of the network described by Roh et al. and Yao et al. and Huang et al. when dealing with occlusions.

Claims 7 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Roh et al. (US 20180165551 A1) and Yao et al. (US 20190073553 A1) and Huang et al. (US 20190004533 A1) as applied to claim 6, further in view of Zhu et al. (US 20190205758 A1).

Regarding claim 7, Roh et al. and Yao et al. and Huang et al. disclose the method of claim 6. Roh et al. and Yao et al. and Huang et al. do not disclose the plurality of subregion representations comprises a pyramid of blocks having varying objects or varying detail levels.

Zhu et al. teach the plurality of subregion representations comprises a pyramid of blocks having varying objects or varying detail levels (“DeepLab: Contrary to FCN which has a stride of 32 at the last convolutional layer, DeepLab produces denser feature maps by removing the downsampling operator in the last two max pooling layers and applying Atrous convolution in the subsequent convolutional layers to enlarge the receptive field of view. As a result, DeepLab has the following several benefits: (1) max pooling which consecutively reduces the feature resolution and spatial information is avoided; (2) the dense prediction map simplifies the upsampling scheme; (3) Atrous spatial pyramid pooling employed at the end of the network allows to explore multi-scale context information in parallel. A deeper network is beneficial to learn high-level features but comes at the cost of losing spatial information. Therefore, the Deeplab model with Atrous convolution is well-suited to meet the purpose of the model of the present embodiment”, [0033]).
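
The atrous convolution mentioned in the passage quoted from Zhu et al. can be illustrated, under simplifying assumptions, in one dimension: the kernel taps are spaced `rate` samples apart, so the same three-tap kernel covers a wider receptive field as the rate grows, and running several rates in parallel gives multi-scale context. This is a generic sketch, not the DeepLab implementation; the signal, the kernel, the rates, and the function name `atrous_conv1d` are hypothetical.

```python
from typing import Dict, List


def atrous_conv1d(signal: List[float], kernel: List[float], rate: int) -> List[float]:
    """Valid (no padding) 1-D convolution with kernel taps spaced `rate` apart."""
    span = (len(kernel) - 1) * rate  # distance between first and last tap
    return [
        sum(k * signal[i + j * rate] for j, k in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


# Hypothetical input and kernel (values chosen only for illustration).
signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]

# Parallel branches at different rates, in the spirit of pyramid pooling.
pyramid: Dict[int, List[float]] = {
    rate: atrous_conv1d(signal, kernel, rate) for rate in (1, 2)
}
```

At rate 1 each output sums three adjacent samples (a receptive field of 3), while at rate 2 each output sums every other sample across five positions (a receptive field of 5) without any additional kernel weights.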

Roh et al. and Yao et al. and Zhu et al. are in the same art of using convolutional neural networks for object detection (Roh et al., abstract; Yao et al., abstract; Zhu et al., [0003]). The combination of Zhu et al. with Roh et al. and Yao et al. and Huang et al. enables the use of a pyramid/pyramidal pooling. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the pyramidal pooling of Zhu et al. with the invention of Roh et al. and Yao et al. and Huang et al. because this was known at the time of filing; because the combination would have predictable results; and because Zhu et al. indicate, “DeepLab has the following several benefits: (1) max pooling which consecutively reduces the feature resolution and spatial information is avoided; (2) the dense prediction map simplifies the upsampling scheme; (3) Atrous spatial pyramid pooling employed at the end of the network allows to explore multi-scale context information in parallel” ([0033]), indicating that when the pyramid of Zhu et al. is combined with the network of Roh et al. and Yao et al. and Huang et al., it is expected that this will optimally deal with multi-scale context information in a relatively computationally efficient manner.

Regarding claim 18, Roh et al. and Yao et al. and Huang et al. disclose the method of claim 17. Roh et al. and Yao et al. and Huang et al. do not disclose that the plurality of paths includes: a third prong from the first stage through the second stage, the third stage, and the fifth stage for pyramidal pooling.

Zhu et al. teach a third prong from the first stage through the second stage, the third stage, and the fifth stage for pyramidal pooling (“DeepLab: Contrary to FCN which has a stride of 32 at the last convolutional layer, DeepLab produces denser feature maps by removing the downsampling operator in the last two max pooling layers and applying Atrous convolution in the subsequent convolutional layers to enlarge the receptive field of view. As a result, DeepLab has the following several benefits: (1) max pooling which consecutively reduces the feature resolution and spatial information is avoided; (2) the dense prediction map simplifies the upsampling scheme; (3) Atrous spatial pyramid pooling employed at the end of the network allows to explore multi-scale context information in parallel. A deeper network is beneficial to learn high-level features but comes at the cost of losing spatial information. Therefore, the Deeplab model with Atrous convolution is well-suited to meet the purpose of the model of the present embodiment”, [0033]).

Roh et al. and Yao et al. and Zhu et al. are in the same art of using convolutional neural networks for object detection (Roh et al., abstract; Yao et al., abstract; Zhu et al., [0003]). The combination of Zhu et al. with Roh et al. and Yao et al. and Huang et al. enables the use of a pyramid/pyramidal pooling. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the pyramidal pooling of Zhu et al. with the invention of Roh et al. and Yao et al. and Huang et al. because this was known at the time of filing; because the combination would have predictable results; and because Zhu et al. indicate, “DeepLab has the following several benefits: (1) max pooling which consecutively reduces the feature resolution and spatial information is avoided; (2) the dense prediction map simplifies the upsampling scheme; (3) Atrous spatial pyramid pooling employed at the end of the network allows to explore multi-scale context information in parallel” ([0033]), indicating that when the pyramid of Zhu et al. is combined with the network of Roh et al. and Yao et al. and Huang et al., it is expected that this will optimally deal with multi-scale context information in a relatively computationally efficient manner.



Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Roh et al. (US 20180165551 A1) and Yao et al. (US 20190073553 A1) and Huang et al. (US 20190004533 A1) and Li et al. (US 20170243053 A1) as applied to claim 14, further in view of Drozdova et al. (US 20180268250 A1).

Regarding claim 15, Roh et al. and Yao et al. and Huang et al. and Li et al. disclose the method of claim 14. Roh et al. and Yao et al. and Huang et al. and Li et al. do not explicitly disclose that the performing the at least one convolution of an output of the second stage further comprises: performing, at the third stage, a first third stage convolution including a set of weights; performing, at the third stage, a second third stage convolution from an output of the first third stage convolution and initialized using the set of weights from the first third stage convolution and defined before the second third stage convolution is performed.

Drozdova et al., in the same art of CNNs with deconvolutional layers and concatenation and pooling ([0032], [0033]), teach performing, at the third stage, a first third stage convolution including a set of weights; performing, at the third stage, a second third stage convolution from an output of the first third stage convolution and initialized using the set of weights from the first third stage convolution and defined before the second third stage convolution is performed (“When the convolutional neural network includes multiple convolution layers, the convolution controller 210 may implement these convolution layers to apply weights having different granularities. For instance, the weights that are applied at one convolution layer may be configured to detect more complex, less abstract, and/or less granular features than the weights applied at a subsequent convolution layer. To further illustrate, the weights that are applied at one convolution layer may be configured to detect edges (e.g., horizontal lines, vertical lines, diagonal lines) in an image while the weights that are applied at a subsequent convolution layer may be configured to detect shapes (e.g., triangles, rectangles, circles), which may be formed from the edges detected in the previous convolution layer”, [0026]).

Roh et al. and Yao et al. and Drozdova et al. are in the same art of using convolutional neural networks for object detection (Roh et al., abstract; Yao et al., abstract; Drozdova et al., [0033]). The combination of Drozdova et al. with Roh et al. and Yao et al. and Huang et al. enables the weight initialization. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the weights of Drozdova et al. with the invention of Roh et al. and Yao et al. and Huang et al. because this was known at the time of filing; because the combination would have predictable results; and because Drozdova et al. indicate, “When the convolutional neural network includes multiple convolution layers, the convolution controller 210 may implement these convolution layers to apply weights having different granularities. For instance, the weights that are applied at one convolution layer may be configured to detect more complex, less abstract, and/or less granular features than the weights applied at a subsequent convolution layer. To further illustrate, the weights that are applied at one convolution layer may be configured to detect edges (e.g., horizontal lines, vertical lines, diagonal lines) in an image while the weights that are applied at a subsequent convolution layer may be configured to detect shapes (e.g., triangles, rectangles, circles), which may be formed from the edges detected in the previous convolution layer” ([0026]), indicating that the weights described by Drozdova et al. can allow for more customized detections by the system when combined with the invention of Roh et al. and Yao et al. and Huang et al.


Allowable Subject Matter
Claim 4 (and, by further dependency, claim 5) is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. Johansson et al. (US 20190375261 A1) indicate that, in a convolutional neural network, a further layer, e.g., a max-pooling layer, which causes downscaling of the data, i.e., reduction of the output images of the convolution layer, is inserted between two convolution layers, and that “If, instead of a classification, a segmentation of the image is supposed to be performed in which the output of the CNN is in turn an image, the MLP part can be replaced by layers that upscale the downscaled image again, e.g., deconvolutional layer” ([0017]); however, instead of describing a third deconvolution, this simply indicates that the downscaled image is returned to its original size. US 10303980 B1 indicates multiple deconvolutions, but these are in directly consecutive order (figure: media_image3.png). Similarly, US 20190205758 A1 shows a third deconvolution (figure: media_image4.png), but not with the right placement in the sequence.
Claim 21 is allowed.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M ENTEZARI HAUSMANN whose telephone number is (571)270-5084. The examiner can normally be reached 10-7 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, VINCENT M RUDOLPH can be reached on (571)272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MICHELLE M ENTEZARI/Primary Examiner, Art Unit 2661