DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 2022-07-29 has been entered.  The status of the claims is as follows:
Claims 1-3, 5-10, 12-17, and 19-21 are pending in the application.
Claims 1, 8, and 15 are amended.
Claim 21 is new.
Response to Arguments
Applicant’s arguments with respect to rejections under 35 USC 101 have been fully considered and are persuasive.  The newly amended matter added by Applicant provides details on a method of training a machine learning model, which according to MPEP 2106.04(a)(1)(vii) is not a mental process.
Applicant’s arguments with respect to rejections under 35 USC 103 have been considered but are moot because the newly amended matter necessitates a new ground of rejection, and the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5, 8-10, 12, 15-17, 19, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao et al. (“ICNet for Real-Time Semantic Segmentation on High-Resolution Images”; hereinafter “Zhao”) in view of Takahashi et al. (“A Novel Weight-Shared Multi-Stage Network Architecture of CNNs for Scale Invariance”; hereinafter “Takahashi”).
As per Claim 1, Zhao teaches a computer-implemented method comprising: (Zhao, Page 1 Abstract, discloses a GPU processor:  “Our system yields realtime inference on a single GPU card with decent quality results evaluated on challenging Cityscapes dataset.”)
receiving, by a processor, an original input; (Zhao, Page 4, Section 4.1 “Low Resolution”, discloses:  “The input full-resolution image of scale 1 produces two lower-resolution images after downsampling with scales of 1/2 and 1/4.”  Here, Zhao discloses receiving an original input (“input full-resolution image”)).
downsampling, by the processor, the original input into a downscaled input, the downscaled input comprising a lower resolution than the original input; (Zhao, Page 4, Section 4.1 “Low Resolution”, discloses:  “The input full-resolution image of scale 1 produces two lower-resolution images after downsampling with scales of 1/2 and 1/4.”)
For the following limitations, see Zhao Page 4 Figure 4, annotated by Examiner below:

[Figure: Zhao, Page 4, Figure 4, annotated by Examiner (media_image1.png)]

inputting the downscaled input comprising the lower resolution to a first convolutional neural network (CNN);  running, by the processor, the first CNN on the downscaled input (Zhao, as shown in Figure 4 above, discloses on Page 4 Section 4.1 “Low Resolution”:  “For the lowest resolution input, it goes through the top branch, which is an FCN-based PSPNet architecture. Since the input size is only 1/4 of the original one, convolution layers correspondingly downsize the feature maps by a ratio of 1/8 and yield 1/32 of the original spatial size. Then several dilated convolution layers are used to enlarge the receptive fields without downsampling the spatial size, outputting feature maps with sizes 1/32 of original ones.”)
inputting the original input comprising a higher resolution than the downscaled input to the second CNN; running, by the processor, a second CNN on the original input having the higher resolution, wherein the second CNN has fewer layers than the first CNN; (Zhao, as shown in Figure 4 above, discloses on Page 4 Section 4.1 “High Resolution”:  “For the high-resolution image, similar to the operation in the second branch, it is processed by several convolutional layers with downsampling rate 8. A 1/8 size feature map is resulted. Since the median-resolution image already restore most semantically meaningful details that are missing in the low-resolution one, we can safely limit the number of convolutional layers when processing the high-resolution input. Here we use only three convolutional layers each with kernel size 3 x 3 and stride 2 to downsize the resolution to 1/8 of the original input.”)
merging, by the processor, the output of the first CNN with the output of the second CNN (Zhao, as shown in Figure 4 above, discloses on Page 4 Section 4.1 “High Resolution”:  “Similarly with the fusion as described in branch two, we use the CFF unit to incorporate the output of previous CFF unit and current feature map in full resolution in branch three.”)
and providing a result, by the processor, following the merging of the outputs, wherein the first CNN is associated with a first scale and the second CNN is associated with a second scale (Zhao, as shown in Figure 4 above, discloses on Page 2 Para 2:  “Our contribution is to utilize processing efficiency of low-resolution images and high inference quality of high-resolution ones and propose an image cascade framework to progressively refine segment prediction.”  Here, Zhao discloses providing a result (“segment prediction”) based on the two different scales of CNN shown above).
wherein, during training, branches of the first CNN and the second CNN are merged such that the first CNN learns from the second scale of the second CNN and the second CNN learns from the first scale of the first CNN, thereby allowing the first CNN and the second CNN to learn differences between the first and second scales (Zhao, as shown in Figure 4 above, discloses on Page 6 Section 5.1:  “To train ICNet, we append softmax cross entropy loss in each branch denoted as L1, L2 and L3 with corresponding weights λ1, λ2, and λ3. The total loss is L. The overall loss function is
L = λ1L1 + λ2L2 + λ3L3  (1)
The framework is trained to minimize the above loss function. All the losses we adopted are the cross-entropy loss on the corresponding downsampled score maps.”  Here, Zhao discloses that the branches of the CNNs are merged and jointly trained.)
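As a minimal sketch of the joint training objective quoted above (Zhao's equation (1)), the total loss can be computed as a weighted sum of per-branch softmax cross-entropy losses. The function names, the toy per-pixel class scores, and the weight values below are illustrative assumptions; they are not taken from Zhao or from the instant application.

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross-entropy of one sample: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def total_loss(branch_logits, targets, weights):
    """L = sum over branches k of lambda_k * L_k."""
    return sum(
        lam * softmax_cross_entropy(logits, t)
        for lam, logits, t in zip(weights, branch_logits, targets)
    )

# Hypothetical class scores for one pixel in each of the three branches.
branch_logits = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2], [3.0, 0.1, 0.1]]
targets = [0, 0, 0]          # ground-truth class index per branch
lambdas = [0.4, 0.4, 1.0]    # assumed weights, for illustration only
loss = total_loss(branch_logits, targets, lambdas)
```

Minimizing the single scalar `loss` updates all branches in the same optimization step.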
	However, Zhao does not explicitly teach wherein the merging is performed as a groupwise merger.
	Takahashi teaches wherein the merging is performed as a groupwise merger (Instant Specification describes “groupwise merger” as “concatenates the features of multiple networks, and if needed, subsequently applies a 1 x 1 convolution to fuse the features”.  This is taught by Takahashi Page 4 Figure 5:

[Figure: Takahashi, Page 4, Figure 5 (media_image2.png)]

Takahashi Page 4 Section B discloses:  “Before the global pooling layer, the integration layer is given the concatenated feature map … In addition, we evaluated a 1 x 1 convolution layer as the integration layer called 1 x 1 conv”).
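The concatenate-then-fuse operation described above can be sketched as follows. This is only an illustration under stated assumptions: feature maps are plain nested lists of shape [channels][height][width], both branches are assumed to already share the same spatial size, and the helper names and weight values are hypothetical rather than drawn from Takahashi or the instant specification.

```python
def concat_channels(*feature_maps):
    """Concatenate feature maps of shape [C][H][W] along the channel axis."""
    merged = []
    for fm in feature_maps:
        merged.extend(fm)
    return merged

def conv1x1(feature_map, weights):
    """Apply a 1 x 1 convolution: each output channel is a per-pixel
    weighted sum of the input channels. weights has shape [C_out][C_in]."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [
        [
            [
                sum(w * feature_map[c][y][x] for c, w in enumerate(row))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for row in weights
    ]

# Two hypothetical branch outputs with the same 2 x 2 spatial size.
branch_a = [[[1.0, 2.0], [3.0, 4.0]]]                             # 1 channel
branch_b = [[[0.5, 0.5], [0.5, 0.5]], [[1.0, 0.0], [0.0, 1.0]]]   # 2 channels

concatenated = concat_channels(branch_a, branch_b)                # 3 channels
fusion_weights = [[1.0, 2.0, -1.0]]                               # assumed values
fused = conv1x1(concatenated, fusion_weights)                     # 1 channel
```

In a trained network the fusion weights would be learned; here they are fixed only to make the arithmetic easy to follow.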
	Takahashi and Zhao are analogous art because they are both in the field of endeavor of fusing multiscale CNNs for image analysis (see Takahashi Page 3 Figure 4, for which the caption states:  “A simple weight-shared multi-stage network (WSMS-Net) consisting of three stages. The input image is resized to half and quarter for the second stage and the third stage.”)
	It would have been obvious before the effective filing date of the claimed invention to combine the concatenation-and-1x1-convolution type of fusion of Takahashi with the multiscale CNN with fusion of Zhao.  One of ordinary skill in the art would be motivated to do so in order to maximize accuracy by learning integrated features of the multiple scales together (Takahashi, Page 4 Section C:  “The integration layer is an extra layer placed just after the concatenation layer that concatenates all the feature maps from all the stages, and integrates all the feature maps before the classification”), thus achieving better results (Takahashi, Page 6 “Classification Results”:  “In conclusion, regardless of the difference in the structure between ResNet and DenseNet, the experimental results demonstrate that the increase in the depth has a limitation in improvement of classification accuracy and the WSMS-Net with 1x1 conv integration layer achieved a better accuracy than the original networks.”)

	As per Claim 2, the combination of Zhao and Takahashi teaches the computer-implemented method of claim 1. Zhao teaches wherein the original input comprises image data representing an image (Zhao, Page 4, Section 4.1 “Low Resolution”, discloses:  “The input full-resolution image of scale 1 produces two lower-resolution images after downsampling with scales of 1/2 and 1/4.”)

As per Claim 3, the combination of Zhao and Takahashi teaches the computer-implemented method of claim 1. Zhao further teaches providing the output of the first CNN as an input to the second CNN (Zhao, as shown in Figure 4 above, discloses on Page 4 Section 4.1 “Median Resolution”,  “To fusion the 1/16 size feature map with the 1/32 size feature map in the top branch, we propose a cascade feature fusion (CFF) unit that will be discussed later in this paper,” and in “High Resolution”, “Similarly with the fusion as described in branch two, we use the CFF unit to incorporate the output of previous CFF unit and current feature map in full resolution in branch three.”  Thus, the output of the first CNN, which processes the lower-resolution image, is provided as an input to the second CNN, which processes the higher-resolution image.)

As per Claim 5, the combination of Zhao and Takahashi teaches the computer-implemented method of claim 1. Zhao teaches wherein the result is an identification of an object (Zhao, Page 2 Top, discloses:  “In this paper, we focus on building a practically fast semantic segmentation system with decent prediction accuracy”.  Here, Zhao discloses “semantic segmentation”, which identifies distinct objects in an image, as shown in the result of Zhao Figure 4, where one can see distinct objects, shown below)
 
[Figure: segmentation result from Zhao, Figure 4 (media_image3.png)]


As per Claim 8, this is a system claim corresponding to method Claim 1.  The difference is that it recites a memory and a processor.  Zhao, Page 5 Section 4.2, discloses memory:  “Therefore even with more than 50 layers, the inference operation and memory consumption are not large as 18ms and 0.6GB.”  Zhao, Page 1 Abstract, discloses a GPU processor:  “Our system yields realtime inference on a single GPU card with decent quality results evaluated on challenging Cityscapes dataset.”  Claim 8 is rejected for the same reasons as Claim 1.

As per Claims 9, 10, and 12, these are system claims corresponding to method Claims 2, 3, and 5, respectively.  The difference is that they recite a memory.  Zhao, Page 5 Section 4.2, discloses memory:  “Therefore even with more than 50 layers, the inference operation and memory consumption are not large as 18ms and 0.6GB.”  Claims 9, 10, and 12 are rejected for the same reasons as Claims 2, 3, and 5, respectively.

As per Claims 15-17 and 19, these are computer program product claims corresponding to method Claims 1-3 and 5, respectively.  The difference is that they recite a computer readable storage medium.  Zhao, Page 5 Section 4.2, discloses memory:  “Therefore even with more than 50 layers, the inference operation and memory consumption are not large as 18ms and 0.6GB.”  Claims 15-17 and 19 are rejected for the same reasons as Claims 1-3 and 5, respectively.

As per Claim 21, the combination of Zhao and Takahashi teaches the computer-implemented method of claim 1. Zhao teaches wherein the downsampling of the original input into the downscaled input comprises downsampling the original input by a factor of 2 (Zhao, Page 4, Section 4.1 “Low Resolution”, discloses:  “The input full-resolution image of scale 1 produces two lower-resolution images after downsampling with scales of 1/2 and 1/4.”  Here, Zhao’s downsampling of the original input to the downscaled input comprises downsampling by a factor of 2, twice.)
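To illustrate the arithmetic, the following sketch downsamples a toy image by a factor of 2 and then by a factor of 2 again, yielding the 1/2 and 1/4 scales named in the quoted passage. The 2 x 2 averaging filter, the function name, and the 4 x 4 sample image are assumptions made for illustration; the quoted passage does not specify how Zhao performs the downsampling.

```python
def downsample_by_2(image):
    """Halve height and width by averaging each 2 x 2 block.
    Assumes a 2D list with even height and width."""
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
            for x in range(0, len(image[0]), 2)
        ]
        for y in range(0, len(image), 2)
    ]

# Hypothetical 4 x 4 single-channel "original input" (scale 1).
original = [[float(4 * r + c) for c in range(4)] for r in range(4)]
half = downsample_by_2(original)     # scale 1/2: 2 x 2
quarter = downsample_by_2(half)      # scale 1/4: 1 x 1
```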

Claims 6 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao in view of Takahashi, further in view of Lee et al. (“Multi-Level and Multi-Scale Feature Aggregation Using Pretrained Convolutional Neural Networks for Music Auto-Tagging”; hereinafter “Lee”).
As per Claim 6, the combination of Zhao and Takahashi teaches the computer-implemented method of claim 1.  However, the combination of Zhao and Takahashi does not explicitly teach wherein the input comprises audio data presenting an audio input.
Lee teaches wherein the input comprises audio data presenting an audio input (Lee, Page 1209 Section A, discloses:  “In the first step, we perform supervised feature learning with a set of CNNs. We choose the segment sizes such that the hidden layers capture multi-level audio features within one to several musical beats for different beats per minute (BPM).”  See also Lee, Page 1209 Figure 1, shown below).

[Figure: Lee, Page 1209, Figure 1 (media_image4.png)]


	Lee and the combination of Zhao and Takahashi are analogous art because they are both in the field of endeavor of merging data from multiscale CNNs to analyze data.
	It would have been obvious before the effective filing date of the claimed invention to combine the fusion of multiscale CNNs of Zhao and Takahashi, with the audio data of Lee.  One of ordinary skill in the art would be motivated to do so because analyzing audio data at multiple scales can increase the accuracy of classifying the audio data (Lee, Page 1209 End of Section I:  “Our experiments show how different combinations of multi-layer and multi-time-scale features improve the accuracy and also how the architecture outperforms the previous state-of-the-art architectures.”)

As per Claim 13, this is a system claim corresponding to method Claim 6.  The difference is that it recites a memory.  Zhao, Page 5 Section 4.2, discloses memory:  “Therefore even with more than 50 layers, the inference operation and memory consumption are not large as 18ms and 0.6GB.”  Claim 13 is rejected for the same reasons as Claim 6.
	
Claims 7, 14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhao in view of Takahashi, further in view of Li et al. (“Multiscale convolutional neural network for the detection of built-up areas in high-resolution SAR image”; hereinafter “Li”).
As per Claim 7, the combination of Zhao and Takahashi teaches the computer-implemented method of claim 1.  Zhao suggests downsampling feature maps on Page 3 Right Column “Downsampling Feature” (“Besides directly downsampling the input image, another straightforward choice is to scale down the feature map by a large ratio in the inference process”).  However, Zhao performs the same “downsampling rate 8” to all CNNs as shown on Page 4 Section 4.1 (“downsize the feature maps by a ratio of 1/8”, “downsampling rate 8”, “downsampling rate 8”).  Therefore, the combination of Zhao and Takahashi does not explicitly teach wherein the second CNN has a smaller feature map than the first CNN.
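The feature map sizes implied by the quoted passages can be checked with a short calculation. This is a sketch under stated assumptions: the input scales (1/4, 1/2, 1) and the common downsampling rate of 8 come from the Zhao passages quoted in this action, while the function name and the dictionary layout are hypothetical.

```python
from fractions import Fraction

def feature_map_scale(input_scale, downsampling_rate):
    """Spatial size of a branch's feature map relative to the original image."""
    return Fraction(input_scale) / downsampling_rate

# Input scale of each branch, per the quoted passages.
branches = {
    "low resolution": Fraction(1, 4),
    "median resolution": Fraction(1, 2),
    "high resolution": Fraction(1, 1),
}
# The same downsampling rate of 8 is applied in every branch.
scales = {name: feature_map_scale(s, 8) for name, s in branches.items()}

# Three stride-2 convolutions give the rate of 8 in the high-resolution branch.
high_res_rate = 2 ** 3
```

Under this arithmetic the high-resolution branch yields the largest feature map of the three (1/8 versus 1/16 and 1/32), which is consistent with the finding above that Zhao does not explicitly teach a smaller feature map for the second CNN.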
Li explicitly teaches wherein the second CNN has a smaller feature map than the first CNN (Li, Page 911 Figure 2 and Page 912 Table 1, shown below).

[Figure: Li, Page 911, Figure 2 (media_image5.png)]

[Table: Li, Page 912, Table 1 (media_image6.png)]

Thus, Li teaches that the second CNN, which is the CNN that operates on the original image and has fewer layers, also has a smaller feature map than the first CNN, which is the CNN that operates on the downscaled image and has more layers.
	Li and the combination of Zhao and Takahashi are analogous art because they are both in the field of endeavor of fusing multiscale CNNs to analyze images.
	It would have been obvious before the effective filing date of the claimed invention to combine the fusion of multiscale CNNs of Zhao and Takahashi, with the smaller feature map on the full-scale image of Li.  One of ordinary skill in the art would be motivated to do so in order to gain efficiency by reducing the complexity of the second CNN which would take the most time because it works on the largest image, and thus get faster results when speed is prioritized over maximum accuracy (Zhao, Page 3 “Downsampling Feature”:  “A smaller feature map can yield faster inference at the cost of sacrificing prediction accuracy”).

As per Claim 14, this is a system claim corresponding to method Claim 7.  The difference is that it recites a memory.  Zhao, Page 5 Section 4.2, discloses memory:  “Therefore even with more than 50 layers, the inference operation and memory consumption are not large as 18ms and 0.6GB.”  Claim 14 is rejected for the same reasons as Claim 7.
	
As per Claim 20, this is a computer program product claim corresponding to method Claim 7.  The difference is that it recites a computer readable storage medium.  Zhao, Page 5 Section 4.2, discloses memory:  “Therefore even with more than 50 layers, the inference operation and memory consumption are not large as 18ms and 0.6GB.”  Claim 20 is rejected for the same reasons as Claim 7.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Guo et al. (US 2020/0342955 A1) discloses fusing results of 2 CNNs at the nucleotide level, and the gene level, which is similar to two different resolutions:

[Figure: Guo et al., US 2020/0342955 A1 (media_image7.png)]

Gueguen (US 2019/0205700 A1) discloses passing a full-resolution image segment through a fine CNN, and a downscaled full image through a coarse CNN:

[Figure: Gueguen, US 2019/0205700 A1 (media_image8.png)]


Hu et al. (“A Multiscale Fusion Convolutional Neural Network for Plant Leaf Recognition”) discloses fusing CNN results from downscaled versions of an image:

[Figure: Hu et al., “A Multiscale Fusion Convolutional Neural Network for Plant Leaf Recognition” (media_image9.png)]


Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710. The examiner can normally be reached M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/L.A.S./Examiner, Art Unit 2126   
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126