Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on September 3, 2019 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
The information disclosure statement (IDS) submitted on May 1, 2020 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
The information disclosure statement (IDS) submitted on December 17, 2020 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Drawings
The drawings are objected to because the specification references the symbol AFL in figure 2, used to denote an activation function layer, but the symbol AFL does not appear in figure 2; and in figure 3A, the width of convolution filter F2a is not labeled, while all other convolution filters in figure 3A have their width labeled.  Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. 

Specification
The disclosure is objected to because of the following informalities: in paragraph [0011], "...be more precisely..." should read "...be more precise..."; in paragraph [0039], "...remain a preciseness of..." is incorrect and should likely read "...retain the preciseness of..."; in paragraph [0046], "...objection recognition," should read "...object recognition,".  
Appropriate correction is required.
Claim Objections
Claims 4 and 15 are objected to because of the following informalities:  in claim 4, "...neural network further comprise a plurality..." should read "...neural network further comprises a plurality..."; and in claim 15, "...further comprise..." should read "...further comprises...".  Appropriate correction is required.

Claim Rejections - 35 USC § 103
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
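A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.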

Claims 1-3, 8-9, 12-14, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Howard (U.S. Patent No. 11157814-B2) in view of Shi (U.S. Patent Application Publication No. 20180108165-A1) (Please note that limitations enclosed by brackets [as such] are not part of what a reference does or does not teach).
Regarding claim 1, Howard teaches:
a method for adjusting a convolutional neural network, a first model of the convolutional neural network comprising a plurality of convolution layers in a sequential order (Description; (col. 4:21): "In some implementations, the plurality of depthwise separable convolution layers can be stacked one after another,"; Howard teaches a model (a first model) with layers stacked one after another (i.e. in a sequential order)).
the adjustment method comprising: determining a plurality of receptive field widths of the convolution layers in the first model of the convolutional neural network (Fig 2, 206; (col. 10:53): "In some implementations, the one or more convolutional layers include a plurality of convolutional layers. In some of such implementations, determining the respective reduced number of filters for each of the plurality of convolutional layers..."; Howard teaches determining the number of filters in each convolutional layer (i.e. the receptive field width)).
reducing a plurality of channel widths of the convolution layers in the first model into a plurality of reduced channel widths [according to] the receptive field widths of the convolution layers [and an input image width] (Fig 2, 208; (col. 10:60): "At 208, the computing system generates a reduced convolutional neural network structure that has the existing convolutional neural network structure except that each of the one or more convolutional layers in the reduced convolutional neural network has the respective reduced number of filters determined for such convolutional layer,"; Howard teaches reducing the number of filters in each convolutional layer (i.e. the receptive field widths)).
forming a structure of a second model of the convolutional neural network according to the reduced channel widths (Fig 2, 208; (col. 10:60): "At 208, the computing system generates a reduced convolutional neural network structure that has the existing convolutional neural network structure except that each of the one or more convolutional layers in the reduced convolutional neural network has the respective reduced number of filters determined for such convolutional layer,"; Howard teaches generating a reduced network structure using a reduced number of filters (i.e. the receptive field widths), using the existing structure (i.e. the first model) as a basis for the generated structure (i.e. the second model)).
training the second model of the convolutional neural network (Fig 2, 200; (col. 11:6): "In some implementations, method 200 can further include training a convolutional neural network that has the reduced convolutional neural network structure on a set of training data”; Howard teaches training a network with the reduced structure (i.e. the second model)).
Howard does not teach [reducing]…according to [the receptive field widths of the convolution layers] and an input image width.
Shi teaches [reducing]…according to [the receptive field widths of the convolution layers] and an input image width ([0129]: "If the position variance is greater than the set threshold value, the business object sample image is filtered out...the set threshold value may be set to be 1/20-⅕ of an image length or an image width,"; Shi teaches filtering out a sample image if it does not meet a threshold value based on the image width (i.e. filtering out according to an input image width)).

Regarding claim 2, Howard and Shi teach the method of claim 1. Howard further teaches:
wherein reducing the channel widths of the convolution layers in the first model comprises classifying each of the convolution layers in the first model into one of a base layer group and an enhancement layer group by comparing the receptive field widths of the convolution layers in the first model… ((col. 5:51): “…a computing system can generate the width multiplier value based at least in part on one or more desired performance parameters and one or more existing performance parameters associated with an existing convolutional neural network structure,”; Howard teaches using a width multiplier value to reduce the channel widths of the layers. This value is generated using existing parameters from the first model (such as number of filters, i.e. the receptive field widths of the convolution layers in the first model), and the desired parameters of the second model (such as whether a layer needs to be reduced, determined through comparison and used to designate a layer to either the base or enhancement group, i.e. classifying each of the convolution layers in the first model into one of a base layer group and an enhancement layer group)).
determining a plurality of redundancy ratios of the convolution layers in the first model according to a partial calculation amount of the enhancement layer group relative to a total calculation amount of the base layer group and the enhancement layer group ((col. 5:41): "For example, different, respective width multipliers can be used,"; (col. 5:53): "...a computing system can generate the width multiplier value based at least in part on one or more desired performance parameters and one or more existing performance parameters associated with an existing convolutional neural network structure,"; Howard teaches generating a width multiplier (i.e. redundancy ratio) based on desired performance parameters (i.e. a partial calculation amount) and existing performance parameters (i.e. a total calculation amount)).
reducing the channel widths of the convolution layers in the first model into the reduced channel widths according to the redundancy ratios of the convolution layers ((col. 5:16): "Such can be performed by reducing the number of existing filters (or channels) included in such layer according to the width multiplier. For example, in some implementations, the width multiplier can be a value greater than zero and less than one. Further, in some implementations, the number of existing filters in a given convolutional layer can be multiplied by the width multiplier to determine a reduced number of filters for such layer,"; Howard teaches a value to be multiplied with the number of filters in a convolutional layer to reduce the number of filters in the layer, and this value can be between zero and one (i.e. the redundancy ratio)).
Howard does not teach comparing…with a threshold positively related to the input image width.
Shi teaches comparing…with a threshold positively related to the input image width ([0129]: "If the position variance is greater than the set threshold value, the business object sample image is filtered out...the set threshold value may be set to be 1/20-⅕ of an image length or an image width,"; Shi teaches comparing a measured variance against a threshold directly related to the width of an image (i.e. comparing…with a threshold positively related to the input image width)).
Howard and Shi are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard and Shi before him or her to modify the width multiplier determined by existing and desired performance parameters as in Howard to include an image width threshold as in Shi, obtaining the advantage of keeping only information relevant to the current task (Shi; [0128]: “The business object sample image may include some sample images which do not meet a training standard of a convolutional network model. In the present embodiment, this part of the sample images which do not meet the training standard of the convolutional network model may be filtered out by preprocessing the business object sample image,”).
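For illustration only, the width multiplier operation quoted from Howard at col. 5:16 in the discussion of claim 2 may be sketched as follows. The function name, the truncation of the product, the floor of one filter per layer, and the allowance of a multiplier equal to 1 are assumptions made for this sketch and are not taken from Howard.

```python
def reduced_filter_counts(filter_counts, width_multipliers):
    """Return a reduced number of filters for each convolutional layer.

    Each layer's existing filter count is multiplied by that layer's
    width multiplier, as in the passage quoted from Howard.
    """
    reduced = []
    for count, alpha in zip(filter_counts, width_multipliers):
        # Assumption: a multiplier of exactly 1 (no reduction) is permitted.
        if not 0.0 < alpha <= 1.0:
            raise ValueError("width multiplier must be in (0, 1]")
        # Assumption: truncate the product and keep at least one filter.
        reduced.append(max(1, int(count * alpha)))
    return reduced


# Hypothetical three-layer network with a different multiplier per layer.
original = [32, 64, 128]
multipliers = [1.0, 0.75, 0.5]
print(reduced_filter_counts(original, multipliers))  # prints [32, 48, 64]
```

Using a separate multiplier per layer corresponds to the "different, respective width multipliers" language quoted from Howard at col. 5:41.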
Regarding claim 3, Howard and Shi teach the method of claim 2. Howard further teaches:
in response to a first one of the convolution layers has a receptive field width lower than the threshold [positively related to the input image width], classifying the first one of the convolution layers in the first model into one of the base layer group ((col. 5:51): “…a computing system can generate the width multiplier value based at least in part on one or more desired performance parameters and one or more existing performance parameters associated with an existing convolutional neural network structure,”; Howard teaches generating a width multiplier using existing and desired performance parameters. In this case, if the receptive field width parameter is less than the threshold parameter, the layer will be classified to the base layer group by assigning a width multiplier value of 1 such that it will not be reduced (i.e. in response to a first one of the convolution layers has a receptive field width lower than the threshold [positively related to the input image width], classifying the first one of the convolution layers in the first model into one of the base layer group)).
in response to a second one of the convolution layers has a receptive field width exceeding the threshold [positively related to the input image width], classifying the second one of the convolution layers in the first model into one of the enhancement layer group ((col. 5:51): “…a computing system can generate the width multiplier value based at least in part on one or more desired performance parameters and one or more existing performance parameters associated with an existing convolutional neural network structure,”; Howard teaches generating a width multiplier using existing and desired performance parameters. In this case, if the receptive field width parameter is greater than the threshold parameter, the layer will be classified to the enhancement layer group by assigning a width multiplier value between 0 and 1 such that it will be reduced (i.e. in response to a second one of the convolution layers has a receptive field width exceeding the threshold [positively related to the input image width], classifying the second one of the convolution layers in the first model into one of the enhancement layer group)).
Howard does not teach the threshold positively related to the input image width.
Shi teaches the threshold positively related to the input image width ([0129]: "If the position variance is greater than the set threshold value, the business object sample image is filtered out...the set threshold value may be set to be 1/20-⅕ of an image length or an image width,"; Shi teaches comparing a measured variance against a threshold directly related to the width of an image (i.e. the threshold positively related to the input image width)).
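Howard and Shi are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard and Shi before him or her to modify the width multiplier determined by existing and desired performance parameters as in Howard to include an image width threshold as in Shi, obtaining the advantage of keeping only information relevant to the current task (Shi; [0128]: “The business object sample image may include some sample images which do not meet a training standard of a convolutional network model. In the present embodiment, this part of the sample images which do not meet the training standard of the convolutional network model may be filtered out by preprocessing the business object sample image,”).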
Regarding claim 8, Howard and Shi teach the method of claim 1. Howard further teaches that the channel widths of the convolution layers correspond to amounts of convolution filters in each of the convolution layers (Description; (col. 4:41): "…a respective number of filters included in the depthwise convolution layer of each of the plurality of depthwise separable convolution layers…"; Howard teaches that each layer has a certain amount of filters (i.e. convolution filters in each of the convolution layers), and refers to this amount directly as a parameter of the layers (i.e. the channel widths of the convolution layers correspond to amounts of convolution filters)).
Regarding claim 9, Howard and Shi teach the method according to claim 1. Howard also teaches that the first model comprises M convolution layers in the sequential order, M is a positive integer, in response to a channel width in a Mth convolution layer of the first model and another channel width in a (M-1)th convolution layer of the first model are both reduced, the channel width in the Mth convolution layer is reduced with a higher proportion compared to the channel width in the (M-1)th convolution layer ((col. 4:21): "In some implementations, the plurality of depthwise separable convolution layers can be stacked one after another,"; (col. 5:39): “In some implementations, a width multiplier can be applied to each of a plurality of convolutional layers included in a convolutional neural network. For example, different, respective width multipliers can be used,”; Howard teaches a model comprising a plurality of sequential convolution layers (i.e. the first model comprises M convolution layers in the sequential order, M is a positive integer). Howard also teaches applying different width multipliers to a plurality of convolution layers. This can include using one width multiplier on a certain layer, and then a different, more reductive width multiplier on the following layer (i.e. in response to a channel width in a Mth convolution layer of the first model and another channel width in a (M-1)th convolution layer of the first model are both reduced, the channel width in the Mth convolution layer is reduced with a higher proportion compared to the channel width in the (M-1)th convolution layer)).
Regarding claim 12, Howard teaches:
An electronic apparatus, suitable for adjusting a convolution neural network, the electronic apparatus comprising: a data storage, configured to store a first model of the convolution neural network, the first model of the convolution neural network comprising a plurality of convolution layers ((col. 7:4): “The user computing device 102 includes one or more processors 112 and a memory 114,”; (col. 4:21): "In some implementations, the plurality of depthwise separable convolution layers can be stacked one after another,"; Howard teaches using a data storage in the implementation of the network. Howard also teaches a model (a first model) with layers stacked one after another (i.e. in a sequential order)).
a processor, coupled with the data storage, the processor being configured to: determine a plurality of receptive field widths of the convolution layers in the first model of the convolutional neural network ((col. 7:4): “The user computing device 102 includes one or more processors 112 and a memory 114,”; Fig 2, 206; (col. 10:53): "In some implementations, the one or more convolutional layers include a plurality of convolutional layers. In some of such implementations, determining the respective reduced number of filters for each of the plurality of convolutional layers..."; Howard teaches using a processor in the implementation of the network. Howard also teaches determining the number of filters in each convolutional layer (i.e. the receptive field width)).
reducing a plurality of channel widths of the convolution layers in the first model into a plurality of reduced channel widths [according to] the receptive field widths of the convolution layers [and an input image width] (Fig 2, 208; (col. 10:60): "At 208, the computing system generates a reduced convolutional neural network structure that has the existing convolutional neural network structure except that each of the one or more convolutional layers in the reduced convolutional neural network has the respective reduced number of filters determined for such convolutional layer,"; Howard teaches reducing the number of filters in each convolutional layer (i.e. the receptive field widths)).
forming a structure of a second model of the convolutional neural network according to the reduced channel widths (Fig 2, 208; (col. 10:60): "At 208, the computing system generates a reduced convolutional neural network structure that has the existing convolutional neural network structure except that each of the one or more convolutional layers in the reduced convolutional neural network has the respective reduced number of filters determined for such convolutional layer,"; Howard teaches generating a reduced network structure using a reduced number of filters (i.e. the receptive field widths), using the existing structure (i.e. the first model) as a basis for the generated structure (i.e. the second model)).
training the second model of the convolutional neural network (Fig 2, 200; (col. 11:6): "In some implementations, method 200 can further include training a convolutional neural network that has the reduced convolutional neural network structure on a set of training data”; Howard teaches training a network with the reduced structure (i.e. the second model)).
Howard does not teach [reducing]…according to [the receptive field widths of the convolution layers] and an input image width.
Shi teaches [reducing]…according to [the receptive field widths of the convolution layers] and an input image width ([0129]: "If the position variance is greater than the set threshold value, the business object sample image is filtered out...the set threshold value may be set to be 1/20-⅕ of an image length or an image width,"; Shi teaches filtering out a sample image if it does not meet a threshold value based on the image width (i.e. filtering out according to an input image width)).
Howard and Shi are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard and Shi before him or her to modify the channel width reduction and receptive field widths as in Howard to include an image width threshold as in Shi, obtaining the advantage of keeping only information relevant to the current task (Shi; [0128]: “The business object sample image may include some sample images which do not meet a training standard of a convolutional network model. In the present embodiment, this part of the sample images which do not meet the training standard of the convolutional network model may be filtered out by preprocessing the business object sample image,”).
Regarding claim 13, Howard and Shi teach the apparatus of claim 12. Howard further teaches:
wherein reducing the channel widths of the convolution layers in the first model comprises classifying each of the convolution layers in the first model into one of a base layer group and an enhancement layer group by comparing the receptive field widths of the convolution layers in the first model… ((col. 5:51): “…a computing system can generate the width multiplier value based at least in part on one or more desired performance parameters and one or more existing performance parameters associated with an existing convolutional neural network structure,”; Howard teaches using a width multiplier value to reduce the channel widths of the layers. This value is generated using existing parameters from the first model (such as number of filters, i.e. the receptive field widths of the convolution layers in the first model), and the desired parameters of the second model (such as whether a layer needs to be reduced, determined through comparison and used to designate a layer to either the base or enhancement group, i.e. classifying each of the convolution layers in the first model into one of a base layer group and an enhancement layer group)).
determining a plurality of redundancy ratios of the convolution layers in the first model according to a partial calculation amount of the enhancement layer group relative to a total calculation amount of the base layer group and the enhancement layer group ((col. 5:41): "For example, different, respective width multipliers can be used,"; (col. 5:53): "...a computing system can generate the width multiplier value based at least in part on one or more desired performance parameters and one or more existing performance parameters associated with an existing convolutional neural network structure,"; Howard teaches generating a width multiplier (i.e. redundancy ratio) based on desired performance parameters (i.e. a partial calculation amount) and existing performance parameters (i.e. a total calculation amount)).
reducing the channel widths of the convolution layers in the first model into the reduced channel widths according to the redundancy ratios of the convolution layers ((col. 5:16): "Such can be performed by reducing the number of existing filters (or channels) included in such layer according to the width multiplier. For example, in some implementations, the width multiplier can be a value greater than zero and less than one. Further, in some implementations, the number of existing filters in a given convolutional layer can be multiplied by the width multiplier to determine a reduced number of filters for such layer,"; Howard teaches a value to be multiplied with the number of filters in a convolutional layer to reduce the number of filters in the layer, and this value can be between zero and one (i.e. the redundancy ratio)).
Howard does not teach comparing…with a threshold positively related to the input image width.
Shi teaches comparing…with a threshold positively related to the input image width ([0129]: "If the position variance is greater than the set threshold value, the business object sample image is filtered out...the set threshold value may be set to be 1/20-⅕ of an image length or an image width,"; Shi teaches comparing a measured variance against a threshold directly related to the width of an image (i.e. comparing…with a threshold positively related to the input image width)).
Howard and Shi are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard and Shi before him or her to modify the width multiplier determined by existing and desired performance parameters as in Howard to include an image width threshold as in Shi, obtaining the advantage of keeping only information relevant to the current task (Shi; [0128]: “The business object sample image may include some sample images which do not meet a training standard of a convolutional network model. In the present embodiment, this part of the sample images which do not meet the training standard of the convolutional network model may be filtered out by preprocessing the business object sample image,”).
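For illustration only, the claimed "partial calculation amount of the enhancement layer group relative to a total calculation amount of the base layer group and the enhancement layer group" may be read as a simple share of the total computation. The sketch below assumes that reading. The function name, the per-layer FLOP figures, and the group labels are hypothetical, and neither the claims nor the cited references state that the redundancy ratio equals this share.

```python
def enhancement_calculation_share(flops_per_layer, groups):
    """Share of total computation attributable to enhancement-group layers.

    flops_per_layer: calculation amount of each convolution layer.
    groups: "base" or "enhancement" label for each layer, in the same order.
    """
    if len(flops_per_layer) != len(groups):
        raise ValueError("one group label is required per layer")
    total = sum(flops_per_layer)
    if total <= 0:
        raise ValueError("total calculation amount must be positive")
    # Partial calculation amount: only layers in the enhancement group.
    partial = sum(
        flops for flops, group in zip(flops_per_layer, groups)
        if group == "enhancement"
    )
    return partial / total


# Hypothetical network: one base layer followed by two enhancement layers.
share = enhancement_calculation_share(
    [100, 300, 600], ["base", "enhancement", "enhancement"]
)
print(share)
```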
Regarding claim 14, Howard and Shi teach the apparatus of claim 13. Howard further teaches:
in response to a first one of the convolution layers has a receptive field width lower than the threshold [positively related to the input image width], classifying the first one of the convolution layers in the first model into one of the base layer group ((col. 5:51): “…a computing system can generate the width multiplier value based at least in part on one or more desired performance parameters and one or more existing performance parameters associated with an existing convolutional neural network structure,”; Howard teaches generating a width multiplier using existing and desired performance parameters. In this case, if the receptive field width parameter is less than the threshold parameter, the layer will be classified to the base layer group by assigning a width multiplier value of 1 such that it will not be reduced (i.e. in response to a first one of the convolution layers has a receptive field width lower than the threshold [positively related to the input image width], classifying the first one of the convolution layers in the first model into one of the base layer group)).
in response to a second one of the convolution layers has a receptive field width exceeding the threshold [positively related to the input image width], classifying the second one of the convolution layers in the first model into one of the enhancement layer group ((col. 5:51): “…a computing system can generate the width multiplier value based at least in part on one or more desired performance parameters and one or more existing performance parameters associated with an existing convolutional neural network structure,”; Howard teaches generating a width multiplier using existing and desired performance parameters. In this case, if the receptive field width parameter is greater than the threshold parameter, the layer will be classified to the enhancement layer group by assigning a width multiplier value between 0 and 1 such that it will be reduced (i.e. in response to a second one of the convolution layers has a receptive field width exceeding the threshold [positively related to the input image width], classifying the second one of the convolution layers in the first model into one of the enhancement layer group)).
Howard does not teach the threshold positively related to the input image width.
Shi teaches the threshold positively related to the input image width ([0129]: "If the position variance is greater than the set threshold value, the business object sample image is filtered out...the set threshold value may be set to be 1/20-⅕ of an image length or an image width,"; Shi teaches comparing a measured variance against a threshold directly related to the width of an image (i.e. the threshold positively related to the input image width)).
Howard and Shi are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard and Shi before him or her to modify the width multiplier determined by existing and desired performance parameters as in Howard to include an image width threshold as in Shi, obtaining the advantage of keeping only information relevant to the current task (Shi; [0128]: “The business object sample image may include some sample images which do not meet a training standard of a convolutional network model. In the present embodiment, this part of the sample images which do not meet the training standard of the convolutional network model may be filtered out by preprocessing the business object sample image,”).
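For illustration only, the combination applied to claims 3 and 14 (a per-layer comparison against a threshold tied to the image width, followed by assignment of a width multiplier) may be sketched as follows. The function name, the default fraction of 1/5 (the upper end of the range quoted from Shi at [0129]), the treatment of a width exactly equal to the threshold, and the example receptive field widths are assumptions made for this sketch.

```python
def classify_layers(receptive_field_widths, input_image_width, fraction=0.2):
    """Assign each convolution layer to the base or enhancement group.

    The threshold grows with the input image width (threshold = fraction *
    width), so it is positively related to that width.
    """
    threshold = fraction * input_image_width
    groups = []
    for width in receptive_field_widths:
        if width < threshold:
            # Lower than the threshold: base group (multiplier of 1, not reduced).
            groups.append("base")
        else:
            # Assumption: a width equal to the threshold is treated as exceeding it.
            groups.append("enhancement")
    return groups


# Hypothetical receptive field widths for a four-layer model, 224-pixel input.
print(classify_layers([3, 7, 45, 100], 224))
# prints ['base', 'base', 'enhancement', 'enhancement']
```

With a 224-pixel input and a fraction of 1/5 the threshold is 44.8, so only the last two layers would receive a width multiplier below 1 under the reading set out above.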
Regarding claim 19, Howard and Shi teach the apparatus of claim 12. Howard further teaches that the channel widths of the convolution layers correspond to amounts of convolution filters in each of the convolution layers (Description; (col. 4:41): "…a respective number of filters included in the depthwise convolution layer of each of the plurality of depthwise separable convolution layers…"; Howard teaches that each layer has a certain amount of filters (i.e. convolution filters in each of the convolution layers), and refers to this amount directly as a parameter of the layers (i.e. the channel widths of the convolution layers correspond to amounts of convolution filters)).
Regarding claim 20, Howard and Shi teach the apparatus of claim 12. Howard also teaches that the first model comprises M convolution layers in the sequential order, M is a positive integer, in response to a channel width in a Mth convolution layer of the first model and another channel width in a (M-1)th convolution layer of the first model are both reduced, the channel width in the Mth convolution layer is reduced with a higher proportion compared to the channel width in the (M-1)th convolution layer ((col. 4:21): "In some implementations, the plurality of depthwise separable convolution layers can be stacked one after another,"; (col. 5:39): “In some implementations, a width multiplier can be applied to each of a plurality of convolutional layers included in a convolutional neural network. For example, different, respective width multipliers can be used,”; Howard teaches a model comprising a plurality of sequential convolution layers (i.e. the first model comprises M convolution layers in the sequential order, M is a positive integer). Howard also teaches applying different width multipliers to a plurality of convolution layers. This can include using one width multiplier on a certain layer, and then a different, more reductive width multiplier on the following layer (i.e. in response to a channel width in a Mth convolution layer of the first model and another channel width in a (M-1)th convolution layer of the first model are both reduced, the channel width in the Mth convolution layer is reduced with a higher proportion compared to the channel width in the (M-1)th convolution layer)).
Claims 4 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Howard and Shi, further in view of Kruglov (U.S. Patent Application Publication No. 20200394520-A1), Liu (Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, Marianna Pensky; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 806-814), and Osogami (U.S. Patent Application Publication No. 20180060729-A1).
Regarding claim 4, Howard and Shi teach the method according to claim 2. Howard also teaches:
the convolutional neural network further comprise a plurality of activation layers, each of the activation layers are arranged after one of the convolution layers, a convolution output tensor generated by each of the convolution layers is rectified by one of the activation layers into non-zero outputs and zero outputs ((col. 13:51): “In the example defined in Table 1, all layers are followed by a batchnorm (see, e.g., Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv: 1502.03167, 2015) and ReLU nonlinearity with the exception of the final fully connected layer which has no nonlinearity and feeds into a...”; Howard teaches batch normalization and ReLU functions (an activation function) following each of the convolutional layers (i.e. the convolutional neural network further comprise a plurality of activation layers, each of the activation layers are arranged after one of the convolution layers). A ReLU function will either output a zero or non-zero output based on the input it receives, in this case the output of a convolutional layer (convolutional output tensor), thereby rectifying the input value (i.e. a convolution output tensor generated by each of the convolution layers is rectified by one of the activation layers into non-zero outputs and zero outputs)).
Howard does not teach:
calculating an effective probability respectively for each of the convolution layers according to a ratio of the non-zero outputs
and calculating an effective flop count respectively for each of the convolution layers, wherein the effective flop count is calculated by a product between an original flop count and the effective probability
Kruglov teaches:
calculating an effective probability respectively for each of the convolution layers, but does not teach calculating…according to a ratio of the non-zero outputs (Description, [0019]: "The probability value for a given channel, and the corresponding masking of that channel, may be updated for a given iteration based on a network loss and/or an amount of a processing resource (e.g., a number of FLOPS) detected for the preceding iteration,"; Kruglov teaches updating (i.e. calculating) the probability value of a given channel, and if each channel in a layer has a probability, then that layer also has a probability (i.e. calculating an effective probability respectively for each of the convolution layers)).
and calculating an effective flop count respectively for each of the convolution layers (Description, [0019]: "The probability value for a given channel, and the corresponding masking of that channel, may be updated for a given iteration based on a network loss and/or an amount of a processing resource (e.g., a number of FLOPS) detected for the preceding iteration,"; Description, [0027]: "In an embodiment, an initialization state of device 102 may include initial values of parameters ρl,c i which are to be used to determine corresponding retention probabilities pl,c i (e.g., where any particular ρl,c i or ρl,c i is specific to a particular mask layer, specific to a particular channel of that mask layer, and specific to a particular iteration),"; Kruglov teaches updating a probability value based on a processing resource, such as flop count. Kruglov also teaches that the probability value and processing resource are specific to a layer (i.e. calculating an effective flop count respectively for each of the convolution layers)).
Howard, Shi, and Kruglov are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard, Shi and Kruglov before him or her to modify the convolution layers including activation layers as in Howard and Shi to include calculating flop count as in Kruglov, obtaining the advantage of reducing the resource use of the network (Kruglov; [0004]: “Pruning can reduce the amount of memory required to store neural network parameters and can reduce processing hardware of the network which would otherwise be needed,”).
Osogami teaches the effective flop count is calculated by a product between an original flop count and effective probability ([0105] "For example, the learning processing section 150 may multiply the total likelihood p.sub.j respectively by each update amount,"; Osogami teaches multiplying a total likelihood (i.e. effective probability) by one or more update amounts, such as flop count (i.e. the effective flop count is calculated by a product between an original flop count and effective probability)).
Howard, Shi, and Osogami are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard, Shi and Osogami before him or her to modify the convolution layers including activation layers as in Howard and Shi to include calculating effective probability and flop count as in Osogami, obtaining the advantage of increasing the efficiency of the network (Osogami; [0105] “In this way, the learning processing section 150 can efficiently perform the update of the weight parameters,”).
Liu teaches calculating…according to a ratio of the non-zero outputs (section 3.2; “…where γ is the proportion of non-zeros of the sparse matrix,”; Liu teaches using a proportion of non-zero outputs in a calculation (i.e. according to a ratio of the non-zero outputs)).
Howard, Shi, and Liu are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard, Shi, and Liu before him or her to modify the activation layers as in Howard and Shi to include using a ratio of non-zero outputs as in Liu, obtaining the advantage of increasing the speed of the network (Liu; section 3.1; “Our objective is to replace computationally expensive convolutional operation O = K * I in formula (1) by its fast sparsified version which is based on multiplication of sparse matrices,”).
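For reference, the claimed calculation discussed above (an effective probability taken from the ratio of non-zero activation outputs, and an effective flop count taken as the product of the original flop count and that probability) can be sketched as below. This is a minimal sketch under stated assumptions: the helper names, the sample tensor values, and the original flop count of 1,000,000 are hypothetical, and the code does not reproduce any method of Kruglov, Liu, or Osogami.

```python
def relu(values):
    """Rectify a flattened convolution output into zero and non-zero outputs."""
    return [v if v > 0 else 0.0 for v in values]


def effective_probability(activation_outputs):
    """Ratio of non-zero outputs to all outputs of one activation layer."""
    if not activation_outputs:
        raise ValueError("activation output is empty")
    nonzero = sum(1 for v in activation_outputs if v != 0)
    return nonzero / len(activation_outputs)


def effective_flops(original_flops, probability):
    """Effective flop count as original flop count times effective probability."""
    return original_flops * probability


# Hypothetical flattened convolution output tensor for one layer.
conv_output = [-1.5, 0.2, 3.0, -0.7, 0.0, 1.1, -2.0, 0.4]
rectified = relu(conv_output)

p = effective_probability(rectified)      # 4 of the 8 outputs are non-zero
eff = effective_flops(1_000_000, p)       # assumed original flop count
print(p, eff)
```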
Regarding claim 15, Howard and Shi teach the apparatus according to claim 13. Howard also teaches:
the convolutional neural network further comprise a plurality of activation layers, each of the activation layers are arranged after one of the convolution layers, a convolution output tensor generated by each of the convolution layers is rectified by one of the activation layers into non-zero outputs and zero outputs ((col. 13:51): “In the example defined in Table 1, all layers are followed by a batchnorm (see, e.g., Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv: 1502.03167, 2015) and ReLU nonlinearity with the exception of the final fully connected layer which has no nonlinearity and feeds into a…”; Howard teaches batch normalization and ReLU functions (an activation function) following each of the convolutional layers (i.e. the convolutional neural network further comprise a plurality of activation layers, each of the activation layers are arranged after one of the convolution layers). A ReLU function will either output a zero or non-zero output based on the input it receives, in this case the output of a convolutional layer (convolutional output tensor), thereby rectifying the input value (i.e. a convolution output tensor generated by each of the convolution layers is rectified by one of the activation layers into non-zero outputs and zero outputs)).
Howard does not teach:
calculating an effective probability respectively for each of the convolution layers according to a ratio of the non-zero outputs
and calculating an effective flop count respectively for each of the convolution layers, wherein the effective flop count is calculated by a product between an original flop count and the effective probability
Kruglov teaches:
calculating an effective probability respectively for each of the convolution layers, but does not teach calculating…according to a ratio of the non-zero outputs (Description, [0019]: "The probability value for a given channel, and the corresponding masking of that channel, may be updated for a given iteration based on a network loss and/or an amount of a processing resource (e.g., a number of FLOPS) detected for the preceding iteration,"; Kruglov teaches updating (i.e. calculating) the probability value of a given channel, and if each channel in a layer has a probability, then that layer also has a probability (i.e. calculating an effective probability respectively for each of the convolution layers)).
and calculating an effective flop count respectively for each of the convolution layers (Description, [0019]: "The probability value for a given channel, and the corresponding masking of that channel, may be updated for a given iteration based on a network loss and/or an amount of a processing resource (e.g., a number of FLOPS) detected for the preceding iteration,"; Description, [0027]: "In an embodiment, an initialization state of device 102 may include initial values of parameters ρl,c i which are to be used to determine corresponding retention probabilities pl,c i (e.g., where any particular ρl,c i or ρl,c i is specific to a particular mask layer, specific to a particular channel of that mask layer, and specific to a particular iteration),"; Kruglov teaches updating a probability value based on a processing resource, such as flop count. Kruglov also teaches that the probability value and processing resource are specific to a layer (i.e. calculating an effective flop count respectively for each of the convolution layers)).
Howard, Shi, and Kruglov are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard, Shi and Kruglov before him or her to modify the convolution layers including activation layers as in Howard and Shi to include calculating flop count as in Kruglov, obtaining the advantage of reducing the resource use of the network (Kruglov; [0004]: “Pruning can reduce the amount of memory required to store neural network parameters and can reduce processing hardware of the network which would otherwise be needed,”).
Osogami teaches the effective flop count is calculated by a product between an original flop count and effective probability ([0105] "For example, the learning processing section 150 may multiply the total likelihood p.sub.j respectively by each update amount,"; Osogami teaches multiplying a total likelihood (i.e. effective probability) by one or more update amounts, such as flop count (i.e. the effective flop count is calculated by a product between an original flop count and effective probability)).
Howard, Shi, and Osogami are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard, Shi and Osogami before him or her to modify the convolution layers including activation layers as in Howard and Shi to include calculating effective probability and flop count as in Osogami, obtaining the advantage of increasing the efficiency of the network (Osogami; [0105] “In this way, the learning processing section 150 can efficiently perform the update of the weight parameters,”).
Liu teaches calculating…according to a ratio of the non-zero outputs (section 3.2; “…where γ is the proportion of non-zeros of the sparse matrix,”; Liu teaches using a proportion of non-zero outputs in a calculation (i.e. according to a ratio of the non-zero outputs)).
Howard, Shi, and Liu are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard, Shi, and Liu before him or her to modify the activation layers as in Howard and Shi to include using a ratio of non-zero outputs as in Liu, obtaining the advantage of increasing the speed of the network (Liu; section 3.1; “Our objective is to replace computationally expensive convolutional operation O = K * I in formula (1) by its fast sparsified version which is based on multiplication of sparse matrices,”).
Claims 5, 7, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Talathi (U.S. Patent No 20170061326-A1) in view of Howard and Shi.
Regarding claim 5, Howard and Shi teach the method of claim 2. Howard further teaches:
the first model of the convolutional neural network comprises a plurality of…convolution layers…arranged in the sequential order (Description, (col. 4:21): "In some implementations, the plurality of depthwise separable convolution layers can be stacked one after another,"; Howard teaches a plurality of convolutional layers, arranged one after another (i.e. a plurality of convolution layers arranged in the sequential order))
reducing channel widths of the second convolution layers in the second macroblock in response to any one of the second convolution layers is in the enhancement layer group ((col. 4:65): “The width multiplier can be used to reduce the computational costs and number of parameters of a convolutional neural network by reducing the number of filters (or channels) included in one or more convolutional layers of a convolutional neural network,”; Howard teaches reducing the number of filters (i.e. reducing channel widths) of multiple particular convolution layers (i.e. a particular group or macroblock) of a convolutional neural network (i.e. reducing channel widths of the second convolution layers in the second macroblock). If a layer has a width multiplier between 0 and 1 (i.e. is in the enhancement layer group), its channel width will be reduced (i.e. in response to any one of the second convolution layers is in the enhancement layer group)).
reducing channel widths of the first convolution layers in the first macroblock in response to any one of the first convolution layers is in the enhancement layer group ((col. 4:65): “The width multiplier can be used to reduce the computational costs and number of parameters of a convolutional neural network by reducing the number of filters (or channels) included in one or more convolutional layers of a convolutional neural network,”; Howard teaches reducing the number of filters (i.e. reducing channel widths) of multiple particular convolution layers (i.e. a particular group or macroblock) of a convolutional neural network (i.e. reducing channel widths of the first convolution layers in the first macroblock). If a layer has a width multiplier between 0 and 1 (i.e. is in the enhancement layer group), its channel width will be reduced (i.e. in response to any one of the first convolution layers is in the enhancement layer group)).
a plurality of first convolution layers …
Howard does not explicitly teach: 
the first model of the convolutional neural network comprises [a plurality] of first convolution layers, a pooling layer, [a plurality] of second convolution layers…
grouping the first convolution layers into a first macroblock and the second convolution layers into a second macroblock.
Talathi teaches:
the first model of the convolutional neural network comprises [a plurality] of first convolution layers, a pooling layer, [a plurality] of second convolution layers (Detailed Description, [0070]: " ...the exemplary deep convolutional network 350 includes multiple convolution blocks (e.g., C1 and C2). Each of the convolution blocks may be configured with a convolution layer, a normalization layer (LNorm), and a pooling layer,"; Talathi teaches convolution blocks that comprise a network, which may contain a convolution layer followed by a pooling layer. The pooling layer at the end of a block would then connect to the convolution layer at the start of the next block (i.e. the first model of the convolutional neural network comprises a first convolution layer, a pooling layer, a second convolution layer))
grouping the first convolution layers into a first macroblock and the second convolution layers into a second macroblock (Detailed Description [0070]: “…the exemplary deep convolutional network 350 includes multiple convolution blocks (e.g., C1 and C2). Each of the convolution blocks may be configured with a convolution layer, a normalization layer (LNorm), and a pooling layer,”; Talathi teaches multiple blocks (i.e. first macroblock/second macroblock), where each block consists of a group of several layers (i.e. grouping the first layers into a first macroblock and the second layers into a second macroblock)).
Howard, Shi, and Talathi are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard, Shi, and Talathi before him or her to modify the multiple convolution layers of Howard and Shi to include the pooling layer and layer groups of Talathi, thereafter the multiple convolution layers are connected to the pooling layer and layer groups, to obtain the advantage of reducing the dimensions of the convolution layers (Talathi; [0070]: “The pooling layer may provide down sampling aggregation over space for local invariance and dimensionality reduction,”).
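To illustrate the grouping limitation addressed above, the following minimal sketch splits a sequential layer list into macroblocks at each pooling layer. The layer labels, the function name, and the example network are hypothetical assumptions made for illustration; they are not drawn from Talathi, Howard, or the application.

```python
def group_into_macroblocks(layer_types):
    """Group consecutive convolution layers into macroblocks split by pooling layers."""
    macroblocks, current = [], []
    for index, kind in enumerate(layer_types):
        if kind == "conv":
            current.append(index)
        elif kind == "pool" and current:
            # A pooling layer closes the macroblock collected so far.
            macroblocks.append(current)
            current = []
    if current:
        macroblocks.append(current)
    return macroblocks


# Hypothetical model: first convolution layers, a pooling layer, second convolution layers.
layers = ["conv", "conv", "pool", "conv", "conv", "conv"]
blocks = group_into_macroblocks(layers)
print(blocks)
```

Here the layer indices before the pooling layer form the first macroblock and the indices after it form the second macroblock.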
Regarding claim 7, Howard and Shi teach the method of claim 1. Howard further teaches the pre-trained model comprises the convolution layers with the channel widths in default amounts, the second model is formed to comprise the convolution layers with the channel widths in reduced amounts, the reduced amounts are lower than or equal to the default amounts (Fig 2, 208; (col. 10:60): "At 208, the computing system generates a reduced convolutional neural network structure that has the existing convolutional neural network structure except that each of the one or more convolutional layers in the reduced convolutional neural network has the respective reduced number of filters determined for such convolutional layer,"; Howard teaches forming a second model with a reduced number of filters (i.e. second model with channel widths in reduced amounts) from an existing network with an original number of filters (i.e. the pre-trained model comprises the convolution layers with the channel widths in default amounts), and that this reduction is done by multiplying the original number of filters by a width multiplier that can be between 0 and 1, as above, which would make the reduced number of filters less than or equal to the original number of filters (i.e. the reduced amounts are lower than or equal to the default amounts)).
Howard does not teach the first model is a pre-trained model of the convolutional neural network.
Talathi teaches the first model is a pre-trained model of the convolutional neural network (Summary [0010]: "...a method for improving performance of a trained machine learning model is presented,"; Talathi teaches improving the performance of a machine learning model (i.e. the first model), which has already been trained prior to applying the method for improvement (i.e. the first model is a pre-trained model of the convolutional neural network)).
Howard, Shi, and Talathi are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard, Shi, and Talathi before him or her to modify the second model with reduced channel widths of Howard and Shi to include the pre-trained first model as in Talathi, to obtain the advantage of improving upon an already trained model (Talathi; [0010]: “…a method for improving performance of a trained machine learning model is presented. The method comprises adding a second classifier with a second objective function to a first classifier with a first objective function. The second objective function is used to directly reduce errors of the first classifier,”).
Regarding claim 16, Howard and Shi teach the apparatus of claim 13. Howard further teaches:
the first model of the convolutional neural network comprises a plurality of…convolution layers…arranged in the sequential order (Description, (col. 4:21): "In some implementations, the plurality of depthwise separable convolution layers can be stacked one after another,"; Howard teaches a plurality of convolutional layers, arranged one after another (i.e. a plurality of convolution layers arranged in the sequential order))
reducing channel widths of the second convolution layers in the second macroblock in response to any one of the second convolution layers is in the enhancement layer group ((col. 4:65): “The width multiplier can be used to reduce the computational costs and number of parameters of a convolutional neural network by reducing the number of filters (or channels) included in one or more convolutional layers of a convolutional neural network,”; Howard teaches reducing the number of filters (i.e. reducing channel widths) of multiple particular convolution layers (i.e. a particular group or macroblock) of a convolutional neural network (i.e. reducing channel widths of the second convolution layers in the second macroblock). If a layer has a width multiplier between 0 and 1 (i.e. is in the enhancement layer group), its channel width will be reduced (i.e. in response to any one of the second convolution layers is in the enhancement layer group)).
reducing channel widths of the first convolution layers in the first macroblock in response to any one of the first convolution layers is in the enhancement layer group ((col. 4:65): “The width multiplier can be used to reduce the computational costs and number of parameters of a convolutional neural network by reducing the number of filters (or channels) included in one or more convolutional layers of a convolutional neural network,”; Howard teaches reducing the number of filters (i.e. reducing channel widths) of multiple particular convolution layers (i.e. a particular group or macroblock) of a convolutional neural network (i.e. reducing channel widths of the first convolution layers in the first macroblock). If a layer has a width multiplier between 0 and 1 (i.e. is in the enhancement layer group), its channel width will be reduced (i.e. in response to any one of the first convolution layers is in the enhancement layer group)).
Howard does not explicitly teach:
the first model of the convolutional neural network comprises a plurality of first convolution layers, a pooling layer, a plurality of second convolution layers arranged in the sequential order
grouping the first convolution layers into a first macroblock and the second convolution layers into a second macroblock.
Talathi teaches:
the first model of the convolutional neural network comprises a first convolution layer, a pooling layer, a second convolution layer (Detailed Description, [0070]: " ...the exemplary deep convolutional network 350 includes multiple convolution blocks (e.g., C1 and C2). Each of the convolution blocks may be configured with a convolution layer, a normalization layer (LNorm), and a pooling layer,"; Talathi teaches convolution blocks that comprise a network, which may contain a convolution layer followed by a pooling layer. The pooling layer at the end of a block would then connect to the convolution layer at the start of the next block (i.e. the first model of the convolutional neural network comprises a first convolution layer, a pooling layer, a second convolution layer))
grouping the first layers into a first macroblock and the second layers into a second macroblock (Detailed Description [0070]: “Each of the convolution blocks may be configured with a convolution layer, a normalization layer (LNorm), and a pooling layer,”; Talathi teaches grouping specific layers into different blocks (i.e. grouping the first layers into a first macroblock and the second layers into a second macroblock)).
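Howard, Shi, and Talathi are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard, Shi, and Talathi before him or her to modify the multiple convolution layers of Howard and Shi to include the pooling layer and layer groups of Talathi, thereafter the multiple convolution layers are connected to the pooling layer and layer groups, to obtain the advantage of reducing the dimensions of the convolution layers (Talathi; [0070]: “The pooling layer may provide down sampling aggregation over space for local invariance and dimensionality reduction,”).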
Regarding claim 18, Howard and Shi teach the apparatus of claim 12. Howard further teaches the pre-trained model comprises the convolution layers with the channel widths in default amounts, the second model is formed to comprise the convolution layers with the channel widths in reduced amounts, the reduced amounts are lower than or equal to the default amounts (Fig 2, 208; (col. 10:60): "At 208, the computing system generates a reduced convolutional neural network structure that has the existing convolutional neural network structure except that each of the one or more convolutional layers in the reduced convolutional neural network has the respective reduced number of filters determined for such convolutional layer,"; Howard teaches forming a second model with a reduced number of filters (i.e. second model with channel widths in reduced amounts) from an existing network with an original number of filters (i.e. the pre-trained model comprises the convolution layers with the channel widths in default amounts), and that this reduction is done by multiplying the original number of filters by a width multiplier that can be between 0 and 1, as above, which would make the reduced number of filters less than or equal to the original number of filters (i.e. the reduced amounts are lower than or equal to the default amounts)).
Howard does not teach the first model is a pre-trained model of the convolutional neural network.
Talathi teaches the first model is a pre-trained model of the convolutional neural network (Summary [0010]: "...a method for improving performance of a trained machine learning model is presented,"; Talathi teaches improving the performance of a machine learning model (i.e. the first model), which has already been trained prior to applying the method for improvement (i.e. the first model is a pre-trained model of the convolutional neural network)).
Howard, Shi, and Talathi are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard, Shi, and Talathi before him or her to modify the second model with reduced channel widths of Howard and Shi to include the pre-trained first model as in Talathi, to obtain the advantage of improving upon an already trained model (Talathi; [0010]: “…a method for improving performance of a trained machine learning model is presented. The method comprises adding a second classifier with a second objective function to a first classifier with a first objective function. The second objective function is used to directly reduce errors of the first classifier,”).
Claims 6 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Blaettler (U.S. Patent No 20180269897-A1) in view of Howard, Shi, and Talathi.
Regarding claim 6, Howard, Shi, and Talathi teach the method of claim 5. Howard further teaches …the channel widths in the second macroblock are reduced with a higher proportion compared to the channel widths in the first macroblock ((col. 5:39): “In some implementations, a width multiplier can be applied to each of a plurality of convolutional layers included in a convolutional neural network. For example, different, respective width multipliers can be used,”; Howard teaches using different width multipliers on multiple layers, which can include using one width multiplier on one group and another width multiplier on another group, such that one group is reduced more than the other (i.e. the channel widths in the second macroblock are reduced with a higher proportion compared to the channel widths in the first macroblock)).
Howard does not teach in response to the channel widths in the second macroblock and the channel widths in the first macroblock are [both] reduced.
Blaettler teaches in response to the channel widths in the second macroblock and the channel widths in the first macroblock are [both] reduced ([0091]: "In response to a reduced context model being indicated in block 1902 control transfers to block 1904, where flash controller 140 initializes a set S…"; Blaettler teaches checking for reduction in a model, and if reduction is indicated, performing an action (i.e. acting in response to the channel widths…are reduced)).
Howard, Shi, Talathi, and Blaettler are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard, Shi, Talathi, and Blaettler before him or her to modify the blocks of multiple convolution layers with different reduction proportions established by Howard, Shi, and Talathi to include checking for multiple reductions before acting as in Blaettler, to obtain the advantage of maintaining a desired structural proportion for the network (Blaettler; [0094]: “As one example, the desired size of the reduced context model ensemble may be determined based on achieving a desired minimum CR and throughput,”).
Regarding claim 17, Howard, Shi, and Talathi teach the apparatus of claim 16. Howard further teaches …the channel widths in the second macroblock are reduced with a higher proportion compared to the channel widths in the first macroblock ((col. 5:39): “In some implementations, a width multiplier can be applied to each of a plurality of convolutional layers included in a convolutional neural network. For example, different, respective width multipliers can be used,”; Howard teaches using different width multipliers on multiple layers, which can include using one width multiplier on one group and another width multiplier on another group, such that one group is reduced more than the other (i.e. the channel widths in the second macroblock are reduced with a higher proportion compared to the channel widths in the first macroblock)).
Howard does not teach in response to the channel widths in the second macroblock and the channel widths in the first macroblock are [both] reduced.
Blaettler teaches in response to the channel widths in the second macroblock and the channel widths in the first macroblock are [both] reduced ([0091]: "In response to a reduced context model being indicated in block 1902 control transfers to block 1904, where flash controller 140 initializes a set S…"; Blaettler teaches checking for reduction in a model, and if reduction is indicated, performing an action (i.e. acting in response to the channel widths…are reduced)).
Howard, Shi, Talathi, and Blaettler are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard, Shi, Talathi, and Blaettler before him or her to modify the blocks of multiple convolution layers with different reduction proportions established by Howard, Shi, and Talathi to include checking for multiple reductions before acting as in Blaettler, to obtain the advantage of maintaining a desired structural proportion for the network (Blaettler; [0094]: “As one example, the desired size of the reduced context model ensemble may be determined based on achieving a desired minimum CR and throughput,”).
Claims 10 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Chang (U.S. Patent No 20200257975-A1) in view of Howard and Shi.
Regarding claim 10, Howard and Shi teach the method according to claim 1. Howard also teaches a receptive field width of one of the convolution layers is determined by a projective region on an input image affecting one feature point in a convolution output tensor at the one of the convolution layers ((col. 5:51): “…a computing system can generate the width multiplier value based at least in part on one or more desired performance parameters and one or more existing performance parameters associated with an existing convolutional neural network structure,”; Howard teaches determining a width multiplier value (which will be used to modify the filter width of a convolution layer) using one or more desired performance parameters (in this case, the desired new filter width) and one or more existing performance parameters (in this case, the projective region size) (i.e. a receptive field width of one of the convolution layers is determined by a projective region)).
Howard does not teach [a projective region] on an input image affecting one feature point in a convolution output tensor at the one of the convolution layers.
Chang teaches [a projective region] on an input image affecting one feature point in a convolution output tensor at the one of the convolution layers ([0122]: "The instruction for extracting an object 1321 extracts one or more object images from the provided frame image...The instruction for determining feature points 1322 outputs feature data of each of the object images using CNN layers, and adds the feature points by mapping to the current embedding space,"; Chang teaches taking one part of a frame image (i.e. input image), and using that part to determine feature points and output feature data (i.e. affecting one feature point in a convolution output tensor at the one of the convolution layers)).
Howard, Shi, and Chang are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard, Shi, and Chang before him or her to modify the width determination of Howard and Shi to include the feature points as set forth in Chang, to obtain the advantage of making the data understandable (Chang; [0008]: “Meanwhile, it is necessary not only to classify different data correctly, but also to understand the meanings of the data (hereinafter, referred to as semantics) in order to understand the data,”).
Regarding claim 21, Howard and Shi teach the apparatus according to claim 12. Howard also teaches a receptive field width of one of the convolution layers is determined by a projective region on an input image affecting one feature point in a convolution output tensor at the one of the convolution layers ((col. 5:51): “…a computing system can generate the width multiplier value based at least in part on one or more desired performance parameters and one or more existing performance parameters associated with an existing convolutional neural network structure,”; Howard teaches determining a width multiplier value (which will be used to modify the filter width of a convolution layer) using one or more desired performance parameters (in this case, the desired new filter width) and one or more existing performance parameters (in this case, the projective region size) (i.e. a receptive field width of one of the convolution layers is determined by a projective region)).
Howard does not teach [a projective region] on an input image affecting one feature point in a convolution output tensor at the one of the convolution layers.
Chang teaches [a projective region] on an input image affecting one feature point in a convolution output tensor at the one of the convolution layers ([0122]: "The instruction for extracting an object 1321 extracts one or more object images from the provided frame image...The instruction for determining feature points 1322 outputs feature data of each of the object images using CNN layers, and adds the feature points by mapping to the current embedding space,"; Chang teaches taking one part of a frame image (i.e. input image), and using that part to determine feature points and output feature data (i.e. affecting one feature point in a convolution output tensor at the one of the convolution layers)).
Howard, Shi, and Chang are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard, Shi, and Chang before him or her to modify the width determination of Howard and Shi to include the feature points as set forth in Chang, to obtain the advantage of making the data understandable (Chang; [0008]: “Meanwhile, it is necessary not only to classify different data correctly, but also to understand the meanings of the data (hereinafter, referred to as semantics) in order to understand the data,”).
Claims 11 and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Guo ("Simple convolutional neural network on image classification," 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA), 2017, pp. 721-724, doi: 10.1109/ICBDA.2017.8078730) in view of Howard and Shi.
Regarding claim 11, Howard and Shi teach the method according to claim 1. Howard does not explicitly teach that the [second] model of the convolutional neural network is utilized to recognize an incoming image, and the [second] model is further utilized to generate a label corresponding to the incoming image, to detect an object in the incoming image or to segment a foreground object from a background of the incoming image.
Guo teaches that the [second] model of the convolutional neural network is utilized to recognize an incoming image, and the [second] model is further utilized to generate a label corresponding to the incoming image, to detect an object in the incoming image or to segment a foreground object from a background of the incoming image (Introduction (first paragraph): "Image classification is process including image preprocessing, image segmentation, key feature extraction and matching identification,"; Section III (second paragraph): “Based on this idea, we build a simple Convolutional neural network on image classification,”; Guo teaches a model of a convolutional neural network  used for image classification. Guo describes image classification as preprocessing (i.e. recognize an image), segmentation and key feature extraction (i.e. detect an object in the incoming image or to segment a foreground object from a background), and matching identification (i.e. generate a label corresponding to the incoming image)).
Howard, Shi, and Guo are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard and Guo before him or her to modify the second model of Howard and Shi to include the image classification ability set forth in Guo, to obtain the advantage of using the improved second model of the network for image classification (Guo; 
Regarding claim 22, Howard and Shi teach the apparatus according to claim 12. Howard does not explicitly teach that the…model of the convolutional neural network is utilized to recognize an incoming image, and the…model is further utilized to generate a label corresponding to the incoming image, to detect an object in the incoming image or to segment a foreground object from a background of the incoming image.
Guo teaches that the…model of the convolutional neural network is utilized to recognize an incoming image, and the…model is further utilized to generate a label corresponding to the incoming image, to detect an object in the incoming image or to segment a foreground object from a background of the incoming image (Introduction (first paragraph): "Image classification is process including image preprocessing, image segmentation, key feature extraction and matching identification,"; Section III (second paragraph): “Based on this idea, we build a simple Convolutional neural network on image classification,”; Guo teaches a model of a convolutional neural network  used for image classification. Guo describes image classification as preprocessing (i.e. recognize an image), segmentation and key feature extraction (i.e. segment a foreground object from a background), and matching identification (i.e. generate a label corresponding to the incoming image)).
Howard and Guo are analogous art because they are from the same field of endeavor in neural networks. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Howard and Guo before him or her to modify the second model of Howard to include the image classification ability set forth in Guo, thereafter the second model is connected to image classification. The motivation for doing so would be obtaining the advantage of using the improved second model of the network for image classification. Therefore, it would have been 
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MAXWELL EDWARD MIKA whose telephone number is (571)272-2654. The examiner can normally be reached 7:30 AM - 5:30 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/MAXWELL EDWARD MIKA/ Examiner, Art Unit 4112
/MICHAEL J HUNTLEY/ Supervisory Patent Examiner, Art Unit 2129