Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

Status of Claims
The amendment filed on February 24, 2022 has been entered. The status of the claims is as follows:
Claims 1-7, 9, and 11-16 are currently pending of which claims 1-7, 9, and 11-16 are amended, and claims 8 and 10 are canceled in the amendment filed on February 24, 2022.

Response to Arguments
            The amendment and arguments filed on 2/24/2022 have been fully considered. The examiner’s response is delineated as follows.
(a)       Response to Arguments Regarding Claims Objections: The objections to claims 10-12 are withdrawn in view of the cancellation of claim 10 and amendment to claims 11-12.
(b)       Response to Arguments Regarding Invocation of 35 U.S.C. § 112(f): Claims 1-7, 9, and 11-16, as amended, recite neither “means for” or “step for” language nor generic placeholders. Therefore, claims 1-7, 9, and 11-16 no longer invoke 35 U.S.C. § 112(f).
(c)        Response to Arguments Regarding Rejection of Claims under 35 U.S.C. § 112(b):
(1)	The rejections of claims in subsection 7(a)-(e) under 35 U.S.C. § 112(b) are withdrawn in light of Applicant’s amendment to the claims.
(2)	The rejections of claims in subsection 8(a)-(b) and (c)(2) under 35 U.S.C. § 112(b) are withdrawn in light of Applicant’s amendment to the claims.
(3)	The rejections of claims in subsection 8(c)(1) under 35 U.S.C. § 112(b) are maintained because Applicant’s amendment to claim 4 does not cover the two recitations of “each of layers”.
(d)       Response to Arguments Regarding Rejection of Claims under 35 U.S.C. § 101: Applicant’s arguments have been considered. The rejections of claims 1-7 and 9-16 under 35 U.S.C. § 101 are withdrawn in light of the amendment to the claims.

Claim Interpretation
Claim 1, as amended, recites the limitation “remove a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters”.  Claims 12-16 also recite substantially similar limitations.  This claimed limitation is contingent upon the occurrence of the condition precedent “if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters”.  Therefore, the claimed removal of a layer may or may not occur, depending on whether the aforementioned condition occurs. For example, removing a layer is not required when the condition precedent does not occur, that is, when the operation to be performed in the layer has not been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters.  See MPEP §§ 2111.04(I)-(II) and 2103(C).  For purpose of examination, this claimed limitation is interpreted as removing a layer from the at least one sub-structure after an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer.

Claim Objections
Claims 12-16 are objected to because of the following informalities:
(a)	The limitation “save the multilayer neural network model in the one or more memories, the multilayer neural network model is generated by extracting at least one sub-structure from the multilayer neural network model, wherein each of the at least one sub-structure has a plurality of layers which include a quantization layer as a tail layer” is missing a semi-colon at the end. Appropriate correction is required.
(b)	Claim 13: the limitation “wherein each of the at least one sub-structure has a plurality of layers which include a quantization layer as a tail layer” is missing a semi-colon at the end.
(c)	Claims 14 and 16: the limitation “the multilayer neural network model is generated by;” respectively recited in claims 14 and 16 should end with a colon, not a semi-colon.
(d)	Claim 15: the limitation “wherein each of the at least one sub-structure has a plurality of layers which include a quantization layer as a tail layer” is missing a semi-colon at the end.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
 
The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1-7, 9, and 11-16 stand rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
(a)      Claims 1 and 12-16: 
The claimed limitation “modifying quantization threshold parameters used for quantization in the quantization layer based on an operation parameter of the operation to be performed in the one or more layers” is not described in the present disclosure. The present disclosure merely describes updating the quantization threshold parameters in the quantization layer by equivalently transferring the operation parameters and the operation procedures of the layers other than the tail layer to the tail (quantization) layer, where the transferring is essentially a merge operation. See, e.g., ¶ [0051] of the present disclosure. For purpose of examination, “modifying quantization threshold parameters” is interpreted as transferring the operation parameters and the operation procedure of a preceding layer to the tail (quantization) layer, in which the operation parameters and the quantization parameters are merged. Claims 12-16 recite substantially similar limitations and are thus rejected accordingly, the same rationale applying.
(b)      Claims 2-7, 9, and 11 depend from claim 1 and are thus rejected due to at least their dependency from claim 1, the same rationale applying.

The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
 
The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
 
Claims 1-7, 9, and 11-16 stand rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
(a)       Claim(s) 1 and 12-16: 
(1)      The two recitations of “an operation” in the limitation “incorporate, in each of the at least one sub-structure, at least part of an operation to be performed in one or more layers other than the quantization layer into quantization in the quantization layer by …” and the limitation “remove a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters” are indefinite because it is unclear whether or not these two instances of “an operation” refer to the same operation. For purpose of examination, these two instances of “operation” are interpreted as the same operation. Correction is nevertheless required. 
(2)      Claims 12-16 recite substantially similar limitations as does claim 1 and are thus rejected accordingly, the same rationale applying.
(b)       Claims 2-7, 9, and 11 depend from claim 1 and are thus also rejected under 112(b) due to at least their dependency from claim 1.
(c)      Claim 4: The claimed “layers” in “each of layers” are indefinite because it is unclear whether the claimed “layers” in “each of layers” are the same as or different from the amended “a plurality of layers” recited in the base claim 1, and it is equally unclear whether these “layers” in “each of layers” are the same as or different from the amended “one or more layers” recited in the base claim 1. Further, claim 4 recites “each of layers” twice.  It is unclear whether the two instances of “layers” are the same “layers” or different layers because the base claim 1 recites “a plurality of layers” and “one or more layers”.  For the purpose of examination, the recited “layers” are interpreted as the same “plurality of layers” recited in the base claim 1, and the aforementioned limitation is interpreted as “wherein, for each layer of the plurality of layers other than the quantization layer …”. Appropriate correction is required. 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status. 
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim(s) 1-5, 7, 9, and 11-15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Xu et al., WO 2018/140294, effectively filed on January 25, 2017 (hereinafter Xu), in view of Umuroglu et al., Streamlined Deployment for Quantized Neural Networks, Sept. 12, 2017 (hereinafter Umuroglu), and further in view of Li et al., DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices, August 16, 2017 (hereinafter Li).
With respect to claim 1, Xu teaches:
An apparatus for transforming a multilayer neural network model, comprising: one or more processors; and (Xu, ¶ [0019]: “The special-purpose processing device 106 may further include a memory unit 108 and a processing unit 110. For example, the special-purpose processing device 106 may be a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), a processor or a Central Processing Unit (CPU) with a customized processing unit, or a Graphics Processing Unit (GPU).”)
 
one or more memories coupled to the one or more processors, the one or more memories having stored thereon instructions which, when executed by the one or more processors, cause the apparatus to: (Xu, ¶ [0019], supra.)
 
extract at least one sub-structure from the multilayer neural network model, (Xu, FIG. 3; ¶¶ [0053]-[0058] where Xu’s “convolution layer 300” includes “a binary sub-layer 308” that “convert[s]” “the weights 302” “to binary weights,” “an IBN sub-layer 316,” “a summing sub-layer 320,” “an activation sub-layer 322,” and “a quantization sub-layer 324”. FIG. 4 and ¶¶ [0061]-[0068] further describe the backward pass of a substantially similar convolutional layer 400 with backward propagation of an input 426 in the floating-point format.
The examiner notes that this limitation identifies a sub-structure for fixed-point processing from a multi-layer neural network. The word “extract” carries the definitions of “to determine by calculation” or “to select” (see Merriam-Webster). Therefore, Xu’s identifying the aforementioned sub-structure for fixed-point processing teaches extracting at least one sub-structure from the multilayer neural network model.)
 
wherein each of the at least one sub-structure has a plurality of layers which include a quantization layer as a tail layer; (Xu, FIG. 3; ¶ [0054]: the convolution layer 300 in FIG. 3 includes “a binary sub-layer 308,” “a hidden layer,” “an output layer,” “a normalization sub-layer 316,” “quantization sub-layer 318,” “a summing sub-layer 320,” “activation sub-layer 322,” and “binary layer 324”; and ¶ [0061]: the convolutional layer 400 includes quantization sub-layers 416, 424, and 428 in FIG. 4. The examiner first notes that the aforementioned layers teach that the at least one sub-structure has a plurality of layers. The examiner further notes that any smaller portion of Xu’s neural network that has multiple layers, ends with a quantization sub-layer (hence a “tail layer”), and is subject to fixed-point or binarization processing teaches the above limitation.)
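
For illustration only, the reading applied above (any multi-layer portion of a network that ends with a quantization layer) can be sketched as a simple scan over a layer sequence. The function name and the layer labels below are hypothetical; they are loosely modeled on the sub-layers named in Xu's FIG. 3 and are not taken from Xu or from the present disclosure.

```python
from typing import List

def extract_substructures(layers: List[str], tail: str = "quantization") -> List[List[str]]:
    """Split a layer sequence into runs of two or more layers that end with `tail`."""
    substructures: List[List[str]] = []
    current: List[str] = []
    for name in layers:
        current.append(name)
        if name == tail:
            # A sub-structure needs a plurality of layers, so a lone tail is skipped.
            if len(current) > 1:
                substructures.append(current)
            current = []
    # Trailing layers that never reach a tail layer do not form a sub-structure.
    return substructures

# Hypothetical layer labels, for illustration only.
model = ["binary_conv", "pooling", "batch_norm", "quantization",
         "summing", "activation", "quantization", "fully_connected"]
subs = extract_substructures(model)
```

Under these assumptions the scan yields two sub-structures, each with a quantization layer as its tail layer, and leaves the final layer outside any sub-structure.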
 
incorporate, in each of the at least one sub-structure, at least part of an operation to be performed in one or more layers other than the quantization layer into quantization in the quantization layer by (Xu, FIG. 3 (annotated):

[Figure: Xu, FIG. 3 (annotated); media_image1.png]

¶ [0065]: “output of the activation sub-layer 422 is provided to the summing sub-layer 420, which corresponds to the summing sub-layer 320, and the gradients of the loss function with respect to two inputs of the summing sub-layer 320 may be determined. Because an input of the sub-layer 320 is the bias, the gradient of the loss function with respect to the bias may be determined and the gradient is provided to the quantization sublayer 428”. ¶ [0051]: “the output of each layer is used as the input of the next layer starting from the head layer until transferring to the quantization layer”
The examiner thus notes that Xu’s weights and biases teach the claimed operation parameters that pertain to an operation, and that Xu’s passing an output of a layer (e.g., the product of a weight matrix and input feature vector of the layer) to the next layer and eventually to a quantization sub-layer incorporates at least part of the aforementioned convolution operation into the quantization sub-layer (e.g., 318 or 324 in FIG. 3) and hence teaches the above limitation.)
 
Xu does not appear to explicitly teach:
modifying quantization threshold parameters used for quantization in the quantization layer based on an operation parameter of the operation to be performed in the one or more layers; and 
remove a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters.
 
Umuroglu does, however, teach:
modifying quantization threshold parameters used for quantization in the quantization layer based on an operation parameter of the operation to be performed in the one or more layers; and (Umuroglu, § 2-(3) “Non-integer quantization levels” “2-bit uniform quantization”; § 2.1.1 “Quantization as successive thresholding” – “Given a set of threshold values t = {t0, t1 . . . tn}, the successive thresholding function T (x, t) maps any real number x to an integer in the interval [0,n], where the returned integer is the number of thresholds that x is greater than or equal to:” “T(x, t) =” “0, for x≤t0”; “1, for t0<x≤t1” … “n-1, for tn-2<x≤tn-1”; and “n, for tn-1<x”. ¶ 2, § 2.1.1, p. 2: “Any uniform quantizer Q(x) can be expressed as successive thresholding followed by a linear transformation such that Q(x) = a · T (x) + b.” 
The examiner notes that Umuroglu’s set of threshold values (“t = {t0, t1 . . . tn}” above) teaches quantization threshold parameters used for quantization in the quantization layer, and that Umuroglu’s transferring or merging the operation parameters (e.g., weights from the preceding summing sub-layer 312 or 320) with the aforementioned quantization threshold parameters for the quantization layer teaches modifying quantization threshold parameters based on an operation parameter (e.g., the weights 302 converted by the binary sub-layer 308 in Xu’s FIG. 3, supra) of the operation to be performed in the one or more layers (e.g., the summing sub-layer 312 or 320 in Xu’s FIG. 3, supra). Therefore, the examiner asserts that Umuroglu renders the above limitation obvious.)
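
For illustration only, the successive thresholding quoted from Umuroglu § 2.1.1 can be sketched as follows. This is a minimal sketch, not Umuroglu's implementation: it follows the quoted prose ("the number of thresholds that x is greater than or equal to"), so a value equal to a threshold counts that threshold, and the function names and numeric values are assumptions chosen for the example.

```python
from bisect import bisect_right
from typing import Sequence

def successive_threshold(x: float, thresholds: Sequence[float]) -> int:
    """Return how many thresholds t satisfy x >= t (thresholds sorted ascending)."""
    return bisect_right(thresholds, x)

def uniform_quantize(x: float, thresholds: Sequence[float], a: float, b: float = 0.0) -> float:
    """Q(x) = a * T(x) + b: thresholding followed by a linear transformation."""
    return a * successive_threshold(x, thresholds) + b

# Illustrative values only; they do not come from any cited reference.
example_thresholds = [1.0, 2.0, 3.0]
level_low = successive_threshold(0.2, example_thresholds)
level_mid = successive_threshold(2.5, example_thresholds)
level_edge = successive_threshold(3.0, example_thresholds)
q_mid = uniform_quantize(2.5, example_thresholds, a=0.5)
```

With these assumed thresholds, quantizing an input reduces to counting thresholds and then applying the scale `a` and offset `b`.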

Xu and Umuroglu are analogous because both Xu and Umuroglu pertain to reducing the computational complexity and memory footprint, especially for devices with limited compute and memory capacity such as mobile phones, by using fixed-point processing.
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to have modified Xu’s “neural network” (¶ [0002]) with Umuroglu’s modifying quantization threshold parameters used for quantization (Umuroglu, supra). The modification not only provides the floating-point values for quantization threshold parameters to best approximate the input floating-point values to respective integers but also turns quantization of input floating-point values into simply counting the number of quantization threshold parameters that the input floating-point values are greater than or equal to (Umuroglu, § 2-(3): “Non-integer quantization levels. The chosen quantization levels in a QNN may be floating point values to best approximate the underlying value distribution.”)
 
Xu modified by Umuroglu does not appear to explicitly teach:
remove a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters.
 
Li does, however, teach: 
remove a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters. (Li, p. 3, § 3, ¶ 2: “The idea of our method is to merge these highly correlated layers and substitute them as a new “slim” layer from the analysis and modeling of the correlations of the current layer and preceding layers (or parallel layers).”; p. 3, § 3.1, ¶ 4: “Method The streamline slimming regenerates a new tensor layer (i.e., slim layer) by merging non-tensor layers with its bottom tensor units in the feed-forward structure.” P. 4, § 3.1, left-hand column, ¶ 1: “Pooling Layer: The pooling layer down-samples feature maps learned from previous layers. Therefore, to absorb a pooling layer to a convolution layer, we remove the pooling layer and set the stride value of the new convolution layer as the product of the stride values for both the original pooling layer and the convolution layer.”
The examiner notes that Li’s merging a non-tensor layer with its bottom tensor units to regenerate a new tensor layer teaches removing a layer (e.g., the non-tensor layer) from the at least one sub-structure, and that one or more bottom units render the claimed quantization layer obvious. The examiner further notes that Li’s merging the non-tensor layer with its bottom tensor units into a new layer teaches an operation to be performed in the layer (e.g., the aforementioned non-tensor layer) has been completely incorporated into the quantization layer (e.g., Li’s merged layer). Therefore, the examiner asserts that Li, when combined with Umuroglu and Xu, teaches the above limitation.)
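
For illustration only, the pooling-layer absorption quoted from Li § 3.1 (remove the pooling layer and set the new convolution stride to the product of the two strides) can be sketched as below. The class and function names are hypothetical, and keeping the convolution kernel size unchanged is an assumption of this sketch. The merged layer is not numerically identical to the original pair; per the passage of Li quoted in this action, the merged layers are relearned.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConvLayer:
    kernel: int
    stride: int

@dataclass(frozen=True)
class PoolLayer:
    kernel: int
    stride: int

def absorb_pooling(conv: ConvLayer, pool: PoolLayer) -> ConvLayer:
    """Drop the pooling layer and fold its stride into a new convolution layer."""
    # Assumption of this sketch: the kernel size of the convolution is kept as-is.
    return ConvLayer(kernel=conv.kernel, stride=conv.stride * pool.stride)

# Illustrative shapes only.
merged = absorb_pooling(ConvLayer(kernel=3, stride=1), PoolLayer(kernel=2, stride=2))
```

The result is a single layer where there were two, which is the sense in which a layer is removed after its operation has been merged into a neighboring layer.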
 
Xu, Umuroglu, and Li are analogous because all three references pertain to reducing the computational complexity and memory footprint, especially for devices with limited compute and memory capacity such as mobile phones, by using fixed-point processing. 
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to have modified Xu in view of Umuroglu to incorporate Li’s “DeepRebirth” that removes a layer from a sub-structure where an operation in the layer has been completely incorporated into a quantization layer (Li, supra). The modification accelerates the model execution at least in non-tensor layers without requiring the neural network to be trained from scratch (Li, p. 3, § 3, ¶ 1: “To accelerate the model execution in non-tensor layers, we propose DeepRebirth to accelerate the model execution at both streaming substructure and branching substructure. The idea of our method is to merge these highly correlated layers and substitute them as a new “slim” layer from the analysis and modeling of the correlations of the current layer and preceding layers (or parallel layers).” p. 8, § 5, ¶ 2: “In addition, our work slims a well-trained network by relearning the merged rebirth layers and does not require to train from scratch.”)
 
With respect to claim 2, Xu modified by Umuroglu and Li teaches the apparatus according to claim 1, and Xu further teaches: 
wherein a binary convolution layer is included in each of the at least one sub-structure. (Xu, ¶ [0054]: “For example, the binary sub-layer 308 may convert the fixed-point weights 302 into binary weights 310 by a sign function, as shown in equation (1). Moreover, the convolutional layer 300 further receives an input 306, which may be represented by Wbk.” “In both cases, convolutional operation only includes integer multiplication and accumulation and may be computed by bit convolution kernels.” Eq. (8) showing the summation of the multiplication of the input x and the weight wb. The examiner notes that Xu’s convolutional layer (e.g., 300) operating on the binary weights teaches a binary convolution layer. The examiner further notes that Xu’s inclusion of the aforementioned binary layer in each of the aforementioned at least one sub-structure teaches the above limitation.)
 
With respect to claim 3, Xu modified by Umuroglu and Li teaches the apparatus according to claim 2, and Xu further teaches: 
wherein the binary convolution layer is included as a head layer in each of the at least one sub-structure. (Xu, ¶ [0054]: “convolutional operation only includes integer multiplication and accumulation and may be computed by bit convolution kernels”; “In some implementations, if the convolutional layer 300 is the first layer, it may be processed according to equation (8) (reproduction omitted) where x represents an input 306 in an 8-bit fixed-point format, w represents a binary weight and xn represents the mantissa of the n-th element of vector x.”; FIG. 3 cited for claim 1, supra.
The examiner notes that Xu’s sub-structure including the binary convolution layer (e.g., the summing sub-layer 312 that operates on binary weights 310), then the pooling sub-layer 314, the IBN sub-layer 316, and then the quantization sub-layer 318 (or additionally the summing sub-layer 320 followed by the activation sub-layer 322 and then another quantization sub-layer 324) in FIG. 3 includes a convolution layer at the head of the substructure and a quantization sub-layer (318 or 324) at the tail of the substructure. See further details and description in FIG. 3 and ¶¶ [0054]-[0058].) 
 
With respect to claim 4, Xu modified by Umuroglu and Li teaches the apparatus according to claim 1, and Xu further teaches: 
wherein, for each of layers other than the quantization layer in the at least one sub-structure, the operation parameters in an upper layer are transferred to a lower layer from top to bottom, until the operation parameters in each of layers are transferred to the quantization layer. (Xu, ¶¶ [0054]-[0058] teach that the output of the convolutional operation, which is the product of the input 306 and the weight wb (310), is forward passed to the summing sub-layer 312; that the output of the summing sub-layer 312, which is the summation of the aforementioned products, is forwarded to a pooling sub-layer 314; that the output of the pooling sub-layer 314, which pools its input, is forwarded to an IBN sub-layer 316; and that the output of the IBN sub-layer 316 is forwarded to a quantization sub-layer 318.
The examiner notes that the present disclosure describes “transferring” as “essentially a merge operation” and provides an example where “the output of each layer is used as the input of the next layer starting from the head layer until transferring to the quantization layer” (see e.g., ¶ [0051] of the present disclosure). The examiner thus asserts that Xu’s transferring parameters (e.g., the weights in the weight matrix) from an upper layer (e.g., the head or the convolution layer 312 in the sub-structure) to a lower layer (e.g., to the pooling layer 314 then to the integer batch normalization layer 316) before the output of the IBN layer 316 reaches the tail (e.g., the quantization layer 318) in the sub-structure teaches this limitation.)
 
With respect to claim 5, Xu modified by Umuroglu and Li teaches the apparatus according to claim 2, and Xu further teaches: 
wherein the quantization threshold parameters in the quantization layer are modified based on a scaling coefficient parameter used for convolution in the binary convolution layer. (Xu, ¶ [0043]: “weights and gradients can be stored in a fixed-point format” that “includes a l-bit signed integer mantissa and a global scaling factor”. Eq. (8) in ¶ [0054] where “x represents an input 306”, and “wb represents a binary weight and xn represents the mantissa of the n-th element of vector x”. FIG. 3 (annotated):

[Figure: Xu, FIG. 3 (annotated); media_image1.png]

The examiner notes that Xu teaches transferring parameters from, for example, a binary convolution layer to a lower layer from top to bottom in the sub-structure, as discussed for claim 4, supra. The examiner further notes that Xu describes storing the weights, which are used in the binary convolution layer, in a fixed-point format that includes a signed integer mantissa and a global scaling factor, and thus teaches a scaling coefficient used for the binary convolution layer, and that Xu’s convolutional operation includes the multiplication of the weights, including the scaling factor, and the corresponding input elements. Further, the examiner notes that FIG. 3 illustrates that the output of the convolutional operation performed on the weights, and hence the scaling factor, is forward passed to the summing sub-layer 312 and eventually to the quantization sub-layer. Therefore, the examiner asserts that Xu’s forward passing this output of the convolutional operation (which includes the weights and hence the scaling factor) to the quantization sub-layer, when combined with Umuroglu, teaches the above limitation.)
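
For illustration only, one generic way a scaling coefficient from a preceding layer could be reflected in quantization thresholds is the identity that, for a positive scale s, thresholding s*x against thresholds t gives the same result as thresholding x against t/s. The sketch below demonstrates that identity; it is an assumption-laden example and is not asserted to be the method of Xu, Umuroglu, or the present disclosure. All names and values are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence

def count_thresholds(x: float, thresholds: Sequence[float]) -> int:
    """Number of thresholds t with x >= t (thresholds sorted ascending)."""
    return bisect_right(thresholds, x)

def fold_scale_into_thresholds(thresholds: Sequence[float], scale: float) -> List[float]:
    """Return thresholds t / scale so the scaling step before quantization can be skipped."""
    if scale <= 0:
        # A negative scale would reverse the comparisons; this sketch does not handle it.
        raise ValueError("scale must be positive")
    return [t / scale for t in thresholds]

# Illustrative values only (powers of two keep the float arithmetic exact).
scale = 0.25
original = [0.5, 1.0, 1.5]
folded = fold_scale_into_thresholds(original, scale)
inputs = [-1.0, 1.9, 2.0, 3.7, 4.0, 6.5]
with_scaling = [count_thresholds(scale * x, original) for x in inputs]
without_scaling = [count_thresholds(x, folded) for x in inputs]
```

Both lists of quantized levels agree for these inputs, so the separate scaling step becomes redundant once the thresholds have been adjusted.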
 
With respect to claim 7, Xu modified by Umuroglu and Li teaches the apparatus according to claim 1 but does not appear to explicitly teach: 
wherein the modified quantization threshold parameters are expressed by a base coefficient, a shift coefficient, and a correlation relationship between the quantization threshold parameters with respect to the base coefficient. 
Umuroglu does, however, teach:
wherein the modified quantization threshold parameters are expressed by a base coefficient, a shift coefficient, and a correlation relationship between the quantization threshold parameters with respect to the base coefficient. (Umuroglu, § 2, last paragraph: HWGQ(x) = 0 for x≤t0; HWGQ(x) = 0.538 for t0 < x ≤ 0.807; HWGQ(x) = 1.076 for 0.807 < x ≤ 1.345; HWGQ(x) = 1.614 for 1.345 < x.
The examiner first notes that simple arithmetic gives that if we let β = 0.538, we obtain that 1.076 = 2*β, 1.614 = 3*β, 0.807 = 1.5*β, and 1.345 = 2.5*β.  Further, taking t0 as the shift coefficient and β = 0.538 as the base coefficient, the above HWGQ with α-scaling can be rewritten as HWGQ(x) = 0*β for x ≤ t0; HWGQ(x) = 1*β for t0 < x ≤ 1.5*β + t0; HWGQ(x) = 2*β for 1.5*β + t0 < x ≤ 2.5*β + t0; HWGQ(x) = 3*β for 2.5*β + t0 < x (note that Umuroglu starts with t0 and then for simplicity uses 0 to simplify the above formulation). The examiner further notes that Umuroglu was previously cited to teach the quantization threshold parameters.  See basis and rationale for claim 1 above. Therefore, the examiner asserts that Umuroglu teaches a base coefficient (β above), a shift coefficient (t0 above), and a correlation relationship between quantization threshold parameters (see citations and rationale for claim 1, supra) with respect to the base coefficient (β above) as recited in the claims and described in ¶¶ [0031]-[0035] of the disclosure.)
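
For illustration only, the arithmetic in the note above can be checked with a short sketch. The constant and function names are hypothetical; the shift is taken as 0, consistent with the note that Umuroglu uses 0 for t0, and the multipliers 0, 1.5, and 2.5 are read off the HWGQ values quoted above.

```python
from typing import List

BASE = 0.538   # base coefficient read from the quoted HWGQ levels
SHIFT = 0.0    # shift coefficient; t0 is taken as 0 in this sketch

def hwgq_thresholds(base: float, shift: float) -> List[float]:
    """Thresholds expressed as multiples of the base coefficient plus the shift."""
    return [round(shift + k * base, 3) for k in (0.0, 1.5, 2.5)]

def hwgq_levels(base: float) -> List[float]:
    """Output levels 0*base, 1*base, 2*base, 3*base of the 2-bit quantizer."""
    return [round(i * base, 3) for i in range(4)]

thresholds = hwgq_thresholds(BASE, SHIFT)
levels = hwgq_levels(BASE)
```

Under these assumptions the thresholds come out as 0.0, 0.807, and 1.345 and the levels as 0.0, 0.538, 1.076, and 1.614, matching the values quoted from Umuroglu.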
 
Xu, Umuroglu, and Li are analogous because all three references pertain to reducing the computational complexity and memory footprint, especially for devices with limited compute and memory capacity such as mobile phones, by using fixed-point processing. 
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to have modified Xu in view of Umuroglu and Li to further incorporate Umuroglu’s expressing modified quantization threshold parameters by a base coefficient, a shift coefficient, and a correlation relationship between the quantization threshold parameters with respect to the base coefficient (Umuroglu, supra). The modification not only provides the floating-point values for quantization threshold parameters to best approximate the input floating-point values to respective integers but also turns quantization of input floating-point values into simply counting the number of quantization threshold parameters that the input floating-point values are greater than or equal to (Umuroglu, § 2-(3): “Non-integer quantization levels. The chosen quantization levels in a QNN may be floating point values to best approximate the underlying value distribution. For instance, the state-of-the-art QNNs produced by HWGQ [1] use the following function for 2-bit uniform quantization:

[Image media_image2.png: HWGQ function for 2-bit uniform quantization]
”. § 2.1.1: “Quantization as successive thresholding. Given a set of threshold values t = {t0, t1. . . tn}, the successive thresholding function T(x, t) maps any real number x to an integer in the interval [0,n], where the returned integer is the number of thresholds that x is greater than or equal to”.)
 
With respect to claim 9, Xu modified by Umuroglu and Li teaches the apparatus according to claim 1, and Xu further teaches: 
wherein the instructions, when executed by the one or more processors, further cause the apparatus to perform fixed-point processing for each of layers in the multilayer neural network model, such that floating-point operation parameters are converted into fixed-point parameters. (Xu, ¶ [0066]: “an additional quantization sub-layer 416 is utilized after the IBN sub-layer 418 for converting the floating-point format into a fixed-point format.” The examiner notes that Xu’s converting the floating-point parameters into fixed-point parameters teaches performing fixed-point processing for each of layers in the multilayer neural network model, and that the fixed-point processing converts the floating-point format into a fixed-point format for each layer.) 
 
With respect to claim 11, Xu modified by Umuroglu and Li teaches the apparatus according to claim 12, and Umuroglu further teaches: 
wherein quantization threshold parameters in the multilayer neural network model saved in the one or more memories are expressed by a base coefficient, a shift coefficient, and a correlation relationship between the quantization threshold parameters with respect to the base coefficient; (Umuroglu, § 2 “uniform quantization” HWGQ(x) = 0 for x≤t0; HWGQ(x) = 0.538 for t0 < x ≤ 0.807; HWGQ(x) = 1.076 for 0.807 < x ≤ 1.345; HWGQ(x) = 1.614 for 1.345 < x. The examiner notes that simple arithmetic gives that, if we let β = 0.538, then 1.076 = 2*β, 1.614 = 3*β, 0.807 = 1.5*β, and 1.345 = 2.5*β.  Further, if we let α = t0 and β = 0.538, the above HWGQ with α-scaling can be rewritten as HWGQ(x) = 0*β for x ≤ α; HWGQ(x) = 1*β for α < x ≤ 1.5*β + α; HWGQ(x) = 2*β for 1.5*β + α < x ≤ 2.5*β + α; HWGQ(x) = 3*β for 2.5*β + α < x (note that Umuroglu starts with t0 and then uses 0 to simplify the above formulation). 
The examiner further notes that Umuroglu was previously cited to teach the quantization threshold parameters.  See basis and rationale for claim 1 above. Therefore, the examiner asserts that Umuroglu teaches a base coefficient (β above), a shift coefficient (α above), and a correlation relationship between quantization threshold parameters with respect to the shift coefficient (α above) as recited in the claims and described in ¶¶ [0031]-[0035] of the disclosure.) 
 
the instructions, when executed by the one or more processors, further cause the apparatus to determine the quantization threshold parameters by using the base coefficient, the shift coefficient, and the correlation relationship between the quantization threshold parameters with respect to the base coefficient when an input data set is operated in a quantization layer from top to bottom in the network model, and (Umuroglu, § 2-(3) “Non-integer quantization levels” “2-bit uniform quantization”; § 2.1.1 “Quantization as successive thresholding” – “Given a set of threshold values t = {t0, t1 . . . tn}, the successive thresholding function T (x, t) maps any real number x to an integer in the interval  [0,n], where the returned integer is the number of thresholds that x is greater than or equal to:” “T(x, t) =” “0, for x≤t0”; “1, for t0<x≤t1” … “n-1, for tn-2<x≤tn-1”; and “n, for tn-1<x”. ¶ 2, § 2.1.1, p. 2: “Any uniform quantizer Q(x) can be expressed as successive thresholding followed by a linear transformation such that Q(x) = a · T (x) + b. As an example, the 2-bit uniform HWGQ quantizer can be expressed as HWGQ(x)=0.538 * T(x, t)”. 
The examiner notes that Umuroglu teaches expressing modified threshold parameters by a base coefficient, a shift coefficient, and a correlation relationship (see citation and rationale for base claim 10, supra). The examiner further notes that Umuroglu’s successive thresholding for threshold values {t0, t1 … tn} based on HWGQ teaches determining the quantization threshold parameters by using the base coefficient, the shift coefficient, and the correlation relationship with which HWGQ is represented. The examiner also notes Umuroglu’s quantizer’s transforming each input value of the input (x) into quantized value according to the aforementioned thresholding function (T) teaches when an input data set is operated in a quantization layer from top to bottom in the network model. Therefore, the examiner asserts that Umuroglu teaches the above limitation.)
 
perform quantization processing on the data set based on the determined quantization threshold parameters. (Umuroglu, § 2.1, p. 2: “Through a process we call streamlining, we show how the forward pass through any QNN layer with uniform-quantized activations and weights can be computed using only integer operations. This consists of the following three steps:” “2.1.1 Quantization as successive thresholding.” “2.1.2 Moving and collapsing linear transformations.” “2.1.3 Absorbing linear operations into thresholds.” The examiner notes that Umuroglu teaches determining quantization threshold parameters, supra.  The examiner further notes that Umuroglu’s forward passing through any QNN layer with the successive quantization thresholding to transform input values of the input (x) into quantized values thus teaches the above limitations.)
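As a minimal sketch only, and not as a statement of Umuroglu's actual implementation, the successive thresholding quoted above might be written in Python as follows. The function names are hypothetical; the comparison follows the piecewise form as quoted (0 for x ≤ t0, 1 for t0 < x ≤ t1, and so on), and the threshold values assume t0 = 0.

```python
from bisect import bisect_left

# Thresholds of the quoted 2-bit HWGQ quantizer, assuming t0 = 0.
HWGQ_THRESHOLDS = [0.0, 0.807, 1.345]
HWGQ_SCALE = 0.538


def successive_threshold(x, thresholds):
    """Return the number of thresholds strictly below x.

    `thresholds` must be sorted in ascending order. bisect_left returns
    the index of the first threshold that is >= x, which equals the count
    of thresholds strictly below x.
    """
    return bisect_left(thresholds, x)


def hwgq(x):
    """Quantize x as a linear transformation of the integer code,
    Q(x) = a * T(x) + b with a = 0.538 and b = 0."""
    return HWGQ_SCALE * successive_threshold(x, HWGQ_THRESHOLDS)


# Integer codes for a small data set.
codes = [successive_threshold(x, HWGQ_THRESHOLDS) for x in (-1.0, 0.5, 1.0, 2.0)]
```

In this sketch the per-element work reduces to comparisons against stored thresholds, which is consistent with the integer-only forward pass described in the quoted passage.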
 
Xu, Umuroglu, and Li are analogous because all three references pertain to reducing the computational complexity and memory footprint, especially for devices with limited compute and memory capacity such as mobile phones, by using fixed-point processing. 
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to have modified Xu in view of Umuroglu and Li to further incorporate Umuroglu’s determining quantization threshold parameters by using the base coefficient, the shift coefficient, and the correlation relationship and quantizing an input data set based on the determined quantization threshold parameters (Umuroglu, supra). The modification not only provides the floating-point values for quantization threshold parameters to best approximate the input floating-point values to respective integers but also turns quantization of input floating-point values into simply counting the number of quantization threshold parameters that the input floating-point values are greater than or equal to (Umuroglu, § 2-(3): “Non-integer quantization levels. The chosen quantization levels in a QNN may be floating point values to best approximate the underlying value distribution. For instance, the state-of-the-art QNNs produced by HWGQ [1] use the following function for 2-bit uniform quantization:

[Image media_image2.png: HWGQ function for 2-bit uniform quantization]
”. § 2.1.1: “Quantization as successive thresholding. Given a set of threshold values t = {t0, t1. . . tn}, the successive thresholding function T(x, t) maps any real number x to an integer in the interval [0,n], where the returned integer is the number of thresholds that x is greater than or equal to”.)
 
With respect to claim 12, Xu teaches: 
An apparatus for applying a multilayer neural network model, comprising: one or more processors; and (Xu, ¶ [0019]: “The special-purpose processing device 106 may further include a memory unit 108 and a processing unit 110. For example, the special-purpose processing device 106 may be a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), a processor or a Central Processing Unit (CPU) with a customized processing unit, or a Graphics Processing Unit (GPU).”)
 
one or more memories coupled to the one or more processors, the one or more memories having stored thereon instructions which, when executed by the one or more processors, cause the apparatus to: (Xu, ¶ [0019], supra.)
 
save the multilayer neural network model in the one or more memories, (Xu, FIG. 2 and ¶ [0028]: “the CNN 200 includes an input layer 202, convolutional layers 204 and 208, pooling layers 206 and 210, and an output layer 212”. ¶ [0128]: “The special-purpose processing device comprises: a memory module configured to store parameters of a layer of a neural network in a first fixed-point format, the parameters in the first fixed-point format having a predefined bit-width”. The examiner notes that Xu’s processing device or a portion thereof that is utilized to store its CNN teaches this limitation.)
 
the multilayer neural network model is generated by extracting at least one sub-structure from the multilayer neural network model, (Xu, FIG. 3; ¶¶ [0053]-[0058] where Xu’s “convolution layer 300” includes “a binary sub-layer 308” that “convert[s]” “the weights 302” “to binary weights,” “an IBN sub-layer 316,” “a summing sub-layer 320,” “an activation sub-layer 322,” and “a quantization sub-layer 324”. FIG. 4 and ¶¶ [0061]-[0068] further describe the backward pass of a substantially similar convolutional layer 400 with backward propagation of an input 426 in the floating-point format. 
The examiner notes that this limitation identifies a sub-structure for fixed-point processing from a multi-layer neural network. The word “extract” carries the definitions of “to determine by calculation” or “to select” (see Merriam-Webster). Therefore, Xu’s identifying the aforementioned sub-structure for fixed-point processing teaches extracting at least one sub-structure from the multilayer neural network model.)
 
wherein each of the at least one sub-structure has a plurality of layers which include a quantization layer as a tail layer (Xu, FIG. 3; ¶ [0054]: the convolution layer 300 in FIG. 3 includes “a binary sub-layer 308,” “a hidden layer,” “an output layer,” “a normalization sub-layer 316,” “quantization sub-layer 318,” “a summing sub-layer 320,” “activation sub-layer 322,” “binary layer 324”; and ¶ [0061]: the convolutional layer 400 includes “quantization sub-layer” 416, 424, and 428 in FIG. 4. The examiner first notes the aforementioned layers teach that the at least one sub-structure has a plurality of layers. The examiner further notes that any smaller portion of Xu’s neural network that has multiple layers, ends with a quantization sub-layer (hence a “tail layer”), and is subject to fixed-point or binarization processing teaches the above limitation.)
 
incorporating, in each of the at least one sub-structure, at least part of an operation to be performed in one or more layers other than the quantization layer into quantization in the quantization layer by (Xu, FIG. 3 (annotated):

[Image media_image1.png: Xu, FIG. 3 (annotated)]

¶ [0065]: “output of the activation sub-layer 422 is provided to the summing sub-layer 420, which corresponds to the summing sub-layer 320, and the gradients of the loss function with respect to two inputs of the summing sub-layer 320 may be determined. Because an input of the sub-layer 320 is the bias, the gradient of the loss function with respect to the bias may be determined and the gradient is provided to the quantization sublayer 428”. ¶ [0051]: “the output of each layer is used as the input of the next layer starting from the head layer until transferring to the quantization layer”
The examiner thus notes that Xu’s weights and biases teach the claimed operation parameters that pertain to an operation, and that Xu’s passing an output of a layer (e.g., the product of a weight matrix and input feature vector of the layer) to the next layer and eventually to a quantization sub-layer incorporates at least part of the aforementioned convolution operation into the quantization sub-layer (e.g., 318 or 324 in FIG. 3) and hence teaches the above limitation.)
 
input, to the saved multilayer neural network model, a data set corresponding to a task requirement that is executable by the multilayer neural network model; and (Xu, ¶ [0128]: “The special-purpose processing device comprises: a memory module configured to store parameters of a layer of a neural network in a first fixed-point format, the parameters in the first fixed-point format having a predefined bit-width; an interface module configured to receive an input to the layer; a data access module configured to read the parameters of the layer from the memory module; and a computing module configured to compute, based on the input of the layer and the read parameters, an output of the layer through a fixed-point operation.” ¶ [0034]: “Examples of the optimization solutions include but not limited to stochastic gradient descent algorithm, adaptive momentum estimation (ADAM) method and the like. Therefore, the errors between the classification scores obtained by the convolutional neural network and labels of each image can be lowered as much as possible for the data in the training data set.” The examiner notes that Xu’s “interface module,” “the training data set,” and optimization to lower the errors respectively teach an inputting module, a data set, and a task requirement.  The examiner further notes that Xu’s performing optimization on its CNN to lower the errors teaches the task requirement that is executable by the multilayer neural network model.)
 
operate on the data set in each of layers from top to bottom in the multilayer neural network model and output results. (Xu, ¶ [0128]: “The special-purpose processing device comprises: a memory module configured to store parameters of a layer of a neural network in a first fixed-point format, the parameters in the first fixed-point format having a predefined bit-width; an interface module configured to receive an input to the layer; a data access module configured to read the parameters of the layer from the memory module; and a computing module configured to compute, based on the input of the layer and the read parameters, an output of the layer through a fixed-point operation.” ¶ [0034]: “training data set”; ¶ [0098]: “data set CIRFA-30”. FIG. 2 and ¶ [0028]: “the CNN 200 includes an input layer 202, convolutional layers 204 and 208, pooling layers 206 and 210, and an output layer 212”. The examiner notes that Xu’s “computing module” that computes the output based on the input and the read parameters for the aforementioned data teaches an operating module.  Xu’s FIG. 2 further teaches operating on a data set in each of the input layer 202, the convolutional layers 204 and 208, and the pooling layers 206 and 210, from top to bottom, to output results at the output layer 212.)
 
Xu does not appear to explicitly teach:
modifying quantization threshold parameters used for quantization in the quantization layer based on an operation parameter of the operation to be performed in the one or more layers; 
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters; 
 
Umuroglu does, however, teach:
modifying quantization threshold parameters used for quantization in the quantization layer based on an operation parameter of the operation to be performed in the one or more layers; (Umuroglu, § 2-(3) “Non-integer quantization levels” “2-bit uniform quantization”; § 2.1.1 “Quantization as successive thresholding” – “Given a set of threshold values t = {t0, t1 . . . tn}, the successive thresholding function T (x, t) maps any real number x to an integer in the interval [0,n], where the returned integer is the number of thresholds that x is greater than or equal to:” “T(x, t) =” “0, for x≤t0”; “1, for t0<x≤t1” … “n-1, for tn-2<x≤tn-1”; and “n, for tn-1<x”. ¶ 2, § 2.1.1, p. 2: “Any uniform quantizer Q(x) can be expressed as successive thresholding followed by a linear transformation such that Q(x) = a · T (x) + b.” 
The examiner notes that Umuroglu’s set of threshold values (“t = {t0, t1 . . . tn}” above) teaches quantization threshold parameters used for quantization in the quantization layer, and that Umuroglu’s transferring or merging the operation parameters (e.g., weights from the preceding summing sub-layer 312 or 320) with the aforementioned quantization threshold parameters for the quantization layer thus teaches modifying quantization threshold parameters based on an operation parameter (e.g., the weights 302 converted by the binary sub-layer 310 in Xu’s FIG. 3, supra) of the operation to be performed in the one or more layers (e.g., the summing sub-layer 312 or 320 in Xu’s FIG. 3, supra). Therefore, the examiner asserts that Umuroglu renders the above limitation obvious.)
 
Xu and Umuroglu are analogous because both Xu and Umuroglu pertain to reducing the computational complexity and memory footprint, especially for devices with limited compute and memory capacity such as mobile phones, by using fixed-point processing. 
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to have modified Xu’s “neural network” (¶ [0002]) with Umuroglu’s modifying quantization threshold parameters used for quantization (Umuroglu, supra). The modification not only provides the floating-point values for quantization threshold parameters to best approximate the input floating-point values to respective integers but also turns quantization of input floating-point values into simply counting the number of quantization threshold parameters that the input floating-point values are greater than or equal to (Umuroglu, § 2-(3): “Non-integer quantization levels. The chosen quantization levels in a QNN may be floating point values to best approximate the underlying value distribution. For instance, the state-of-the-art QNNs produced by HWGQ [1] use the following function for 2-bit uniform quantization:

[Image media_image2.png: HWGQ function for 2-bit uniform quantization]
”. § 2.1.1: “Quantization as successive thresholding. Given a set of threshold values t = {t0, t1. . . tn}, the successive thresholding function T(x, t) maps any real number x to an integer in the interval [0,n], where the returned integer is the number of thresholds that x is greater than or equal to”.)
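By way of a hedged illustration of how an operation parameter might modify quantization threshold parameters, the Python sketch below folds a scalar linear operation y = a*x + b into the thresholds of a successive-thresholding quantizer. The names absorb_linear, a, and b are hypothetical, the scale a is assumed to be positive, and the sketch is not represented as the method of Umuroglu, Xu, or the disclosure. Once the operation is fully absorbed, applying the modified thresholds directly to x yields the same integer codes, so the separate linear step is no longer needed.

```python
from bisect import bisect_left


def successive_threshold(x, thresholds):
    """Number of (ascending) thresholds strictly below x."""
    return bisect_left(thresholds, x)


def absorb_linear(thresholds, a, b):
    """Fold y = a * x + b into the thresholds.

    For a > 0, t < a * x + b is equivalent to (t - b) / a < x, so the
    modified thresholds applied to x give the same code as the original
    thresholds applied to y. Negative or zero scales are outside this sketch.
    """
    if a <= 0:
        raise ValueError("this sketch only handles a positive scale")
    return [(t - b) / a for t in thresholds]


original = [0.0, 0.807, 1.345]   # quoted HWGQ thresholds, assuming t0 = 0
a, b = 2.0, -0.5                 # hypothetical operation parameters
modified = absorb_linear(original, a, b)

inputs = [-1.0, 0.1, 0.3, 0.7, 1.0, 3.0]
with_layer = [successive_threshold(a * x + b, original) for x in inputs]
without_layer = [successive_threshold(x, modified) for x in inputs]
```

The test inputs are chosen away from the threshold boundaries, where floating-point rounding in the division could otherwise change a comparison.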
 
Xu modified by Umuroglu does not appear to explicitly teach:
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters; 
 
Li does, however, teach: 
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters; (Li, p. 3, § 3, ¶ 2: “The idea of our method is to merge these highly correlated layers and substitute them as a new “slim” layer from the analysis and modeling of the correlations of the current layer and preceding layers (or parallel layers).”; p. 3, § 3.1, ¶ 4: “Method The streamline slimming regenerates a new tensor layer (i.e., slim layer) by merging non-tensor layers with its bottom tensor units in the feed-forward structure.” P. 4, § 3.1, left-hand column, ¶ 1: “Pooling Layer: The pooling layer down-samples feature maps learned from previous layers. Therefore, to absorb a pooling layer to a convolution layer, we remove the pooling layer and set the stride value of the new convolution layer as the product of the stride values for both the original pooling layer and the convolution layer.”
The examiner notes that Li’s merging a non-tensor layer with its bottom tensor units to regenerate a new tensor layer teaches removing a layer (e.g., the non-tensor layer) from the at least one sub-structure, and that one or more bottom units render the claimed quantization layer obvious. The examiner further notes that Li’s merging the non-tensor layer with its bottom tensor units into a new layer teaches an operation to be performed in the layer (e.g., the aforementioned non-tensor layer) has been completely incorporated into the quantization layer (e.g., Li’s merged layer). Therefore, the examiner asserts that Li, when combined with Umuroglu and Xu, teaches the above limitation.)
 
Xu, Umuroglu, and Li are analogous because all three references pertain to reducing the computational complexity and memory footprint, especially for devices with limited compute and memory capacity such as mobile phones, by using fixed-point processing. 
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to combine Xu in view of Umuroglu to incorporate Li’s “DeepRebirth” that removes a layer from a sub-structure where an operation in the layer has been completely incorporated into a quantization layer (Li, supra). The modification accelerates the model execution at least in non-tensor layers without requiring training of the neural network from scratch (Li, p. 3, § 3, ¶ 1: “To accelerate the model execution in non-tensor layers, we propose DeepRebirth to accelerate the model execution at both streaming substructure and branching substructure. The idea of our method is to merge these highly correlated layers and substitute them as a new “slim” layer from the analysis and modeling of the correlations of the current layer and preceding layers (or parallel layers).” p. 8, § 5, ¶ 2: “In addition, our work slims a well-trained network by relearning the merged rebirth layers and does not require to train from scratch.”)
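For illustration only, the stride bookkeeping described in the quoted passage of Li (removing a pooling layer and setting the new convolution stride to the product of the two strides) could be sketched in Python as follows. The ConvSpec type and absorb_pooling function are hypothetical names introduced here; the sketch covers only the stride computation and layer removal, not the relearning of the merged layer that Li also describes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConvSpec:
    """Hypothetical description of a convolution layer."""
    kernel: int
    stride: int


def absorb_pooling(conv, pool_stride):
    """Return a new convolution spec whose stride is the product of the
    original convolution stride and the removed pooling layer's stride."""
    return replace(conv, stride=conv.stride * pool_stride)


# A sub-structure with a convolution followed by a stride-2 pooling layer.
layers = [("conv", ConvSpec(kernel=3, stride=1)), ("pool", 2)]

# Merge the pooling layer into the convolution and drop it from the list.
merged = absorb_pooling(layers[0][1], layers[1][1])
slim_layers = [("conv", merged)]
```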

With respect to claim 13, Xu teaches: 
A method for transforming a multilayer neural network model, comprising: extracting at least one sub-structure from the multilayer neural network model, (Xu, FIG. 3; ¶¶ [0053]-[0058] where Xu’s “convolution layer 300” includes “a binary sub-layer 308” that “convert[s]” “the weights 302” “to binary weights,” “an IBN sub-layer 316,” “a summing sub-layer 320,” “an activation sub-layer 322,” and “a quantization sub-layer 324”. FIG. 4 and ¶¶ [0061]-[0068] further describe the backward pass of a substantially similar convolutional layer 400 with backward propagation of an input 426 in the floating-point format. 
The examiner notes that this limitation identifies a sub-structure for fixed-point processing from a multi-layer neural network. The word “extract” carries the definitions of “to determine by calculation” or “to select” (see Merriam-Webster). Therefore, Xu’s identifying the aforementioned sub-structure for fixed-point processing teaches extracting at least one sub-structure from the multilayer neural network model.)
 
wherein each of the at least one sub-structure has a plurality of layers which include a quantization layer as a tail layer (Xu, FIG. 3; ¶ [0054]: the convolution layer 300 in FIG. 3 includes “a binary sub-layer 308,” “a hidden layer,” “an output layer,” “a normalization sub-layer 316,” “quantization sub-layer 318,” “a summing sub-layer 320,” “activation sub-layer 322,” “binary layer 324”; and ¶ [0061]: the convolutional layer 400 includes “quantization sub-layer” 416, 424, and 428 in FIG. 4. The examiner first notes the aforementioned layers teach that the at least one sub-structure has a plurality of layers. The examiner further notes that any smaller portion of Xu’s neural network that has multiple layers, ends with a quantization sub-layer (hence a “tail layer”), and is subject to fixed-point or binarization processing teaches the above limitation.)
 
incorporating, in each of the at least one sub-structure, at least part of an operation to be performed in one or more layers other than the quantization layer into quantization in the quantization layer by (Xu, FIG. 3 (annotated):

[Image media_image1.png: Xu, FIG. 3 (annotated)]

¶ [0065]: “output of the activation sub-layer 422 is provided to the summing sub-layer 420, which corresponds to the summing sub-layer 320, and the gradients of the loss function with respect to two inputs of the summing sub-layer 320 may be determined. Because an input of the sub-layer 320 is the bias, the gradient of the loss function with respect to the bias may be determined and the gradient is provided to the quantization sublayer 428”. ¶ [0051]: “the output of each layer is used as the input of the next layer starting from the head layer until transferring to the quantization layer”
The examiner thus notes that Xu’s weights and biases teach the claimed operation parameters that pertain to an operation, and that Xu’s passing an output of a layer (e.g., the product of a weight matrix and input feature vector of the layer) to the next layer and eventually to a quantization sub-layer incorporates at least part of the aforementioned convolution operation into the quantization sub-layer (e.g., 318 or 324 in FIG. 3) and hence teaches the above limitation.)
 
Xu does not appear to explicitly teach:
modifying quantization threshold parameters used for quantization in the quantization layer based on an operation parameter of the operation to be performed in the one or more layers; and  
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters. 
 
Umuroglu does, however, teach:
modifying quantization threshold parameters used for quantization in the quantization layer based on an operation parameter of the operation to be performed in the one or more layers; and (Umuroglu, § 2-(3) “Non-integer quantization levels” “2-bit uniform quantization”; § 2.1.1 “Quantization as successive thresholding” – “Given a set of threshold values t = {t0, t1 . . . tn}, the successive thresholding function T (x, t) maps any real number x to an integer in the interval [0,n], where the returned integer is the number of thresholds that x is greater than or equal to:” “T(x, t) =” “0, for x≤t0”; “1, for t0<x≤t1” … “n-1, for tn-2<x≤tn-1”; and “n, for tn-1<x”. ¶ 2, § 2.1.1, p. 2: “Any uniform quantizer Q(x) can be expressed as successive thresholding followed by a linear transformation such that Q(x) = a · T (x) + b.” 
The examiner notes that Umuroglu’s set of threshold values (“t = {t0, t1 . . . tn}” above) teaches quantization threshold parameters used for quantization in the quantization layer, and that Umuroglu’s transferring or merging the operation parameters (e.g., weights from the preceding summing sub-layer 312 or 320) with the aforementioned quantization threshold parameters for the quantization layer thus teaches modifying quantization threshold parameters based on an operation parameter (e.g., the weights 302 converted by the binary sub-layer 310 in Xu’s FIG. 3, supra) of the operation to be performed in the one or more layers (e.g., the summing sub-layer 312 or 320 in Xu’s FIG. 3, supra). Therefore, the examiner asserts that Umuroglu renders the above limitation obvious.)
 
Xu and Umuroglu are analogous because both Xu and Umuroglu pertain to reducing the computational complexity and memory footprint, especially for devices with limited compute and memory capacity such as mobile phones, by using fixed-point processing. 
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to have modified Xu’s “neural network” (¶ [0002]) with Umuroglu’s modifying quantization threshold parameters used for quantization (Umuroglu, supra). The modification not only provides the floating-point values for quantization threshold parameters to best approximate the input floating-point values to respective integers but also turns quantization of input floating-point values into simply counting the number of quantization threshold parameters that the input floating-point values are greater than or equal to (Umuroglu, § 2-(3): “Non-integer quantization levels. The chosen quantization levels in a QNN may be floating point values to best approximate the underlying value distribution. For instance, the state-of-the-art QNNs produced by HWGQ [1] use the following function for 2-bit uniform quantization:

[Image media_image2.png: HWGQ function for 2-bit uniform quantization]
”. § 2.1.1: “Quantization as successive thresholding. Given a set of threshold values t = {t0, t1. . . tn}, the successive thresholding function T(x, t) maps any real number x to an integer in the interval [0,n], where the returned integer is the number of thresholds that x is greater than or equal to”.)
 
Xu modified by Umuroglu does not appear to explicitly teach:
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters. 
 
Li does, however, teach: 
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters. (Li, p. 3, § 3, ¶ 2: “The idea of our method is to merge these highly correlated layers and substitute them as a new “slim” layer from the analysis and modeling of the correlations of the current layer and preceding layers (or parallel layers).”; p. 3, § 3.1, ¶ 4: “Method The streamline slimming regenerates a new tensor layer (i.e., slim layer) by merging non-tensor layers with its bottom tensor units in the feed-forward structure.” P. 4, § 3.1, left-hand column, ¶ 1: “Pooling Layer: The pooling layer down-samples feature maps learned from previous layers. Therefore, to absorb a pooling layer to a convolution layer, we remove the pooling layer and set the stride value of the new convolution layer as the product of the stride values for both the original pooling layer and the convolution layer.”
The examiner notes that Li’s merging a non-tensor layer with its bottom tensor units to regenerate a new tensor layer teaches removing a layer (e.g., the non-tensor layer) from the at least one sub-structure, and that one or more bottom units render the claimed quantization layer obvious. The examiner further notes that Li’s merging the non-tensor layer with its bottom tensor units into a new layer teaches an operation to be performed in the layer (e.g., the aforementioned non-tensor layer) has been completely incorporated into the quantization layer (e.g., Li’s merged layer). Therefore, the examiner asserts that Li, when combined with Umuroglu and Xu, teaches the above limitation.)
 
Xu, Umuroglu, and Li are analogous because all three references pertain to reducing the computational complexity and memory footprint, especially for devices with limited compute and memory capacity such as mobile phones, by using fixed-point processing. 
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to combine Xu in view of Umuroglu to incorporate Li’s “DeepRebirth” that removes a layer from a sub-structure where an operation in the layer has been completely incorporated into a quantization layer (Li, supra). The modification accelerates the model execution at least in non-tensor layers without requiring training of the neural network from scratch (Li, p. 3, § 3, ¶ 1: “To accelerate the model execution in non-tensor layers, we propose DeepRebirth to accelerate the model execution at both streaming substructure and branching substructure. The idea of our method is to merge these highly correlated layers and substitute them as a new “slim” layer from the analysis and modeling of the correlations of the current layer and preceding layers (or parallel layers).” p. 8, § 5, ¶ 2: “In addition, our work slims a well-trained network by relearning the merged rebirth layers and does not require to train from scratch.”)
 
With respect to claim 14, Xu teaches: 
A method for applying a multilayer neural network model, comprising: saving the multilayer neural network model in a memory, (Xu, FIG. 2 and ¶ [0028]: “the CNN 200 includes an input layer 202, convolutional layers 204 and 208, pooling layers 206 and 210, and an output layer 212”. ¶ [0128]: “The special-purpose processing device comprises: a memory module configured to store parameters of a layer of a neural network in a first fixed-point format, the parameters in the first fixed-point format having a predefined bit-width”. The examiner notes that Xu’s processing device or a portion thereof that is utilized to store its CNN teaches this limitation.)

the multilayer neural network model is generated by; extracting at least one sub-structure from the multilayer neural network model, (Xu, FIG. 3; ¶¶ [0053]-[0058] where Xu’s “convolution layer 300” includes “a binary sub-layer 308” that “convert[s]” “the weights 302” “to binary weights,” “an IBN sub-layer 316,” “a summing sub-layer 320,” “an activation sub-layer 322,” and “a quantization sub-layer 324”. FIG. 4 and ¶¶ [0061]-[0068] further describe the backward pass of a substantially similar convolutional layer 400 with backward propagation of an input 426 in the floating-point format. 
The examiner notes that this limitation identifies a sub-structure for fixed-point processing from a multi-layer neural network. The word “extract” carries the definitions of “to determine by calculation” or “to select” (see Merriam-Webster). Therefore, Xu’s identifying the aforementioned sub-structure for fixed-point processing teaches extracting at least one sub-structure from the multilayer neural network model.)
 
wherein each of the at least one sub-structure has a plurality of layers which include a quantization layer as a tail layer; (Xu, FIG. 3; ¶ [0054]: the convolution layer 300 in FIG. 3 includes “a binary sub-layer 308,” “a hidden layer,” “an output layer,” “a normalization sub-layer 316,” “quantization sub-layer 318,” “a summing sub-layer 320,” “activation sub-layer 322,” “binary layer 324”; and ¶ [0061]: the convolutional layer 400 includes “quantization sub-layer” 416, 424, and 428 in FIG. 4. The examiner first notes that the aforementioned layers teach that the at least one sub-structure has a plurality of layers. The examiner further notes that any smaller portion of Xu’s neural network that has multiple layers and ends with a quantization sub-layer (hence a “tail layer”), and to which fixed-point or binarization processing applies, teaches the above limitation.)
 
incorporating, in each of the at least one sub-structure, at least part of an operation to be performed in one or more layers other than the quantization layer into quantization in the quantization layer by (Xu, (FIG. 3 (annotated):

[Image: media_image1.png, annotated FIG. 3 of Xu]

¶ [0065]: “output of the activation sub-layer 422 is provided to the summing sub-layer 420, which corresponds to the summing sub-layer 320, and the gradients of the loss function with respect to two inputs of the summing sub-layer 320 may be determined. Because an input of the sub-layer 320 is the bias, the gradient of the loss function with respect to the bias may be determined and the gradient is provided to the quantization sublayer 428”. ¶ [0051]: “the output of each layer is used as the input of the next layer starting from the head layer until transferring to the quantization layer”
The examiner thus notes that Xu’s weights and biases teach the claimed operation parameters that pertain to an operation, and that Xu’s passing an output of a layer (e.g., the product of a weight matrix and input feature vector of the layer) to the next layer and eventually to a quantization sub-layer incorporates at least part of the aforementioned convolution operation into the quantization sub-layer (e.g., 318 or 324 in FIG. 3) and hence teaches the above limitation.)
 
inputting, to the saved multilayer neural network model, a data set corresponding to a task requirement that is executable by the multilayer neural network model; and (Xu, ¶ [0128]: “The special-purpose processing device comprises: a memory module configured to store parameters of a layer of a neural network in a first fixed-point format, the parameters in the first fixed-point format having a predefined bit-width; an interface module configured to receive an input to the layer; a data access module configured to read the parameters of the layer from the memory module; and a computing module configured to compute, based on the input of the layer and the read parameters, an output of the layer through a fixed-point operation.” ¶ [0034]: “Examples of the optimization solutions include but not limited to stochastic gradient descent algorithm, adaptive momentum estimation (ADAM) method and the like. Therefore, the errors between the classification scores obtained by the convolutional neural network and labels of each image can be lowered as much as possible for the data in the training data set.” The examiner notes that Xu’s “interface module,” “the training data set,” and optimization to lower the errors respectively teach an inputting module, a data set, and a task requirement.  The examiner further notes that Xu’s performing optimization on its CNN to lower the errors teaches the task requirement that is executable by the multilayer neural network model.)
 
operating on the data set in each of layers from top to bottom in the multilayer neural network model and outputting results. (Xu, ¶ [0128]: “The special-purpose processing device comprises: a memory module configured to store parameters of a layer of a neural network in a first fixed-point format, the parameters in the first fixed-point format having a predefined bit-width; an interface module configured to receive an input to the layer; a data access module configured to read the parameters of the layer from the memory module; and a computing module configured to compute, based on the input of the layer and the read parameters, an output of the layer through a fixed-point operation.” ¶ [0034]: “training data set”; ¶ [0098]: “data set CIRFA-30”. FIG. 2 and ¶ [0028]: “the CNN 200 includes an input layer 202, convolutional layers 204 and 208, pooling layers 206 and 210, and an output layer 212”. The examiner notes that Xu’s “computing module” that computes the output based on the input and the read parameters for the aforementioned data teaches an operating module.  Xu’s FIG. 2 further teaches operating on a data set in each of the input layer 202, the convolutional layers 204 and 208, and the pooling layers 206 and 210, from top to bottom, to output results at the output layer 212.)
 
Xu does not appear to explicitly teach:
modifying quantization threshold parameters used for quantization in the quantization layer based on an operation parameter of the operation to be performed in the one or more layers; and 
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters; 
 
Umuroglu does, however, teach:
modifying quantization threshold parameters used for quantization in the quantization layer based on an operation parameter of the operation to be performed in the one or more layers; and (Umuroglu, § 2-(3) “Non-integer quantization levels” “2-bit uniform quantization”; § 2.1.1 “Quantization as successive thresholding” – “Given a set of threshold values t = {t0, t1 . . . tn}, the successive thresholding function T (x, t) maps any real number x to an integer in the interval [0,n], where the returned integer is the number of thresholds that x is greater than or equal to:” “T(x, t) =” “0, for x≤t0”; “1, for t0<x≤t1” … “n-1, for tn-2<x≤tn-1”; and “n, for tn-1<x”. ¶ 2, § 2.1.1, p. 2: “Any uniform quantizer Q(x) can be expressed as successive thresholding followed by a linear transformation such that Q(x) = a · T (x) + b.” 
The examiner notes that Umuroglu’s set of threshold values (“t = {t0, t1 . . . tn}” above) teaches quantization threshold parameters used for quantization in the quantization layer, and that Umuroglu’s transferring or merging the operation parameters (e.g., weights from the preceding summing sub-layer 312 or 320) with the aforementioned quantization threshold parameters for the quantization layer thus teaches modifying quantization threshold parameters based on an operation parameter (e.g., the weights 302 converted by the binary sub-layer 310 in Xu’s FIG. 3, supra) of the operation to be performed in the one or more layers (e.g., the summing sub-layer 312 or 320 in Xu’s FIG. 3, supra). Therefore, the examiner asserts that Umuroglu renders the above limitation obvious.)
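For illustration only, the successive thresholding quoted above from Umuroglu, § 2.1.1 may be sketched as follows. The sketch follows the quoted prose (“the number of thresholds that x is greater than or equal to”) and the quoted relation Q(x) = a · T(x) + b. The function names and the numeric values are hypothetical and do not appear in any cited reference.

```python
from bisect import bisect_right


def successive_threshold(x, thresholds):
    """Return the number of thresholds that x is greater than or equal to.

    Assumes `thresholds` is sorted in ascending order. bisect_right returns
    the count of elements that are <= x, which is exactly that number.
    """
    return bisect_right(thresholds, x)


def uniform_quantize(x, thresholds, a, b):
    """Uniform quantizer expressed as thresholding plus a linear transform."""
    return a * successive_threshold(x, thresholds) + b


# Hypothetical thresholds and scale chosen only for this illustration.
thresholds = [0.5, 1.5, 2.5]
levels = [successive_threshold(x, thresholds) for x in (0.0, 0.5, 1.0, 3.0)]
```

Under these assumed values, quantization reduces to a count of thresholds followed by one multiply and one add.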
 
Xu and Umuroglu are analogous because both Xu and Umuroglu pertain to reducing the computational complexity and memory footprint, especially for devices with limited compute and memory capacity such as mobile phones, by using fixed-point processing. 
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to have modified Xu’s “neural network” (¶ [0002]) with Umuroglu’s modifying quantization threshold parameters used for quantization (Umuroglu, supra). The modification not only provides the floating-point values for quantization threshold parameters to best approximate the input floating-point values to respective integers but also turns quantization of input floating-point values into simply counting the number of quantization threshold parameters that the input floating-point values are greater than or equal to (Umuroglu, § 2-(3): “Non-integer quantization levels. The chosen quantization levels in a QNN may be floating point values to best approximate the underlying value distribution. For instance, the state-of-the-art QNNs produced by HWGQ [1] use the following function for 2-bit uniform quantization:

[Image: media_image2.png, the HWGQ 2-bit uniform quantization function quoted from Umuroglu]
”. § 2.1.1: “Quantization as successive thresholding. Given a set of threshold values t = {t0, t1. . . tn}, the successive thresholding function T(x, t) maps any real number x to an integer in the interval [0,n], where the returned integer is the number of thresholds that x is greater than or equal to”.)
 
Xu modified by Umuroglu does not appear to explicitly teach:
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters; 
 
Li does, however, teach: 
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters; (Li, p. 3, § 3, ¶ 2: “The idea of our method is to merge these highly correlated layers and substitute them as a new “slim” layer from the analysis and modeling of the correlations of the current layer and preceding layers (or parallel layers).”; p. 3, § 3.1, ¶ 4: “Method The streamline slimming regenerates a new tensor layer (i.e., slim layer) by merging non-tensor layers with its bottom tensor units in the feed-forward structure.” P. 4, § 3.1, left-hand column, ¶ 1: “Pooling Layer: The pooling layer down-samples feature maps learned from previous layers. Therefore, to absorb a pooling layer to a convolution layer, we remove the pooling layer and set the stride value of the new convolution layer as the product of the stride values for both the original pooling layer and the convolution layer.”
The examiner notes that Li’s merging a non-tensor layer with its bottom tensor units to regenerate a new tensor layer teaches removing a layer (e.g., the non-tensor layer) from the at least one sub-structure, and that one or more bottom units render the claimed quantization layer obvious. The examiner further notes that Li’s merging the non-tensor layer with its bottom tensor units into a new layer teaches an operation to be performed in the layer (e.g., the aforementioned non-tensor layer) has been completely incorporated into the quantization layer (e.g., Li’s merged layer). Therefore, the examiner asserts that Li, when combined with Umuroglu and Xu, teaches the above limitation.)
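For illustration only, the stride bookkeeping in the passage quoted above from Li, § 3.1 (“set the stride value of the new convolution layer as the product of the stride values”) may be sketched as follows. The class and function names are hypothetical. The sketch covers only the structural step of removing the pooling layer; Li additionally describes relearning the merged layer, which is not modeled here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConvLayer:
    kernel: int
    stride: int


@dataclass(frozen=True)
class PoolLayer:
    kernel: int
    stride: int


def absorb_pooling(conv, pool):
    """Return a new convolution layer that stands in for conv followed by pool.

    The pooling layer is dropped and the new stride is the product of the
    two original strides, per the quoted passage.
    """
    return replace(conv, stride=conv.stride * pool.stride)


# Hypothetical layer settings chosen only for this illustration.
merged = absorb_pooling(ConvLayer(kernel=3, stride=1), PoolLayer(kernel=2, stride=2))
```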
 
Xu, Umuroglu, and Li are analogous because all three references pertain to reducing the computational complexity and memory footprint, especially for devices with limited compute and memory capacity such as mobile phones, by using fixed-point processing. 
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to combine Xu in view of Umuroglu to incorporate Li’s “DeepRebirth” that removes a layer from a sub-structure where an operation in the layer has been completely incorporated into a quantization layer (Li, supra). The modification accelerates the model execution at least in non-tensor layers without requiring that the neural network be trained from scratch (Li, p. 3, § 3, ¶ 1: “To accelerate the model execution in non-tensor layers, we propose DeepRebirth to accelerate the model execution at both streaming substructure and branching substructure. The idea of our method is to merge these highly correlated layers and substitute them as a new “slim” layer from the analysis and modeling of the correlations of the current layer and preceding layers (or parallel layers).” p. 8, § 5, ¶ 2: “In addition, our work slims a well-trained network by relearning the merged rebirth layers and does not require to train from scratch.”)
  
With respect to claim 15, Xu teaches: 
A non-transitory computer readable storage medium storing instructions for causing a computer to perform a method for transforming a multilayer neural network model when executed by the computer, the method comprising: (Xu, ¶ [0020]: “The memory 102 may be implemented by various storage media, including but not limited to volatile and non-volatile medium, and removable and non-removable medium.” FIG. 3; ¶¶ [0053]-[0058] where Xu’s “convolution layer 300” includes “a binary sub-layer 308” that “convert[s]” “the weights 302” “to binary weights,” “an IBN sub-layer 316,” “a summing sub-layer 320,” “an activation sub-layer 322,” and “a quantization sub-layer 324”. FIG. 4 and ¶¶ [0061]-[0068] further describe the backward pass of a substantially similar convolutional layer 400 with backward propagation of an input 426 in the floating-point format. The examiner notes that this limitation identifies a sub-structure for fixed-point processing from a multi-layer neural network, and that any portion of Xu’s neural network to which fixed-point processing applies thus teaches this limitation.)
 
extracting at least one sub-structure from the multilayer neural network model, (Xu, FIG. 3; ¶¶ [0053]-[0058] where Xu’s “convolution layer 300” includes “a binary sub-layer 308” that “convert[s]” “the weights 302” “to binary weights,” “an IBN sub-layer 316,” “a summing sub-layer 320,” “an activation sub-layer 322,” and “a quantization sub-layer 324”. FIG. 4 and ¶¶ [0061]-[0068] further describe the backward pass of a substantially similar convolutional layer 400 with backward propagation of an input 426 in the floating-point format. 
The examiner notes that this limitation identifies a sub-structure for fixed-point processing from a multi-layer neural network. The word “extract” carries the definitions of “to determine by calculation” or “to select” (see Merriam-Webster). Therefore, Xu’s identifying the aforementioned sub-structure for fixed-point processing teaches extracting at least one sub-structure from the multilayer neural network model.)
 
wherein each of the at least one sub-structure has a plurality of layers which include a quantization layer as a tail layer (Xu, FIG. 3; ¶ [0054]: the convolution layer 300 in FIG. 3 includes “a binary sub-layer 308,” “a hidden layer,” “an output layer,” “a normalization sub-layer 316,” “quantization sub-layer 318,” “a summing sub-layer 320,” “activation sub-layer 322,” “binary layer 324”; and ¶ [0061]: the convolutional layer 400 includes “quantization sub-layer” 416, 424, and 428 in FIG. 4. The examiner first notes that the aforementioned layers teach that the at least one sub-structure has a plurality of layers. The examiner further notes that any smaller portion of Xu’s neural network that has multiple layers and ends with a quantization sub-layer (hence a “tail layer”), and to which fixed-point or binarization processing applies, teaches the above limitation.)
 
incorporating, in each of the at least one sub-structure, at least part of an operation to be performed in one or more layers other than the quantization layer into quantization in the quantization layer by (Xu, (FIG. 3 (annotated):

[Image: media_image1.png, annotated FIG. 3 of Xu]

¶ [0065]: “output of the activation sub-layer 422 is provided to the summing sub-layer 420, which corresponds to the summing sub-layer 320, and the gradients of the loss function with respect to two inputs of the summing sub-layer 320 may be determined. Because an input of the sub-layer 320 is the bias, the gradient of the loss function with respect to the bias may be determined and the gradient is provided to the quantization sublayer 428”. ¶ [0051]: “the output of each layer is used as the input of the next layer starting from the head layer until transferring to the quantization layer”
The examiner thus notes that Xu’s weights and biases teach the claimed operation parameters that pertain to an operation, and that Xu’s passing an output of a layer (e.g., the product of a weight matrix and input feature vector of the layer) to the next layer and eventually to a quantization sub-layer incorporates at least part of the aforementioned convolution operation into the quantization sub-layer (e.g., 318 or 324 in FIG. 3) and hence teaches the above limitation.)
 
Xu does not appear to explicitly teach:
modifying quantization threshold parameters used for quantization in the quantization layer based on an operation parameter of the operation to be performed in the one or more layers; and 
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters. 
 
Umuroglu does, however, teach:
modifying quantization threshold parameters used for quantization in the quantization layer based on an operation parameter of the operation to be performed in the one or more layers; and (Umuroglu, § 2-(3) “Non-integer quantization levels” “2-bit uniform quantization”; § 2.1.1 “Quantization as successive thresholding” – “Given a set of threshold values t = {t0, t1 . . . tn}, the successive thresholding function T (x, t) maps any real number x to an integer in the interval [0,n], where the returned integer is the number of thresholds that x is greater than or equal to:” “T(x, t) =” “0, for x≤t0”; “1, for t0<x≤t1” … “n-1, for tn-2<x≤tn-1”; and “n, for tn-1<x”. ¶ 2, § 2.1.1, p. 2: “Any uniform quantizer Q(x) can be expressed as successive thresholding followed by a linear transformation such that Q(x) = a · T (x) + b.” 
The examiner notes that Umuroglu’s set of threshold values (“t = {t0, t1 . . . tn}” above) teaches quantization threshold parameters used for quantization in the quantization layer, and that Umuroglu’s transferring or merging the operation parameters (e.g., weights from the preceding summing sub-layer 312 or 320) with the aforementioned quantization threshold parameters for the quantization layer thus teaches modifying quantization threshold parameters based on an operation parameter (e.g., the weights 302 converted by the binary sub-layer 310 in Xu’s FIG. 3, supra) of the operation to be performed in the one or more layers (e.g., the summing sub-layer 312 or 320 in Xu’s FIG. 3, supra). Therefore, the examiner asserts that Umuroglu renders the above limitation obvious.)
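For illustration only, the general idea of modifying quantization thresholds based on a parameter of a preceding operation may be sketched as follows. The sketch assumes a preceding affine operation y = scale · x + bias with a positive scale, which is an assumption made solely for this example and is not a quotation from Xu, Umuroglu, or Li. Because scale · x + bias ≥ t is equivalent to x ≥ (t − bias) / scale when scale > 0, counting thresholds on y gives the same result as counting adjusted thresholds on x. All names and values are hypothetical.

```python
from bisect import bisect_right


def count_thresholds(x, thresholds):
    """Number of ascending-sorted thresholds that x is greater than or equal to."""
    return bisect_right(thresholds, x)


def fold_affine_into_thresholds(thresholds, scale, bias):
    """Adjust thresholds so the affine step y = scale * x + bias can be skipped.

    Only valid for scale > 0; a negative scale would reverse the comparison
    and is outside the scope of this sketch.
    """
    if scale <= 0:
        raise ValueError("this sketch only handles a positive scale")
    return [(t - bias) / scale for t in thresholds]


# Hypothetical values, chosen to be exactly representable in floating point.
thresholds = [1.0, 3.0, 5.0]
scale, bias = 2.0, 1.0
folded = fold_affine_into_thresholds(thresholds, scale, bias)

inputs = [-1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0]
with_affine = [count_thresholds(scale * x + bias, thresholds) for x in inputs]
without_affine = [count_thresholds(x, folded) for x in inputs]
```

Under these assumptions the two lists agree, so the affine step contributes nothing that the adjusted thresholds do not already capture.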
Xu and Umuroglu are analogous because both Xu and Umuroglu pertain to reducing the computational complexity and memory footprint, especially for devices with limited compute and memory capacity such as mobile phones, by using fixed-point processing. 
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to have modified Xu’s “neural network” (¶ [0002]) with Umuroglu’s modifying quantization threshold parameters used for quantization (Umuroglu, supra). The modification not only provides the floating-point values for quantization threshold parameters to best approximate the input floating-point values to respective integers but also turns quantization of input floating-point values into simply counting the number of quantization threshold parameters that the input floating-point values are greater than or equal to (Umuroglu, § 2-(3): “Non-integer quantization levels. The chosen quantization levels in a QNN may be floating point values to best approximate the underlying value distribution. For instance, the state-of-the-art QNNs produced by HWGQ [1] use the following function for 2-bit uniform quantization:

[Image: media_image2.png, the HWGQ 2-bit uniform quantization function quoted from Umuroglu]
”. § 2.1.1: “Quantization as successive thresholding. Given a set of threshold values t = {t0, t1. . . tn}, the successive thresholding function T(x, t) maps any real number x to an integer in the interval [0,n], where the returned integer is the number of thresholds that x is greater than or equal to”.)
 
Xu modified by Umuroglu does not appear to explicitly teach:
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters. 
 
Li does, however, teach: 
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters. (Li, p. 3, § 3, ¶ 2: “The idea of our method is to merge these highly correlated layers and substitute them as a new “slim” layer from the analysis and modeling of the correlations of the current layer and preceding layers (or parallel layers).”; p. 3, § 3.1, ¶ 4: “Method The streamline slimming regenerates a new tensor layer (i.e., slim layer) by merging non-tensor layers with its bottom tensor units in the feed-forward structure.” P. 4, § 3.1, left-hand column, ¶ 1: “Pooling Layer: The pooling layer down-samples feature maps learned from previous layers. Therefore, to absorb a pooling layer to a convolution layer, we remove the pooling layer and set the stride value of the new convolution layer as the product of the stride values for both the original pooling layer and the convolution layer.”
The examiner notes that Li’s merging a non-tensor layer with its bottom tensor units to regenerate a new tensor layer teaches removing a layer (e.g., the non-tensor layer) from the at least one sub-structure, and that one or more bottom units render the claimed quantization layer obvious. The examiner further notes that Li’s merging the non-tensor layer with its bottom tensor units into a new layer teaches an operation to be performed in the layer (e.g., the aforementioned non-tensor layer) has been completely incorporated into the quantization layer (e.g., Li’s merged layer). Therefore, the examiner asserts that Li, when combined with Umuroglu and Xu, teaches the above limitation.)
 
Xu, Umuroglu, and Li are analogous because all three references pertain to reducing the computational complexity and memory footprint, especially for devices with limited compute and memory capacity such as mobile phones, by using fixed-point processing. 
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to combine Xu in view of Umuroglu to incorporate Li’s “DeepRebirth” that removes a layer from a sub-structure where an operation in the layer has been completely incorporated into a quantization layer (Li, supra). The modification accelerates the model execution at least in non-tensor layers without requiring that the neural network be trained from scratch (Li, p. 3, § 3, ¶ 1: “To accelerate the model execution in non-tensor layers, we propose DeepRebirth to accelerate the model execution at both streaming substructure and branching substructure. The idea of our method is to merge these highly correlated layers and substitute them as a new “slim” layer from the analysis and modeling of the correlations of the current layer and preceding layers (or parallel layers).” p. 8, § 5, ¶ 2: “In addition, our work slims a well-trained network by relearning the merged rebirth layers and does not require to train from scratch.”)
 
With respect to claim 16, Xu teaches: 
A non-transitory computer readable storage medium storing instructions for causing a computer to perform a method for applying a multilayer neural network model when executed by the computer, the method comprising: (Xu, ¶ [0020]: “The memory 102 may be implemented by various storage media, including but not limited to volatile and non-volatile medium, and removable and non-removable medium.” FIG. 3; ¶¶ [0053]-[0058] where Xu’s “convolution layer 300” includes “a binary sub-layer 308” that “convert[s]” “the weights 302” “to binary weights,” “an IBN sub-layer 316,” “a summing sub-layer 320,” “an activation sub-layer 322,” and “a quantization sub-layer 324”. FIG. 4 and ¶¶ [0061]-[0068] further describe the backward pass of a substantially similar convolutional layer 400 with backward propagation of an input 426 in the floating-point format. The examiner notes that this limitation identifies a sub-structure for fixed-point processing from a multi-layer neural network, and that any portion of Xu’s neural network to which fixed-point processing applies thus teaches this limitation.)
 
saving the multilayer neural network model in a memory, (Xu, FIG. 2 and ¶ [0028]: “the CNN 200 includes an input layer 202, convolutional layers 204 and 208, pooling layers 206 and 210, and an output layer 212”. ¶ [0128]: “The special-purpose processing device comprises: a memory module configured to store parameters of a layer of a neural network in a first fixed-point format, the parameters in the first fixed-point format having a predefined bit-width”. The examiner notes that Xu’s processing device or a portion thereof that is utilized to store its CNN teaches this limitation.)
 
the multilayer neural network model is generated by; extracting at least one sub-structure from the multilayer neural network model, (Xu, FIG. 3; ¶¶ [0053]-[0058] where Xu’s “convolution layer 300” includes “a binary sub-layer 308” that “convert[s]” “the weights 302” “to binary weights,” “an IBN sub-layer 316,” “a summing sub-layer 320,” “an activation sub-layer 322,” and “a quantization sub-layer 324”. FIG. 4 and ¶¶ [0061]-[0068] further describe the backward pass of a substantially similar convolutional layer 400 with backward propagation of an input 426 in the floating-point format. 
The examiner notes that this limitation identifies a sub-structure for fixed-point processing from a multi-layer neural network. The word “extract” carries the definitions of “to determine by calculation” or “to select” (see Merriam-Webster). Therefore, Xu’s identifying the aforementioned sub-structure for fixed-point processing teaches extracting at least one sub-structure from the multilayer neural network model.)
 
wherein each of the at least one sub-structure has a plurality of layers which include a quantization layer as a tail layer; (Xu, FIG. 3; ¶ [0054]: the convolution layer 300 in FIG. 3 includes “a binary sub-layer 308,” “a hidden layer,” “an output layer,” “a normalization sub-layer 316,” “quantization sub-layer 318,” “a summing sub-layer 320,” “activation sub-layer 322,” “binary layer 324”; and ¶ [0061]: the convolutional layer 400 includes “quantization sub-layer” 416, 424, and 428 in FIG. 4. The examiner first notes that the aforementioned layers teach that the at least one sub-structure has a plurality of layers. The examiner further notes that any smaller portion of Xu’s neural network that has multiple layers and ends with a quantization sub-layer (hence a “tail layer”), and to which fixed-point or binarization processing applies, teaches the above limitation.)
 
incorporating, in each of the at least one sub-structure, at least part of an operation to be performed in one or more layers other than the quantization layer into quantization in the quantization layer by (Xu, (FIG. 3 (annotated):

[Image: media_image1.png, annotated FIG. 3 of Xu]

¶ [0065]: “output of the activation sub-layer 422 is provided to the summing sub-layer 420, which corresponds to the summing sub-layer 320, and the gradients of the loss function with respect to two inputs of the summing sub-layer 320 may be determined. Because an input of the sub-layer 320 is the bias, the gradient of the loss function with respect to the bias may be determined and the gradient is provided to the quantization sublayer 428”. ¶ [0051]: “the output of each layer is used as the input of the next layer starting from the head layer until transferring to the quantization layer”
The examiner thus notes that Xu’s weights and biases teach the claimed operation parameters that pertain to an operation, and that Xu’s passing an output of a layer (e.g., the product of a weight matrix and input feature vector of the layer) to the next layer and eventually to a quantization sub-layer incorporates at least part of the aforementioned convolution operation into the quantization sub-layer (e.g., 318 or 324 in FIG. 3) and hence teaches the above limitation.)
 
inputting, to the saved multilayer neural network model, a data set corresponding to a task requirement that is executable by the multilayer neural network model; and (Xu, ¶ [0128]: “The special-purpose processing device comprises: a memory module configured to store parameters of a layer of a neural network in a first fixed-point format, the parameters in the first fixed-point format having a predefined bit-width; an interface module configured to receive an input to the layer; a data access module configured to read the parameters of the layer from the memory module; and a computing module configured to compute, based on the input of the layer and the read parameters, an output of the layer through a fixed-point operation.” ¶ [0034]: “Examples of the optimization solutions include but not limited to stochastic gradient descent algorithm, adaptive momentum estimation (ADAM) method and the like. Therefore, the errors between the classification scores obtained by the convolutional neural network and labels of each image can be lowered as much as possible for the data in the training data set.” The examiner notes that Xu’s “interface module,” “the training data set,” and optimization to lower the errors respectively teach an inputting module, a data set, and a task requirement.  The examiner further notes that Xu’s performing optimization on its CNN to lower the errors teaches the task requirement that is executable by the multilayer neural network model.)
 
operating on the data set in each of layers from top to bottom in the multilayer neural network model and outputting results. (Xu, ¶ [0128]: “The special-purpose processing device comprises: a memory module configured to store parameters of a layer of a neural network in a first fixed-point format, the parameters in the first fixed-point format having a predefined bit-width; an interface module configured to receive an input to the layer; a data access module configured to read the parameters of the layer from the memory module; and a computing module configured to compute, based on the input of the layer and the read parameters, an output of the layer through a fixed-point operation.” ¶ [0034]: “training data set”; ¶ [0098]: “data set CIRFA-30”; FIG. 2 and ¶ [0028]: “the CNN 200 includes an input layer 202, convolutional layers 204 and 208, pooling layers 206 and 210, and an output layer 212”. The examiner notes that Xu’s “computing module” that computes the output based on the input and the read parameters for the aforementioned data teaches an operating module.  Xu’s FIG. 2 further teaches operating on a data set in each of the input layer 202, the convolutional layers 204 and 208, and the pooling layers 206 and 210, from top to bottom, to output results at the output layer 212.)
 
Xu does not appear to explicitly teach:
modifying quantization threshold parameters used for quantization in the quantization layer based on an operation parameter of the operation to be performed in the one or more layers; and 
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters; 
 
Umuroglu does, however, teach:
modifying quantization threshold parameters used for quantization in the quantization layer based on an operation parameter of the operation to be performed in the one or more layers; and (Umuroglu, § 2-(3) “Non-integer quantization levels” “2-bit uniform quantization”; § 2.1.1 “Quantization as successive thresholding” – “Given a set of threshold values t = {t0, t1 . . . tn}, the successive thresholding function T (x, t) maps any real number x to an integer in the interval [0,n], where the returned integer is the number of thresholds that x is greater than or equal to:” “T(x, t) =” “0, for x≤t0”; “1, for t0<x≤t1” … “n-1, for tn-2<x≤tn-1”; and “n, for tn-1<x”. ¶ 2, § 2.1.1, p. 2: “Any uniform quantizer Q(x) can be expressed as successive thresholding followed by a linear transformation such that Q(x) = a · T (x) + b.” 
The examiner notes that Umuroglu’s set of threshold values (“t = {t0, t1 . . . tn}” above) teaches quantization threshold parameters used for quantization in the quantization layer, and that Umuroglu’s transferring or merging the operation parameters (e.g., weights from the preceding summing sub-layer 312 or 320) with the aforementioned quantization threshold parameters for the quantization layer teaches modifying quantization threshold parameters based on an operation parameter (e.g., the weights 302 converted by the binary sub-layer 310 in Xu’s FIG. 3, supra) of the operation to be performed in the one or more layers (e.g., the summing sub-layer 312 or 320 in Xu’s FIG. 3, supra). Therefore, the examiner asserts that Umuroglu renders the above limitation obvious.)
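For illustration only, the successive thresholding quoted above can be sketched in Python as follows. This is a minimal sketch and is not taken from Umuroglu or from the application: the function names, the sample threshold values, and the values of a and b are hypothetical, and the boundary convention (counting a threshold when x is greater than or equal to it) follows the prose definition quoted above.

```python
from bisect import bisect_right


def successive_threshold(x, thresholds):
    """Count the thresholds t_i for which x >= t_i.

    Assumes `thresholds` is sorted in ascending order.
    """
    # bisect_right returns the number of elements that are <= x.
    return bisect_right(thresholds, x)


def uniform_quantize(x, thresholds, a, b):
    """Uniform quantizer expressed as Q(x) = a * T(x, t) + b."""
    return a * successive_threshold(x, thresholds) + b


# Hypothetical thresholds and linear-transformation parameters.
example_thresholds = [0.5, 1.5, 2.5]
example_level = uniform_quantize(2.0, example_thresholds, a=2.0, b=1.0)
```

With these hypothetical values, an input of 2.0 exceeds two of the three thresholds, so the quantizer returns 2.0 * 2 + 1.0.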
Xu and Umuroglu are analogous because both Xu and Umuroglu pertain to reducing the computational complexity and memory footprint, especially for devices with limited compute and memory capacity such as mobile phones, by using fixed-point processing.
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to have modified Xu’s “neural network” (¶ [0002]) with Umuroglu’s modifying quantization threshold parameters used for quantization (Umuroglu, supra). The modification not only provides the floating-point values for quantization threshold parameters to best approximate the input floating-point values to respective integers but also turns quantization of input floating-point values into simply counting the number of quantization threshold parameters that the input floating-point values are greater than or equal to (Umuroglu, § 2-(3): “Non-integer quantization levels. The chosen quantization levels in a QNN may be floating point values to best approximate the underlying value distribution. For instance, the state-of-the-art QNNs produced by HWGQ [1] use the following function for 2-bit uniform quantization:

[Image media_image2.png: equation for the 2-bit uniform quantization function used by HWGQ, as reproduced in Umuroglu, § 2-(3)]
”. § 2.1.1: “Quantization as successive thresholding. Given a set of threshold values t = {t0, t1. . . tn}, the successive thresholding function T(x, t) maps any real number x to an integer in the interval [0,n], where the returned integer is the number of thresholds that x is greater than or equal to”.)
 
Xu modified by Umuroglu does not appear to explicitly teach:
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters; 
 
Li does, however, teach: 
removing a layer from the at least one sub-structure, if an operation to be performed in the layer has been completely incorporated into the quantization in the quantization layer by modifying the quantization threshold parameters; (Li, p. 3, § 3, ¶ 2: “The idea of our method is to merge these highly correlated layers and substitute them as a new “slim” layer from the analysis and modeling of the correlations of the current layer and preceding layers (or parallel layers).”; p. 3, § 3.1, ¶ 4: “Method The streamline slimming regenerates a new tensor layer (i.e., slim layer) by merging non-tensor layers with its bottom tensor units in the feed-forward structure.” P. 4, § 3.1, left-hand column, ¶ 1: “Pooling Layer: The pooling layer down-samples feature maps learned from previous layers. Therefore, to absorb a pooling layer to a convolution layer, we remove the pooling layer and set the stride value of the new convolution layer as the product of the stride values for both the original pooling layer and the convolution layer.”
The examiner notes that Li’s merging a non-tensor layer with its bottom tensor units to regenerate a new tensor layer teaches removing a layer (e.g., the non-tensor layer) from the at least one sub-structure, and that one or more bottom units render the claimed quantization layer obvious. The examiner further notes that Li’s merging the non-tensor layer with its bottom tensor units into a new layer teaches an operation to be performed in the layer (e.g., the aforementioned non-tensor layer) has been completely incorporated into the quantization layer (e.g., Li’s merged layer). Therefore, the examiner asserts that Li, when combined with Umuroglu and Xu, teaches the above limitation.)
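For illustration only, the stride bookkeeping described in the quoted passage of Li (removing the pooling layer and setting the stride of the new convolution layer to the product of the two stride values) can be sketched as follows. This is a minimal sketch under stated assumptions: the `Layer` record and the `absorb_pooling` helper are hypothetical names, not Li's code, and the sketch covers only the structural change, not the relearning of the merged layer that Li describes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Layer:
    kind: str    # "conv" or "pool" (hypothetical labels)
    stride: int


def absorb_pooling(layers):
    """Remove each pooling layer that directly follows a convolution layer.

    The pooling stride is multiplied into the stride of that convolution
    layer, so the layer list becomes shorter.
    """
    merged = []
    for layer in layers:
        if layer.kind == "pool" and merged and merged[-1].kind == "conv":
            previous = merged[-1]
            merged[-1] = replace(previous, stride=previous.stride * layer.stride)
        else:
            merged.append(layer)
    return merged


# Hypothetical four-layer sub-structure: conv, pool, conv, pool.
network = [Layer("conv", 1), Layer("pool", 2), Layer("conv", 1), Layer("pool", 2)]
slimmed = absorb_pooling(network)
```

In this hypothetical sub-structure, the two pooling layers are removed and each remaining convolution layer carries a stride of 2.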
 
Xu, Umuroglu, and Li are analogous because all three references pertain to reducing the computational complexity and memory footprint, especially for devices with limited compute and memory capacity such as mobile phones, by using fixed-point processing. 
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to have modified Xu in view of Umuroglu to incorporate Li’s “DeepRebirth” that removes a layer from a sub-structure where an operation in the layer has been completely incorporated into a quantization layer (Li, supra). The modification accelerates the model execution at least in non-tensor layers without requiring training the neural network from scratch (Li, p. 3, § 3, ¶ 1: “To accelerate the model execution in non-tensor layers, we propose DeepRebirth to accelerate the model execution at both streaming substructure and branching substructure. The idea of our method is to merge these highly correlated layers and substitute them as a new “slim” layer from the analysis and modeling of the correlations of the current layer and preceding layers (or parallel layers).” p. 8, § 5, ¶ 2: “In addition, our work slims a well-trained network by relearning the merged rebirth layers and does not require to train from scratch.”)
 
Claim(s) 6 is/are rejected under 35 U.S.C. 103 as being unpatentable over Xu et al. WO 2018/140294 effectively filed on January 25, 2017 (hereinafter Xu) in view of Umuroglu et al. Streamlined Deployment for Quantized Neural Networks, Sept. 12, 2017 (hereinafter Umuroglu) and Li et al., DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices (August 16, 2017) (hereinafter Li) and further in view of Ioffe et al., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (March 2, 2015) (hereinafter Ioffe).
With respect to claim 6, Xu modified by Umuroglu and Li teaches the apparatus according to claim 2 but does not appear to teach: 
wherein the quantization threshold parameters in the quantization layer are modified based on an offset parameter used for convolution in the binary convolution layer. 
Ioffe does, however, teach:
wherein the quantization threshold parameters in the quantization layer are modified based on an offset parameter used for convolution in the binary convolution layer. (Ioffe, Algorithm 1, § 3, p. 3:

[Image media_image3.png: Ioffe, Algorithm 1]

where γ and β respectively denote the scaling factor and the shift factor so that “[t]he scaled and shifted values y are passed to other network layers.” Pp. 4-5, § 3.2, ¶ 1: “Batch Normalization can be applied to any set of activations in the network. Here, we focus on transforms that consist of an affine transformation followed by an element-wise nonlinearity: z = g(Wu + b) where W and b are learned parameters of the model, and g() is the nonlinearity such as sigmoid or ReLU.”
The examiner notes that Ioffe’s shift parameter that is used for shifting the input to a batch normalization layer to produce a shifted input teaches an offset parameter. The examiner further notes that Ioffe’s applying the shift factor to the convolution output (Wu in the affine transformation above) teaches that the offset parameter is used for convolution. Therefore, Ioffe, when combined with Xu, Li, and Umuroglu as delineated in claim 1, supra, teaches the above limitation.)
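For illustration only, the scale and shift applied by batch normalization (normalizing each value by the mini-batch mean and variance, then scaling by γ and shifting by β) can be sketched as follows. This is a minimal sketch for a single scalar feature: the function name, the default epsilon, and the sample inputs are hypothetical and are not taken from Ioffe.

```python
import math


def batch_norm(values, gamma, beta, eps=1e-5):
    """Normalize a mini-batch of one scalar feature, then scale and shift.

    gamma is the scaling factor and beta is the shift (offset) factor.
    eps is a small constant added to the variance for numerical stability.
    """
    count = len(values)
    mean = sum(values) / count
    variance = sum((v - mean) ** 2 for v in values) / count
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]


# Hypothetical mini-batch and parameters.
shifted = batch_norm([1.0, 2.0, 3.0], gamma=2.0, beta=5.0)
```

Because the normalized values have a mean of zero, the outputs in this hypothetical example are centered on the shift factor of 5.0.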
Xu, Umuroglu, Li, and Ioffe are analogous because all four references pertain to neural networks.  
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to have modified Xu in view of Umuroglu and Li to incorporate Ioffe’s use of an offset parameter (Ioffe, supra). The modification avoids the costly full whitening of each layer’s input and solves the deficiencies of conventional normalization by introducing a scale parameter and an offset parameter (Ioffe, p. 3, § 3, ¶ 1: “Since the full whitening of each layer’s inputs is costly and not everywhere differentiable, we make two necessary simplifications. The first is that instead of whitening the features in layer inputs and outputs jointly, we will normalize each scalar feature independently, by making it have the mean of zero and the variance of 1.” p. 3, § 3, ¶ 2: “Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation x(k), a pair of parameters γ(k), β(k), which scale and shift the normalized value”.)
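For illustration only, the arithmetic by which an offset can be moved into a set of quantization thresholds can be sketched as follows. The sketch relies on the identity that x + b is greater than or equal to t exactly when x is greater than or equal to t − b. It is not presented as the method of any cited reference or of the application, and the function names and numeric values are hypothetical.

```python
from bisect import bisect_right


def count_thresholds(x, thresholds):
    """Count the thresholds t_i for which x >= t_i (ascending thresholds)."""
    return bisect_right(thresholds, x)


def fold_offset(thresholds, offset):
    """Return thresholds adjusted so the offset need not be added to x.

    count_thresholds(x + offset, thresholds) equals
    count_thresholds(x, fold_offset(thresholds, offset)) in exact arithmetic.
    """
    return [t - offset for t in thresholds]


# Hypothetical thresholds and offset, chosen to be exact in binary floating point.
original = [0.5, 1.5, 2.5]
folded = fold_offset(original, 0.25)
```

After folding, thresholding the raw value against `folded` yields the same integer as adding the offset first and thresholding against `original`, so a separate offset step is not needed in this hypothetical example.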
 

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERICH C. TZOU whose telephone number is (571)272-9852. The examiner can normally be reached Monday-Friday 6:00AM-5:00PM PST with alternate Fridays off.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann J. Lo, can be reached on 571-272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/E.C.T./Examiner, Art Unit 2126      
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126