Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This action is in response to amendments and remarks filed on 4/13/2022. In the current amendments, claims 1 and 10 are amended and claim 20 is added. Claims 1-20 are pending and have been examined.
In response to the amendments and remarks filed on 4/13/2022, the 35 U.S.C. 101 rejection of claims 1-19 made in the previous Office Action has been withdrawn.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 9-10, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Lin et al. (US 10373050 B2; hereinafter “Lin-1”) in view of Lin et al. (“Fixed Point Quantization of Deep Convolutional Networks”; hereinafter “Lin-2”) in view of Kum et al. (“Combined word-length optimization and high-level synthesis of digital signal processing systems”).
Regarding Claim 1,
Lin-1 teaches a processor implemented method, the method comprising (Lin-1, Col. 2 Lines 33-36, “An apparatus for quantizing a floating point machine learning network to obtain a fixed point machine learning network using a quantizer may include a memory unit and at least one processor coupled to the memory unit” teaches at least one processor).
performing training or an inference operation with a neural network, by (Lin-1, Col. 1 Lines 35-40, “Theses weight values are determined by the iterative flow of training data through the network (e.g., weight values are established during a training phase in which the network learns how to identify particular classes by their typical input data characteristics)” teaches training through the neural network).
obtaining a parameter for the neural network in a floating-point format (Lin-1, Fig. 8 and Col. 14 Lines 2-12, “In block 802, at least one moment of an input distribution of a floating point machine learning network is selected. The at least one moment of the input distribution of the floating point machine learning network may include a mean, a variance or other like moment of the input distribution. In block 804, quantizer parameters for quantizing values of the floating point machine learning network are determined based on the selected moment of the input distribution of the floating point machine learning network” teaches obtaining the parameters of the network in floating point values).
… quantizing the parameter in the floating-point format to a parameter in the fixed-point format (Lin-1, Col. 2 Lines 15-21, “The method may also include determining quantizer parameters for quantizing values of the floating point machine learning network based at least in part on the at least one selected moment of the input distribution of the floating point machine learning network to obtain corresponding values of the fixed point machine learning network” teaches determining quantized parameters for the floating point machine learning network to obtain corresponding values of the fixed point machine learning network).
… generating the trained neural network or a result of the inference operation, dependent on results of the quantizing of the parameter (Lin-1, Col. 2 Lines 15-21, “The method may also include determining quantizer parameters for quantizing values of the floating point machine learning network based at least in part on the at least one selected moment of the input distribution of the floating point machine learning network to obtain corresponding values of the fixed point machine learning network” teaches determining quantized parameters for the floating point machine learning network to obtain corresponding values of the fixed point machine learning network. Col. 13 Lines 49-53, “In one configuration, after quantizing the floating point model into a fixed point model, the fixed point network is fine-tuned via additional training to further improve the network performance. Fine-tuning may include training via back-propagation” teaches generating a trained neural network after quantization).
Lin-1 does not appear to explicitly teach applying a fractional length of a fixed-point format to the parameter in the floating-point format.
However, Lin-2, teaches applying a fractional length of a fixed-point format to the parameter in the floating-point format (Lin-2, pg. 4 Section 3.3, “Note that determining the fixed point format is equivalent to determining the resolution, which in turn means identifying the number of fractional bits it requires to represent the number. The following equations can be used to compute the number of fractional bits: • Determine the effective standard deviation of the quantity being quantized: ξ. • Calculate step size via Table 1: s = ξ · Stepsize(β). • Compute number of fractional bits: n = −[log2 s]” teaches computing the number of fractional bits (corresponds to fraction length) of a fixed point format. Pg. 3 Section 3.3, “Any floating point DCN model can be converted to fixed point by following these steps: • Run a forward pass in floating point using a large set of typical inputs and record the activations” teaches parameter in float point format and converting float to fixed point).
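For illustration only, the three quoted steps from Lin-2 Section 3.3 can be sketched as follows. This is a minimal sketch and not code from Lin-2 or from the application. It assumes that the bracket in “n = −[log2 s]” denotes a ceiling, and the function names, the `unit_step` argument (standing in for the Stepsize(β) value that Lin-2 reads from its Table 1), and the clipping helper are hypothetical.

```python
import math


def fractional_bits(std_dev: float, unit_step: float) -> int:
    """Number of fractional bits n for a quantity with standard deviation std_dev.

    unit_step is a hypothetical stand-in for Stepsize(beta); its actual value
    would come from Table 1 of Lin-2 and is not reproduced here.
    """
    if std_dev <= 0 or unit_step <= 0:
        raise ValueError("std_dev and unit_step must be positive")
    step = std_dev * unit_step            # s = xi * Stepsize(beta)
    return -math.ceil(math.log2(step))    # n = -ceil(log2 s); ceiling is an assumption


def quantize_to_fixed(value: float, frac_bits: int, bit_width: int) -> int:
    """Apply a fractional length to a floating-point value (hypothetical helper).

    Scales by 2**frac_bits, rounds half up, and saturates to a signed
    bit_width-bit integer.
    """
    scaled = math.floor(value * (2.0 ** frac_bits) + 0.5)
    lowest = -(1 << (bit_width - 1))
    highest = (1 << (bit_width - 1)) - 1
    return max(lowest, min(highest, scaled))


# Placeholder inputs chosen only to exercise the sketch.
n = fractional_bits(std_dev=0.5, unit_step=0.3)
q = quantize_to_fixed(0.3, n, bit_width=8)
```

With the placeholder inputs, the step size is 0.15, so two fractional bits are selected and the value 0.3 is stored as the integer 1 (that is, 0.25).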
Lin-1 and Lin-2 are analogous art because they are from the same field of endeavor and the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1 with Lin-2, with motivation to apply a fractional length of a fixed-point format to the parameter in the floating-point format. “We show that the naive method of quantizing all the layers in the DCN with uniform bit-width value results in DCN networks with subpar performance in terms of error rates relative to our proposed approach of SQNR based optimization of bit-widths. Specifically, we present results for a floating point DCN trained CIFAR-10 benchmark, which on conversion to its fixed point counter-part results in >20 % reduction in model size without any loss in accuracy” (Lin-2, Conclusion). The proposed teaching is beneficial in that it helps reduce the model size without any loss in accuracy.
Lin-1 in view of Lin-2 does not appear to explicitly teach performing, by an integer arithmetic logic unit (ALU), an operation for determining whether to round off a fixed point based on a most significant bit among bit values to be discarded after a quantization process and based on a determination result of the operation with the ALU.
However, Kum et al., teaches performing, by an integer arithmetic logic unit (ALU), an operation for determining whether to round off a fixed point based on a most significant bit among bit values to be discarded after a quantization process (Kum et al., Fig. 9 and pg. 927 Section IV. B, “For a right shift, the most significant bits (MSBs) are sign extended and the MSB of truncated bits is used as the carry-in signal of the adders for rounding. For a left shift, the least significant bits (LSBs) are filled with zeros and the MSBs are thrown away, but overflows do not occur because the IWLs are carefully determined throughout the range estimation” teaches the arithmetic logic unit (see Fig. 9) and further teaches rounding of the adders based on the most significant bits and the MSBs being thrown away (corresponds to discarded) after quantization).
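For illustration only, the right-shift rounding described in the quoted passage of Kum et al. can be sketched with integer operations alone. This is a minimal sketch under stated assumptions, not the circuit of Kum et al. Fig. 9: the function name is hypothetical, and Python's arithmetic right shift on signed integers stands in for the sign extension of the MSBs.

```python
def round_shift_right(value: int, shift: int) -> int:
    """Right-shift a fixed-point integer and round using the discarded bits.

    The most significant bit among the bits to be discarded is used as a
    carry-in to the adder, so the decision to round up needs only integer
    shift, mask, and add operations.
    """
    if shift < 0:
        raise ValueError("shift must be non-negative")
    if shift == 0:
        return value                           # nothing is discarded
    carry_in = (value >> (shift - 1)) & 1      # MSB of the bits to be discarded
    return (value >> shift) + carry_in         # sign-extending shift plus carry-in


# 0b1011 (11) with two bits discarded: the discarded bits are 11, MSB = 1, round up.
rounded_up = round_shift_right(0b1011, 2)      # 3
# 0b1001 (9) with two bits discarded: the discarded bits are 01, MSB = 0, truncate.
truncated = round_shift_right(0b1001, 2)       # 2
```

The same expression handles negative values in two's complement: -11 shifted by two bits yields -3, and -10 yields -2, which matches rounding half toward positive infinity.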
… based on a determination result of the operation with the ALU (Kum et al., Fig. 9 and pg. 928, teaches the operation with the arithmetic logic unit and its results).
Lin-1, Lin-2, and Kum et al. are analogous art because they are from the same field of endeavor and the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1 and Lin-2 with Kum et al., with motivation to perform, by an integer arithmetic logic unit (ALU), an operation for determining whether to round off a fixed point based on a most significant bit among bit values to be discarded after a quantization process and based on a determination result of the operation with the ALU. “A combined WL optimization and high-level synthesis approach that results in a more efficient or cost-effective design when compared with the previous WL optimization followed by high-level synthesis approaches. The developed method also requires less time for optimization since the use of the hardware sharing information for signal grouping results in fewer signal groups” (Kum et al., Conclusion). The proposed teaching is beneficial in that it results in a more efficient or cost-effective design that also requires less time for optimization.
Regarding Claim 9,
Lin-1 in view of Lin-2 in view of Kum et al. teaches the method of claim 1, further comprising
Lin-1 further teaches to quantize the parameter in the floating- point format processed in the first layer back to a parameter in the fixed-point format (Lin-1, Col. 2 Lines 15-21, “The method may also include determining quantizer parameters for quantizing values of the floating point machine learning network based at least in part on the at least one selected moment of the input distribution of the floating point machine learning network to obtain corresponding values of the fixed point machine learning network” teaches determining quantized parameters for the floating point machine learning network to obtain corresponding values of the fixed point machine learning network).
Lin-2 et al. further teaches converting the quantized parameter in the fixed-point format to the floating-point format based on processing conditions of a first layer of the neural network that receives the parameter in the floating-point format, from among layers of the neural network (Lin-2 et al., pg. 2 Section 3, “In this section, we will propose an algorithm to convert a floating point DCN to fixed point. For a given layer of DCN the goal of conversion is to represent the input activations, the output activations, and the parameters of that layer in fixed point. This can be seen as a process of quantization” teaches a quantization process that converts the parameters of a given layer (corresponds to the first layer of the neural network) from floating point to fixed point. Pg. 3-4 Section 3.3, “Any floating point DCN model can be converted to fixed point by following these steps: • Run a forward pass in floating point using a large set of typical inputs and record the activations. • Collect the statistics of weights, biases and activations for each layer. • Determine the fixed point formats of the weights, biases and activations for each layer” teaches the inputs (corresponds to input to the first layer of the neural network) which consist of weights, biases and activation (corresponds to the parameter) for the neural network being in floating point).
providing the parameter in the floating-point format to the first layer (Lin-2 et al., pg. 3-4 Section 3.3, “Any floating point DCN model can be converted to fixed point by following these steps: • Run a forward pass in floating point using a large set of typical inputs and record the activations. • Collect the statistics of weights, biases and activations for each layer. • Determine the fixed point formats of the weights, biases and activations for each layer” teaches the inputs (corresponds to input to the first layer of the neural network) which consist of weights, biases and activation (corresponds to the parameter) for the neural network being in floating point).
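For illustration only, the round trip recited in this claim (fixed point to floating point for a layer that processes floating-point values, then back to fixed point) can be sketched as follows. This is a minimal sketch of the claim language as characterized above, not code from Lin-1 or Lin-2; the helper names, the non-negative fractional length, and the stand-in layer are all hypothetical.

```python
import math


def fixed_to_float(q: int, frac_bits: int) -> float:
    """Convert a fixed-point integer with frac_bits fractional bits to float.

    Assumes frac_bits >= 0.
    """
    return q / (1 << frac_bits)


def float_to_fixed(x: float, frac_bits: int) -> int:
    """Quantize a float back to fixed point, rounding half up.

    Assumes frac_bits >= 0.
    """
    return math.floor(x * (1 << frac_bits) + 0.5)


def first_layer(x: float) -> float:
    """Hypothetical layer that only accepts floating-point input."""
    return 0.5 * x + 0.25


frac_bits = 2
q_in = 6                                    # represents 1.5 with two fractional bits
x = fixed_to_float(q_in, frac_bits)         # provided to the layer in floating point
y = first_layer(x)                          # processed in floating point
q_out = float_to_fixed(y, frac_bits)        # quantized back to fixed point
```

Here the layer output is 1.0, which is stored as the integer 4 with two fractional bits.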
Lin-1, Lin-2, and Kum et al. are analogous art because they are from the same field of endeavor and the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1 and Lin-2 with Kum et al., with motivation to convert the quantized parameter in the fixed-point format to the floating-point format based on processing conditions of a first layer of the neural network that receives the parameter in the floating-point format, from among layers of the neural network, and to provide the parameter in the floating-point format to the first layer. “We show that the naive method of quantizing all the layers in the DCN with uniform bit-width value results in DCN networks with subpar performance in terms of error rates relative to our proposed approach of SQNR based optimization of bit-widths. Specifically, we present results for a floating point DCN trained CIFAR-10 benchmark, which on conversion to its fixed point counter-part results in >20 % reduction in model size without any loss in accuracy” (Lin-2, Conclusion). The proposed teaching is beneficial in that it helps reduce the model size without any loss in accuracy.
Kum et al. further teaches performing the operation with the integer ALU (Kum et al., Fig. 9 and pg. 928, teaches the operation with the arithmetic logic unit).  
Lin-1, Lin-2, and Kum et al. are analogous art because they are from the same field of endeavor and the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1 and Lin-2 with Kum et al., with motivation to perform the operation with the integer ALU. “A combined WL optimization and high-level synthesis approach that results in a more efficient or cost-effective design when compared with the previous WL optimization followed by high-level synthesis approaches. The developed method also requires less time for optimization since the use of the hardware sharing information for signal grouping results in fewer signal groups” (Kum et al., Conclusion). The proposed teaching is beneficial in that it results in a more efficient or cost-effective design that also requires less time for optimization.
Regarding Claim 10,
Lin-1 teaches a neural network apparatus, the apparatus comprising: a processor configured to (Lin-1, FIG. 1 and Col. 4 Lines 52-58, “FIG. 1 illustrates an example implementation of the aforementioned reduction of computation complexity by quantizing a floating point neural network to obtain a fixed point neural network using a system-on-a-chip (SOC) 100, which may include a general-purpose processor (CPU) or multi-core general-purpose processors (CPUs) 102 in accordance with certain aspects of the present disclosure” teaches a neural network comprising of a processor)
perform training or an inference operation with a neural network, which includes the processor being further configured to (Lin-1, Col. 1 Lines 35-40, “Theses weight values are determined by the iterative flow of training data through the network (e.g., weight values are established during a training phase in which the network learns how to identify particular classes by their typical input data characteristics)” teaches training through the neural network).
obtain a parameter for the neural network in a floating-point format (Lin-1, Fig. 8 and Col. 14 Lines 2-12, “In block 802, at least one moment of an input distribution of a floating point machine learning network is selected. The at least one moment of the input distribution of the floating point machine learning network may include a mean, a variance or other like moment of the input distribution. In block 804, quantizer parameters for quantizing values of the floating point machine learning network are determined based on the selected moment of the input distribution of the floating point machine learning network” teaches obtaining the parameters of the network in floating point values).
… quantize the parameter in the floating-point format to a parameter in the fixed-point format (Lin-1, Col. 2 Lines 15-21, “The method may also include determining quantizer parameters for quantizing values of the floating point machine learning network based at least in part on the at least one selected moment of the input distribution of the floating point machine learning network to obtain corresponding values of the fixed point machine learning network” teaches determining quantized parameters for the floating point machine learning network to obtain corresponding values of the fixed point machine learning network).
… generate the trained neural network or a result of the inference operation dependent on results of the quantizing of the parameter (Lin-1, Col. 2 Lines 15-21, “The method may also include determining quantizer parameters for quantizing values of the floating point machine learning network based at least in part on the at least one selected moment of the input distribution of the floating point machine learning network to obtain corresponding values of the fixed point machine learning network” teaches determining quantized parameters for the floating point machine learning network to obtain corresponding values of the fixed point machine learning network. Col. 13 Lines 49-53, “In one configuration, after quantizing the floating point model into a fixed point model, the fixed point network is fine-tuned via additional training to further improve the network performance. Fine-tuning may include training via back-propagation” teaches generating a trained neural network after quantization).
Lin-1 does not appear to explicitly teach apply a fractional length of a fixed-point format to the parameter in the floating-point format.
However, Lin-2, teaches apply a fractional length of a fixed-point format to the floating-point format (Lin-2, pg. 4 Section 3.3, “Note that determining the fixed point format is equivalent to determining the resolution, which in turn means identifying the number of fractional bits it requires to represent the number. The following equations can be used to compute the number of fractional bits: • Determine the effective standard deviation of the quantity being quantized: ξ. • Calculate step size via Table 1: s = ξ · Stepsize(β). • Compute number of fractional bits: n = −[log2 s]” teaches computing the number of fractional bits (corresponds to fraction length) of a fixed point format. Pg. 3 Section 3.3, “Any floating point DCN model can be converted to fixed point by following these steps: • Run a forward pass in floating point using a large set of typical inputs and record the activations” teaches parameter in float point format and converting float to fixed point).
Lin-1 and Lin-2 are analogous art because they are from the same field of endeavor and the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1 with Lin-2, with motivation to apply a fractional length of a fixed-point format to the parameter in the floating-point format. “We show that the naive method of quantizing all the layers in the DCN with uniform bit-width value results in DCN networks with subpar performance in terms of error rates relative to our proposed approach of SQNR based optimization of bit-widths. Specifically, we present results for a floating point DCN trained CIFAR-10 benchmark, which on conversion to its fixed point counter-part results in >20 % reduction in model size without any loss in accuracy” (Lin-2, Conclusion). The proposed teaching is beneficial in that it helps reduce the model size without any loss in accuracy.
Lin-1 in view of Lin-2 does not appear to explicitly teach perform, by an integer arithmetic logic unit (ALU), an operation for determining whether to round off a fixed point based on a most significant bit among bit values to be discarded after a quantization process and based on a result of the operation with the ALU.
However, Kum et al., teaches perform, by an integer arithmetic logic unit (ALU), an operation for determining whether to round off a fixed point based on a most significant bit among bit values to be discarded after a quantization process (Kum et al., Fig. 9 and pg. 927 Section IV. B, “For a right shift, the most significant bits (MSBs) are sign extended and the MSB of truncated bits is used as the carry-in signal of the adders for rounding. For a left shift, the least significant bits (LSBs) are filled with zeros and the MSBs are thrown away, but overflows do not occur because the IWLs are carefully determined throughout the range estimation” teaches the arithmetic logic unit (see Fig. 9) and further teaches rounding of the adders based on the most significant bits and the MSBs being thrown away (corresponds to discarded) after quantization).
… based on a result of the operation with the ALU (Kum et al., Fig. 9 and pg. 928, teaches the operation with the arithmetic logic unit and its results).
Lin-1, Lin-2, and Kum et al. are analogous art because they are from the same field of endeavor and the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1 and Lin-2 with Kum et al., with motivation to perform, by an integer arithmetic logic unit (ALU), an operation for determining whether to round off a fixed point based on a most significant bit among bit values to be discarded after a quantization process and based on a result of the operation with the ALU. “A combined WL optimization and high-level synthesis approach that results in a more efficient or cost-effective design when compared with the previous WL optimization followed by high-level synthesis approaches. The developed method also requires less time for optimization since the use of the hardware sharing information for signal grouping results in fewer signal groups” (Kum et al., Conclusion). The proposed teaching is beneficial in that it results in a more efficient or cost-effective design that also requires less time for optimization.
Regarding Claim 18,
Lin-1 in view of Lin-2 in view of Kum et al. teaches the neural network apparatus of claim 10, wherein the processor is further configured to
Lin-1 further teaches to quantize the parameter in the floating-point format processed in the first layer back to a parameter in the fixed-point format (Lin-1, Col. 2 Lines 15-21, “The method may also include determining quantizer parameters for quantizing values of the floating point machine learning network based at least in part on the at least one selected moment of the input distribution of the floating point machine learning network to obtain corresponding values of the fixed point machine learning network” teaches determining quantized parameters for the floating point machine learning network to obtain corresponding values of the fixed point machine learning network).
Lin-2 et al. further teaches convert the quantized parameter in the fixed-point format to the floating-point format based on processing conditions of a first layer of the neural network that receives the parameter in the floating-point format, from among layers of the neural network (Lin-2 et al., pg. 2 Section 3, “In this section, we will propose an algorithm to convert a floating point DCN to fixed point. For a given layer of DCN the goal of conversion is to represent the input activations, the output activations, and the parameters of that layer in fixed point. This can be seen as a process of quantization” teaches a quantization process that converts the parameters of a given layer (corresponds to the first layer of the neural network) from floating point to fixed point. Pg. 3-4 Section 3.3, “Any floating point DCN model can be converted to fixed point by following these steps: • Run a forward pass in floating point using a large set of typical inputs and record the activations. • Collect the statistics of weights, biases and activations for each layer. • Determine the fixed point formats of the weights, biases and activations for each layer” teaches the inputs (corresponds to input to the first layer of the neural network) which consist of weights, biases and activation (corresponds to the parameter) for the neural network being in floating point).
provide the parameter in the floating-point format to the first layer (Lin-2 et al., pg. 3-4 Section 3.3, “Any floating point DCN model can be converted to fixed point by following these steps: • Run a forward pass in floating point using a large set of typical inputs and record the activations. • Collect the statistics of weights, biases and activations for each layer. • Determine the fixed point formats of the weights, biases and activations for each layer” teaches the inputs (corresponds to input to the first layer of the neural network) which consist of weights, biases and activation (corresponds to the parameter) for the neural network being in floating point).
Lin-1, Lin-2, and Kum et al. are analogous art because they are from the same field of endeavor and the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1 and Lin-2 with Kum et al., with motivation to convert the quantized parameter in the fixed-point format to the floating-point format based on processing conditions of a first layer of the neural network that receives the parameter in the floating-point format, from among layers of the neural network, and to provide the parameter in the floating-point format to the first layer. “We show that the naive method of quantizing all the layers in the DCN with uniform bit-width value results in DCN networks with subpar performance in terms of error rates relative to our proposed approach of SQNR based optimization of bit-widths. Specifically, we present results for a floating point DCN trained CIFAR-10 benchmark, which on conversion to its fixed point counter-part results in >20 % reduction in model size without any loss in accuracy” (Lin-2, Conclusion). The proposed teaching is beneficial in that it helps reduce the model size without any loss in accuracy.
Kum et al. further teaches perform the operation with the integer ALU (Kum et al., Fig. 9 and pg. 928, teaches the operation with the arithmetic logic unit).  
Lin-1, Lin-2, and Kum et al. are analogous art because they are from the same field of endeavor and the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1 and Lin-2 with Kum et al., with motivation to perform the operation with the integer ALU. “A combined WL optimization and high-level synthesis approach that results in a more efficient or cost-effective design when compared with the previous WL optimization followed by high-level synthesis approaches. The developed method also requires less time for optimization since the use of the hardware sharing information for signal grouping results in fewer signal groups” (Kum et al., Conclusion). The proposed teaching is beneficial in that it results in a more efficient or cost-effective design that also requires less time for optimization.
Regarding Claim 19,
Lin-1 in view of Lin-2 in view of Kum et al. teaches the method of claim 1.
Lin-1 further teaches non-transitory computer-readable recording medium having recorded thereon a computer program, which, when executed by a computer, performs the method… (Lin-1, Col. 2 Lines 46-59, “A non-transitory computer-readable medium having program code recorded thereon for quantizing a floating point machine learning network to obtain a fixed point machine learning network using a quantizer when executed by a processor may include program code to select at least one moment of an input distribution of the floating point machine learning network. The non-transitory computer-readable medium may further include program code to determine quantizer parameters for quantizing values of the floating point machine learning network based at least in part on the at least one selected moment of the input distribution of the floating point machine learning network to obtain corresponding values of the fixed point machine learning network” teaches a non-transitory computer-readable medium containing program code).
Regarding Claim 20,
Lin-1 in view of Lin-2 in view of Kum et al. teaches the neural network apparatus of claim 10,
Lin-1 further teaches further comprising a memory storing instruction, which when executed by the processors, configure the processor to perform the obtaining of the parameter (Lin-1, Col. Lines, “The processor may be responsible for managing the bus and general processing, including the execution of software stored on the machine-readable media. The processor may be implemented with one or more general-purpose and/or special-purpose processors. Examples include microprocessors, microcontrollers, DSP processors, and other circuitry that can execute software. Software shall be construed broadly to mean instructions, data, or any combination thereof, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. Machine-readable media may include, by way of example, random access memory (RAM), flash memory, read only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable Read-only memory (EEPROM), registers, magnetic disks, optical disks, hard drives, or any other suitable storage medium, or any combination thereof” teaches a memory storing instructions that are executed by the processor. Fig. 8 and Col. 14 Lines 2-12, “In block 802, at least one moment of an input distribution of a floating point machine learning network is selected. The at least one moment of the input distribution of the floating point machine learning network may include a mean, a variance or other like moment of the input distribution. In block 804, quantizer parameters for quantizing values of the floating point machine learning network are determined based on the selected moment of the input distribution of the floating point machine learning network” teaches obtaining the parameters of the network in floating point values). 
… the determining, and the quantizing of the parameter (Lin-1, Col. 2 Lines 15-21, “The method may also include determining quantizer parameters for quantizing values of the floating point machine learning network based at least in part on the at least one selected moment of the input distribution of the floating point machine learning network to obtain corresponding values of the fixed point machine learning network” teaches determining quantized parameters for the floating point machine learning network to obtain corresponding values of the fixed point machine learning network).
Lin-2 further teaches the applying of the fractional length to the floating-point format (Lin-2, pg. 4 Section 3.3, “Note that determining the fixed point format is equivalent to determining the resolution, which in turn means identifying the number of fractional bits it requires to represent the number. The following equations can be used to compute the number of fractional bits: • Determine the effective standard deviation of the quantity being quantized: ξ. • Calculate step size via Table 1: s = ξ · Stepsize(β). • Compute number of fractional bits: n = −[log2 s]” teaches computing the number of fractional bits (corresponds to fraction length) of a fixed point format).
Lin-1 in view of Lin-2 in view of Kum et al. are analogous art because they are from the same field of endeavor and are from the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1 and Lin-2 with Kum et al., with motivation of the applying of the fractional length to the floating-point format. “We show that the naive method of quantizing all the layers in the DCN with uniform bit-width value results in DCN networks with subpar performance in terms of error rates relative to our proposed approach of SQNR based optimization of bit-widths. Specifically, we present results for a floating point DCN trained CIFAR-10 benchmark, which on conversion to its fixed point counter-part results in >20 % reduction in model size without any loss in accuracy” (Lin-2, Conclusion). The proposed teaching is beneficial in that it helps reduce the model size without any loss in accuracy.
Claims 2-4 and 11-13 are rejected under 35 U.S.C. 103 as being unpatentable over Lin-1 in view of Lin-2 in view of Kum et al. and in further view of Rao et al. (“IMPLEMENTATION OF THE STANDARD FLOATING POINT MAC USING IEEE 754 FLOATING POINT ADDER”).
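For illustration only, the fractional-bit computation quoted from Lin-2, Section 3.3 above can be sketched as follows. This is a minimal sketch and not code from Lin-2: the function name `fractional_bits` is hypothetical, the bracket in “n = −[log2 s]” is assumed to denote a ceiling, and the Stepsize(β) factor is passed in by the caller because Table 1 of Lin-2 is not reproduced in this action.

```python
import math


def fractional_bits(xi: float, stepsize_beta: float) -> int:
    """Number of fractional bits for a quantity with effective std. dev. xi.

    stepsize_beta stands in for the Stepsize(beta) entry of Lin-2's Table 1;
    its value is supplied by the caller (assumption: not taken from the table).
    """
    if xi <= 0 or stepsize_beta <= 0:
        raise ValueError("xi and stepsize_beta must be positive")
    s = xi * stepsize_beta            # step size: s = xi * Stepsize(beta)
    return -math.ceil(math.log2(s))   # n = -ceil(log2 s), ceiling assumed


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
print(fractional_bits(1.0, 0.25))  # s = 0.25, so log2 s = -2 and n = 2
```

Under these assumptions, a smaller step size yields a larger number of fractional bits, which matches the quoted statement that fixing the format is equivalent to fixing the resolution.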
Regarding Claim 2,
Lin-1 in view of Lin-2 in view of Kum et al. teaches the method of claim 1, 
Kum et al. further teaches wherein the performing of the operation with the ALU comprises (Kum et al., Fig. 9 and pg. 928, teaches the operation with the arithmetic logic unit).  
Lin-1 in view of Lin-2 in view of Kum et al. are analogous art because they are from the same field of endeavor and are from the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1 and Lin-2 with Kum et al., with motivation of performing the operation with the integer ALU. “A combined WL optimization and high-level synthesis approach that results in a more efficient or cost-effective design when compared with the previous WL optimization followed by high-level synthesis approaches. The developed method also requires less time for optimization since the use of the hardware sharing information for signal grouping results in fewer signal groups” (Kum et al., Conclusion). The proposed teaching is beneficial in that it results in a more efficient or cost-effective design that also requires less time for optimization.
Lin-1 in view of Lin-2 in view of Kum et al. does not appear to explicitly teach extracting a sign, a first exponent value, and a first mantissa value from the parameter in the floating-point format, calculating a second exponent value based on the first exponent value, the fractional length of the fixed-point format, and a bias constant that is determined based on a format of the floating-point, and calculating a second mantissa value by performing a bit manipulation operation and an integer operation with respect to the first mantissa value, based on the second exponent value.
However, Rao et al. teaches extracting a sign, a first exponent value, and a first mantissa value from the parameter in the floating-point format (Rao et al., Fig. 1 and pg. 719 Section III.A, “In half precision, the field of IEEE 754 standard can be represented as, for the sign, 1-bit; for the exponent, 4-bits; and for the mantissa, 11-bits” teaches the sign, an exponent value, and a mantissa value being extracted from the field of IEEE 754 standard (corresponds to floating-point format)).
calculating a second exponent value based on the first exponent value, the fractional length of the fixed-point format, and a bias constant that is determined based on a format of the floating-point (Rao et al., pg. 721 Section III.C, “The exponents, E1 and E2 of the two operands of N1 and N2 are added up from which the bias of 7 for half precision is subtracted and the exponent value is finalized based on the carry propagated from the result of multiplication of the two mantissas” teaches determining the finalized exponent value (corresponds to second exponent value) based on the exponents, E1 and E2 of the two operands of N1 and N2 (corresponds to the first exponent value). Pg. 717 Section II, “The fixed point MAC is constituted by the fixed point adder, fixed point multiplier and a shifter. The sampled values which are x(n) will be given as input to the shifter. The shifter will shift the value of ‘n’ for different samples starting from first sample n=0 to the last sample i.e., n=N-1 where ‘n’ indicates number of samples and ‘N’ indicates length of the filter” teaches determining the filter length (corresponds to fractional length) of the fixed-point format. Pg. 719 Section III.A, “IEEE 754 uses biased representation for exponent which is nothing but, Value of exponent = Val(E) = E-Bias, where Bias is a constant” teaches determining a bias constant based on IEEE 754 (corresponds to floating-point format)).
calculating a second mantissa value by performing a bit manipulation operation and an integer operation with respect to the first mantissa value, based on the second exponent value (Rao et al., pg. 720 Section III.B, “While adding the two floating point numbers the smallest number is to be identified so that eight bit subtractor is required for the exponent. Similarly, it requires one 2×1 multiplexer to select the input data depending upon the status of the select line; one 32 bit swap register to swap the smallest mantissa with the highest mantissa; one 16 bit register to shift right the smallest mantissa to the right side bit by bit to make equal the smallest mantissa to the biggest mantissa value depending upon the difference value of the two exponents” teaches determining the highest mantissa (corresponds to the second mantissa) with respect to the smallest mantissa (corresponds to the one mantissa) based upon the value of the two exponents (corresponds to the second exponent value)).
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. are analogous art because they are from the same field of endeavor and are from the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1, Lin-2, and Kum et al. with Rao et al., with motivation to extract a sign, a first exponent value, and a first mantissa value from the parameter in the floating-point format, calculate a second exponent value based on the first exponent value, the fractional length of the fixed-point format, and a bias constant that is determined based on a format of the floating-point, and calculate a second mantissa value by performing a bit manipulation operation and an integer operation with respect to the first mantissa value, based on the second exponent value. “Hence, to improve the performance of the traditional fixed point MAC, in this work we implemented the standard floating point MAC using IEEE 754 floating point adder. This can be used to design all floating point DSP processors through the standard floating point MAC” (Rao et al., Abstract). The proposed teaching is beneficial in that it helps improve the performance of the traditional MAC.
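For illustration only, the field extraction recited in claim 2 (a sign, a first exponent value, and a first mantissa value) can be sketched with integer bit operations. This is a minimal sketch under stated assumptions: it uses the IEEE 754 binary32 layout (1 sign bit, 8 exponent bits, 23 mantissa bits) rather than the half-precision layout quoted from Rao et al., and the function name `decompose_float32` is hypothetical. It is not presented as the method of the application or of any cited reference.

```python
import struct


def decompose_float32(x: float) -> tuple[int, int, int]:
    """Split a value, stored as IEEE 754 binary32, into (sign, exponent, mantissa)."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, still biased
    mantissa = bits & 0x7FFFFF      # 23 bits, hidden bit not included
    return sign, exponent, mantissa


print(decompose_float32(1.0))   # (0, 127, 0)
print(decompose_float32(-2.5))  # (1, 128, 2097152)
```

The exponent returned here is the stored (biased) field, which is why a separate bias subtraction is needed before it can be combined with a fractional length.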
Regarding Claim 3,
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. teaches the method of claim 2, 
Rao et al. further teaches wherein the calculating of the second exponent value comprises: performing an integer operation of subtracting, from the first exponent value, the bias constant (Rao et al., pg. 719 Section III.A, “IEEE 754 uses biased representation for exponent which is nothing but, Value of exponent = Val(E) = E-Bias, where Bias is a constant” teaches calculating the value of exponent (corresponds to the second exponent) by subtracting the Bias (corresponds to the bias constant) from the exponent (corresponds to the first exponent)).
calculating the second exponent value by performing an integer operation of adding the fractional length to a result of the integer operation of subtracting the bias constant (Rao et al., Figure 3 and pg. 720 Section III.B, “Let S1, E1, M1 are the sign, exponent and mantissa of the first floating point operands of N1 and S2, E2, M2 are the sign, exponent and mantissa of the second floating point operands N2, then for the standard floating point adder, the explanation of the algorithm is as follows: a) Initially, the system reads the two operands of N1 and N2 for denormalization and infinity. Set the hidden bit of the fraction to 0 if numbers are denormalized otherwise set to 1. b) Using the 4-bit subtractor, the two exponents E1, E2 are compared. If E1 is less than E2, N1 and N2 are swapped which means that previous M2 is now referred to as M1 and vice versa. c) The smaller fraction, M2 is shifted right by the absolute difference result of the two exponents’ subtraction. Now both the numbers have the same exponent. d) Now the two mantissas of M1 and M2 are added. e) For the normalization, after addition the result is then passed through a leading one detector. f) Using the results from the leading one detector, if it is needed, the result is then shifted right by 1 bit to complete the normalization process. g) After normalization, using the default rounding mode the result is rounded to the nearest value. h) The exponent is adjusted using the results from the leading one detector. i) The sign is computed depending on the value of exponents of E1 and E2 which means that whichever the exponent is the maximum, that sign is computed. The result is registered after the overflow and underflow check” teaches an addition algorithm that determines the adjusted exponent (corresponds to second exponent) by adding the fractions of M1 and M2 (corresponds to fractional length)).
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. are analogous art because they are from the same field of endeavor and are from the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1, Lin-2, and Kum et al. with Rao et al., with motivation wherein the calculating of the second exponent value comprises: performing an integer operation of subtracting, from the first exponent value, the bias constant and calculating the second exponent value by performing an integer operation of adding the fractional length to a result of the integer operation of subtracting the bias constant. “Hence, to improve the performance of the traditional fixed point MAC, in this work we implemented the standard floating point MAC using IEEE 754 floating point adder. This can be used to design all floating point DSP processors through the standard floating point MAC” (Rao et al., Abstract). The proposed teaching is beneficial in that it helps improve the performance of the traditional MAC.
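For illustration only, the two integer operations recited in claim 3 (subtracting the bias constant from the first exponent value, then adding the fractional length) amount to the following arithmetic. This is a minimal sketch that mirrors the claim wording; the function name `second_exponent` is hypothetical, and the default bias of 127 assumes the IEEE 754 binary32 format rather than the half-precision bias of 7 quoted from Rao et al.

```python
FLOAT32_BIAS = 127  # assumption: binary32 exponent bias


def second_exponent(first_exponent: int, frac_length: int,
                    bias: int = FLOAT32_BIAS) -> int:
    """Return (first_exponent - bias) + frac_length using integer arithmetic only."""
    unbiased = first_exponent - bias  # integer subtraction of the bias constant
    return unbiased + frac_length     # integer addition of the fractional length


# 2.5 is stored in binary32 with exponent field 128; with 4 fractional bits:
print(second_exponent(128, 4))          # 5
# The same arithmetic with the half-precision bias of 7 quoted from Rao et al.:
print(second_exponent(8, 4, bias=7))    # 5
```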
Regarding Claim 4,
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. teaches the method of claim 2, 
Rao et al. further teaches wherein the calculating of the second mantissa value comprises (Rao et al., pg. 720 Section III.B, “While adding the two floating point numbers the smallest number is to be identified so that eight bit subtractor is required for the exponent. Similarly, it requires one 2×1 multiplexer to select the input data depending upon the status of the select line; one 32 bit swap register to swap the smallest mantissa with the highest mantissa; one 16 bit register to shift right the smallest mantissa to the right side bit by bit to make equal the smallest mantissa to the biggest mantissa value depending upon the difference value of the two exponents” teaches determining the highest mantissa (corresponds to the second mantissa) with respect to the smallest mantissa (corresponds to the one mantissa) based upon the value of the two exponents).
updating the first mantissa value by adding a bit value of 1 to a position before the first mantissa value (Rao et al., Figure 3 and pg. 720 Section III.B, “Let S1, E1, M1 are the sign, exponent and mantissa of the first floating point operands of N1 and S2, E2, M2 are the sign, exponent and mantissa of the second floating point operands N2, then for the standard floating point adder, the explanation of the algorithm is as follows: a) Initially, the system reads the two operands of N1 and N2 for denormalization and infinity. Set the hidden bit of the fraction to 0 if numbers are denormalized otherwise set to 1. b) Using the 4-bit subtractor, the two exponents E1, E2 are compared. If E1 is less than E2, N1 and N2 are swapped which means that previous M2 is now referred to as M1 and vice versa. c) The smaller fraction, M2 is shifted right by the absolute difference result of the two exponents’ subtraction. Now both the numbers have the same exponent. d) Now the two mantissas of M1 and M2 are added. e) For the normalization, after addition the result is then passed through a leading one detector. f) Using the results from the leading one detector, if it is needed, the result is then shifted right by 1 bit to complete the normalization process. g) After normalization, using the default rounding mode the result is rounded to the nearest value. h) The exponent is adjusted using the results from the leading one detector. i) The sign is computed depending on the value of exponents of E1 and E2 which means that whichever the exponent is the maximum, that sign is computed. The result is registered after the overflow and underflow check” teaches updating the mantissa (corresponds to the first mantissa value) by shifting the result right by 1 bit (corresponds to adding a bit value of 1 to a position) to complete the normalization process).
comparing a number of bits of the first mantissa value with a number of bits of the second mantissa value (Rao et al., Figure 3 and pg. 720 Section III.B, “one 32 bit swap register to swap the smallest mantissa with the highest mantissa; one 16 bit register to shift right the smallest mantissa to the right side bit by bit to make equal the smallest mantissa to the biggest mantissa value depending upon the difference value of the two exponents” teaches comparing the smallest mantissa (corresponds to the first mantissa value) with the number of bits of the biggest mantissa value (corresponds to the second mantissa value)).
shifting the updated first mantissa value to the right, based on a result of the comparing of the number of bits of the first mantissa value with the number of bits of the second mantissa value (Rao et al., Figure 3 and pg. 720 Section III.B, “one 32 bit swap register to swap the smallest mantissa with the highest mantissa; one 16 bit register to shift right the smallest mantissa to the right side bit by bit to make equal the smallest mantissa to the biggest mantissa value depending upon the difference value of the two exponents” teaches shifting the smallest mantissa (corresponds to the first mantissa value) to the right side bit by bit after comparing the bits to the biggest mantissa value (corresponds to the second mantissa value) to make equal).
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. are analogous art because they are from the same field of endeavor and are from the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1, Lin-2, and Kum et al. with Rao et al., with motivation of updating the first mantissa value by adding a bit value of 1 to a position before the first mantissa value, comparing a number of bits of the first mantissa value with a number of bits of the second mantissa value, and shifting the updated first mantissa value to the right, based on a result of the comparing of the number of bits of the first mantissa value with the number of bits of the second mantissa value. “Hence, to improve the performance of the traditional fixed point MAC, in this work we implemented the standard floating point MAC using IEEE 754 floating point adder. This can be used to design all floating point DSP processors through the standard floating point MAC” (Rao et al., Abstract). The proposed teaching is beneficial in that it helps improve the performance of the traditional MAC.
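For illustration only, one way the mantissa steps recited in claim 4 could combine with the exponent arithmetic of claim 3 is sketched below. This is a hedged sketch and not the applicant's implementation or the adder of Rao et al.: it assumes the binary32 layout, handles normal numbers only, truncates toward zero instead of rounding, and interprets the comparing step as comparing the mantissa width (23) with the second exponent value to choose the shift direction. The function name `float32_to_fixed` is hypothetical.

```python
import struct

MANTISSA_BITS = 23  # assumption: binary32 mantissa width
BIAS = 127          # assumption: binary32 exponent bias


def float32_to_fixed(x: float, frac_length: int) -> int:
    """Convert x to a signed fixed-point integer with frac_length fractional bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> MANTISSA_BITS) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0 or exponent == 0xFF:
        # Zero, subnormal, infinity, and NaN are outside this sketch.
        raise ValueError("only normal binary32 values are handled")
    # Place a 1 bit in the position before the stored mantissa (hidden bit).
    updated = (1 << MANTISSA_BITS) | mantissa
    # Second exponent: subtract the bias, then add the fractional length.
    e2 = exponent - BIAS + frac_length
    # Compare the mantissa width with e2 to decide the shift direction.
    shift = MANTISSA_BITS - e2
    magnitude = updated >> shift if shift >= 0 else updated << -shift
    return -magnitude if sign else magnitude


print(float32_to_fixed(2.5, 4))       # 2.5 * 2**4 = 40
print(float32_to_fixed(-0.15625, 8))  # -0.15625 * 2**8 = -40
```

Every step after the byte reinterpretation is an integer shift, mask, addition, or subtraction, which is the property the claim language emphasizes.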
Regarding Claim 11,
Lin-1 in view of Lin-2 in view of Kum et al. teaches the neural network apparatus of claim 10, 
Lin-1 further teaches wherein the processor is further configured to (Lin-1, Col. 2 Lines 33-36, “An apparatus for quantizing a floating point machine learning network to obtain a fixed point machine learning network using a quantizer may include a memory unit and at least one processor coupled to the memory unit” teaches at least one processor).
Lin-1 in view of Lin-2 in view of Kum et al. does not appear to explicitly teach extract a sign, a first exponent value, and a first mantissa value from the parameter in the floating-point format, calculate a second exponent value based on the first exponent value, the fractional length of the fixed-point format, and a bias constant that is determined based on a format of the floating-point, and calculate a second mantissa value by performing a bit manipulation operation and an integer operation with respect to the first mantissa value, based on the second exponent value.
However, Rao et al. teaches extract a sign, a first exponent value, and a first mantissa value from the parameter in the floating-point format (Rao et al., Fig. 1 and pg. 719 Section III.A, “In half precision, the field of IEEE 754 standard can be represented as, for the sign, 1-bit; for the exponent, 4-bits; and for the mantissa, 11-bits” teaches the sign, an exponent value, and a mantissa value being extracted from the field of IEEE 754 standard (corresponds to floating-point format)).
calculate a second exponent value based on the first exponent value, the fractional length of the fixed-point format, and a bias constant that is determined based on a format of the floating-point (Rao et al., pg. 721 Section III.C, “The exponents, E1 and E2 of the two operands of N1 and N2 are added up from which the bias of 7 for half precision is subtracted and the exponent value is finalized based on the carry propagated from the result of multiplication of the two mantissas” teaches determining the finalized exponent value (corresponds to second exponent value) based on the exponents, E1 and E2 of the two operands of N1 and N2 (corresponds to the first exponent value). Pg. 717 Section II, “The fixed point MAC is constituted by the fixed point adder, fixed point multiplier and a shifter. The sampled values which are x(n) will be given as input to the shifter. The shifter will shift the value of ‘n’ for different samples starting from first sample n=0 to the last sample i.e., n=N-1 where ‘n’ indicates number of samples and ‘N’ indicates length of the filter” teaches determining the filter length (corresponds to fractional length) of the fixed-point format. Section III.A, “IEEE 754 uses biased representation for exponent which is nothing but, Value of exponent = Val(E) = E-Bias, where Bias is a constant” teaches determining a bias constant based on IEEE 754 (corresponds to floating-point format)).
calculate a second mantissa value by performing a bit manipulation operation and an integer operation with respect to the first mantissa value, based on the second exponent value (Rao et al., pg. 720 Section III.B, “While adding the two floating point numbers the smallest number is to be identified so that eight bit subtractor is required for the exponent. Similarly, it requires one 2×1 multiplexer to select the input data depending upon the status of the select line; one 32 bit swap register to swap the smallest mantissa with the highest mantissa; one 16 bit register to shift right the smallest mantissa to the right side bit by bit to make equal the smallest mantissa to the biggest mantissa value depending upon the difference value of the two exponents” teaches determining the highest mantissa (corresponds to the second mantissa) with respect to the smallest mantissa (corresponds to the one mantissa) based upon the value of the two exponents (corresponds to the second exponent value)).
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. are analogous art because they are from the same field of endeavor and are from the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1, Lin-2, and Kum et al. with Rao et al., with motivation to extract a sign, a first exponent value, and a first mantissa value from the parameter in the floating-point format, calculate a second exponent value based on the first exponent value, the fractional length of the fixed-point format, and a bias constant that is determined based on a format of the floating-point, and calculate a second mantissa value by performing a bit manipulation operation and an integer operation with respect to the first mantissa value, based on the second exponent value. “Hence, to improve the performance of the traditional fixed point MAC, in this work we implemented the standard floating point MAC using IEEE 754 floating point adder. This can be used to design all floating point DSP processors through the standard floating point MAC” (Rao et al., Abstract). The proposed teaching is beneficial in that it helps improve the performance of the traditional MAC.
Regarding Claim 12,
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. teaches the neural network apparatus of claim 11, 
Lin-1 further teaches wherein the processor is further configured to (Lin-1, Col. 2 Lines 33-36, “An apparatus for quantizing a floating point machine learning network to obtain a fixed point machine learning network using a quantizer may include a memory unit and at least one processor coupled to the memory unit” teaches at least one processor).
Rao et al. further teaches perform an integer operation of subtracting, from the first exponent value, the bias constant (Rao et al., pg. 719 Section III.A, “IEEE 754 uses biased representation for exponent which is nothing but, Value of exponent = Val(E) = E-Bias, where Bias is a constant” teaches calculating the value of exponent (corresponds to the second exponent) by subtracting the Bias (corresponds to the bias constant) from the exponent (corresponds to the first exponent)).
calculate the second exponent value by performing an integer operation of adding the fractional length to a result of the integer operation of subtracting the bias constant (Rao et al., Figure 3 and pg. 720 Section III.B, “Let S1, E1, M1 are the sign, exponent and mantissa of the first floating point operands of N1 and S2, E2, M2 are the sign, exponent and mantissa of the second floating point operands N2, then for the standard floating point adder, the explanation of the algorithm is as follows: a) Initially, the system reads the two operands of N1 and N2 for denormalization and infinity. Set the hidden bit of the fraction to 0 if numbers are denormalized otherwise set to 1. b) Using the 4-bit subtractor, the two exponents E1, E2 are compared. If E1 is less than E2, N1 and N2 are swapped which means that previous M2 is now referred to as M1 and vice versa. c) The smaller fraction, M2 is shifted right by the absolute difference result of the two exponents’ subtraction. Now both the numbers have the same exponent. d) Now the two mantissas of M1 and M2 are added. e) For the normalization, after addition the result is then passed through a leading one detector. f) Using the results from the leading one detector, if it is needed, the result is then shifted right by 1 bit to complete the normalization process. g) After normalization, using the default rounding mode the result is rounded to the nearest value. h) The exponent is adjusted using the results from the leading one detector. i) The sign is computed depending on the value of exponents of E1 and E2 which means that whichever the exponent is the maximum, that sign is computed. The result is registered after the overflow and underflow check” teaches an addition algorithm that determines the adjusted exponent (corresponds to second exponent) by adding the fractions of M1 and M2 (corresponds to fractional length)).
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. are analogous art because they are from the same field of endeavor and are from the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1, Lin-2, and Kum et al. with Rao et al., with motivation to perform an integer operation of subtracting, from the first exponent value, the bias constant and calculate the second exponent value by performing an integer operation of adding the fractional length to a result of the integer operation of subtracting the bias constant. “Hence, to improve the performance of the traditional fixed point MAC, in this work we implemented the standard floating point MAC using IEEE 754 floating point adder. This can be used to design all floating point DSP processors through the standard floating point MAC” (Rao et al., Abstract). The proposed teaching is beneficial in that it helps improve the performance of the traditional MAC.
Regarding Claim 13,
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. teaches the neural network apparatus of claim 11, 
Lin-1 further teaches wherein the processor is further configured to (Lin-1, Col. 2 Lines 33-36, “An apparatus for quantizing a floating point machine learning network to obtain a fixed point machine learning network using a quantizer may include a memory unit and at least one processor coupled to the memory unit” teaches at least one processor).
Rao et al. further teaches update the first mantissa value by adding a bit value of 1 to a position before the first mantissa value (Rao et al., Figure 3 and pg. 720 Section III.B, “Let S1, E1, M1 are the sign, exponent and mantissa of the first floating point operands of N1 and S2, E2, M2 are the sign, exponent and mantissa of the second floating point operands N2, then for the standard floating point adder, the explanation of the algorithm is as follows: a) Initially, the system reads the two operands of N1 and N2 for denormalization and infinity. Set the hidden bit of the fraction to 0 if numbers are denormalized otherwise set to 1. b) Using the 4-bit subtractor, the two exponents E1, E2 are compared. If E1 is less than E2, N1 and N2 are swapped which means that previous M2 is now referred to as M1 and vice versa. c) The smaller fraction, M2 is shifted right by the absolute difference result of the two exponents’ subtraction. Now both the numbers have the same exponent. d) Now the two mantissas of M1 and M2 are added. e) For the normalization, after addition the result is then passed through a leading one detector. f) Using the results from the leading one detector, if it is needed, the result is then shifted right by 1 bit to complete the normalization process. g) After normalization, using the default rounding mode the result is rounded to the nearest value. h) The exponent is adjusted using the results from the leading one detector. i) The sign is computed depending on the value of exponents of E1 and E2 which means that whichever the exponent is the maximum, that sign is computed. The result is registered after the overflow and underflow check” teaches updating the mantissa (corresponds to the first mantissa value) by shifting the result right by 1 bit (corresponds to adding a bit value of 1 to a position) to complete the normalization process).
compare a number of bits of the first mantissa value with a number of bits of the second mantissa value (Rao et al., Figure 3 and pg. 720 Section III.B, “one 32 bit swap register to swap the smallest mantissa with the highest mantissa; one 16 bit register to shift right the smallest mantissa to the right side bit by bit to make equal the smallest mantissa to the biggest mantissa value depending upon the difference value of the two exponents” teaches comparing the smallest mantissa (corresponds to the first mantissa value) with the number of bits of the biggest mantissa value (corresponds to the second mantissa value)).
shift the updated first mantissa value to the right, based on a result of the comparing of the number of bits of the first mantissa value with the number of bits of the second mantissa value (Rao et al., Figure 3 and pg. 720 Section III.B, “one 32 bit swap register to swap the smallest mantissa with the highest mantissa; one 16 bit register to shift right the smallest mantissa to the right side bit by bit to make equal the smallest mantissa to the biggest mantissa value depending upon the difference value of the two exponents” teaches shifting the smallest mantissa (corresponds to the first mantissa value) to the right side bit by bit after comparing the bits to the biggest mantissa value (corresponds to the second mantissa value) to make equal).
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. are analogous art because they are from the same field of endeavor and are from the same problem-solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1, Lin-2, and Kum et al. with Rao et al., with motivation to update the first mantissa value by adding a bit value of 1 to a position before the first mantissa value, compare a number of bits of the first mantissa value with a number of bits of the second mantissa value, and shift the updated first mantissa value to the right, based on a result of the comparing of the number of bits of the first mantissa value with the number of bits of the second mantissa value. “Hence, to improve the performance of the traditional fixed point MAC, in this work we implemented the standard floating point MAC using IEEE 754 floating point adder. This can be used to design all floating point DSP processors through the standard floating point MAC” (Rao et al., Abstract). The proposed teaching is beneficial in that it helps improve the performance of the traditional MAC.
Claims 5-6, 8, 14-15, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. and in further view of Lutz et al. (US 20160092168 A1) and Guardia et al. (“FPGA implementation of a binary32 floating point cube root”).
Regarding Claim 5,
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. teaches the method of claim 4
Rao et al. further teaches wherein the calculating of the second mantissa value further comprises (Rao et al., pg. 720 Section III.B, “While adding the two floating point numbers the smallest number is to be identified so that eight bit subtractor is required for the exponent. Similarly, it requires one 2×1 multiplexer to select the input data depending upon the status of the select line; one 32 bit swap register to swap the smallest mantissa with the highest mantissa; one 16 bit register to shift right the smallest mantissa to the right side bit by bit to make equal the smallest mantissa to the biggest mantissa value depending upon the difference value of the two exponents” teaches determining the highest mantissa (corresponds to the second mantissa) with respect to the smallest mantissa (corresponds to the one mantissa) based upon the value of the two exponents). 
Lin-1 in view of Lin-2 in view of Rao et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1 and Lin-2 with Rao et al., with motivation wherein the calculating of the second mantissa value further comprises. “Hence, to improve the performance of the traditional fixed point MAC, in this work we implemented the standard floating point MAC using IEEE 754 floating point adder. This can be used to design all floating point DSP processors through the standard floating point MAC” (Rao et al., Abstract). The proposed teaching is beneficial in that it helps improve the performance of the traditional MAC.
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. does not appear to explicitly teach shifting the updated first mantissa value to the right by a value obtained by subtracting the second exponent value from a predetermined number determined based on a type of a floating point-format when it is determined that the second exponent value is less than the number of bits of the first mantissa value, in order to determine whether to round off the fixed point, calculating the second mantissa value by determining whether to round off the fixed point by shifting the shifted first mantissa value to the right by 1 one more time and adding the extracted LSB value and wherein the LSB value is a factor that determines whether to round off the fixed point.
However, Lutz et al., teaches shifting the updated first mantissa value to the right by a value obtained by subtracting the second exponent value from a predetermined number determined based on a type of a floating point-format when it is determined that the second exponent value is less than the number of bits of the first mantissa value, in order to determine whether to round off the fixed point (Lutz et al., Para. [0028], “The exponent is biased, which means that the true exponent differs from the one stored in the number. For example, biased SP exponents are 8-bits long and range from 0 to 255. Exponents 0 and 255 are special cases, but all other exponents have bias 127, meaning that the true exponent is 127 less than the biased exponent. The smallest biased exponent is 1, which corresponds to a true exponent of −126. The maximum biased exponent is 254, which corresponds to a true exponent of 127. HP and DP exponents work the same way, with the biases indicated in the table above” teaches determining when true exponent (corresponds to the second exponent value) being less than the number of bits of the biased exponent (corresponds to the first mantissa value)).
… calculating the second mantissa value by determining whether to round off the fixed point by shifting the shifted first mantissa value to the right by 1 one more time and adding the extracted LSB value (Lutz et al., Para [0045], “If we convert an FP number to integer or fixed-point we also have to round. The concept is basically the same as FP rounding” teaches rounding off the fixed-point number. Fig. 2-3 and Para. [0048], “The first floating-point value is placed in a register 22. A 3-input multiplexer selects the appropriate 64 bits to be input to the right shifter 12, according to one of the formats shown in FIG. 3” teaches a right shifter (corresponds to shifting value to the right by 1 or more) that shifts the shifted first floating point-value (corresponds to the shifted first mantissa value). Fig. 8 and Para [0076], “FIG. 8 is a flow diagram showing how to determine the rounding increment at step 130 of FIG. 6. At step 200 the control circuitry 14 obtains the least significant bit L0, guard bit G0 and sticky bit S0 from the shifter 12. At step 202 it is determined whether the conversion is to an integer or fixed-point format and the first value is negative. If the second value is a floating-point value or the value is positive, then the least, guard and sticky bits L, G, S used for the rounding are the same as the bits L0, G0, S0 generated by the shifter 12(step 204)” teaches obtaining the least significant bit (LSB) to determine conversion to an integer or fixed-point format (corresponds to adding the obtained LSB)).
wherein the LSB value is a factor that determines whether to round off the fixed point (Lutz et al., Fig. 8 and Para [0076], “FIG. 8 is a flow diagram showing how to determine the rounding increment at step 130 of FIG. 6. At step 200 the control circuitry 14 obtains the least significant bit L0, guard bit G0 and sticky bit S0 from the shifter 12. At step 202 it is determined whether the conversion is to an integer or fixed-point format and the first value is negative. If the second value is a floating-point value or the value is positive, then the least, guard and sticky bits L, G, S used for the rounding are the same as the bits L0, G0, S0 generated by the shifter 12(step 204). If the conversion is to an integer or fixed-point value and the first value is negative, then at step 206 the bits LGS used for the rounding and determination are set according to: S=S0, G=(G0 ^ S0), L=(L0 ^ (G0|S0)). At step 208, the control circuitry 214 determines the rounding increment from L, G, S based on the rules set out in Table 1 above” teaches determining round increment from obtaining the least significant bit).
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. in view of Lutz et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1, Lin-2, Kum et al., and Rao et al. with Lutz et al., with motivation of shifting the updated first mantissa value to the right by a value obtained by subtracting the second exponent value from a predetermined number determined based on a type of a floating point-format when it is determined that the second exponent value is less than the number of bits of the first mantissa value, in order to determine whether to round off the fixed point, calculating the second mantissa value by determining whether to round off the fixed point by shifting the shifted first mantissa value to the right by 1 one more time and adding the extracted LSB value and wherein the LSB value is a factor that determines whether to round off the fixed point. “Also, a value may have an integer format, representing an integer value with no fractional bits, or a fixed-point format, representing a numeric value using a fixed number of integer-valued bits and a fixed number of fractional-valued bits. In an apparatus supporting more than one format, it may be desirable to convert between the different formats and so a conversion operation may be performed. The present technique seeks to provide an improved apparatus and method for converting from a floating-point value to a value of a different format” (Lutz et al., Abstract). The proposed teaching is beneficial in that it is capable of converting values to different formats so a conversion operation may be performed.
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. does not appear to explicitly teach extracting a least significant bit (LSB) value from the shifted first mantissa value
However, Guardia et al., teaches extracting a least significant bit (LSB) value from the shifted first mantissa value (Guardia et al., pg. 3-4 Section V, “The least significant bit (LSB), guard (G), round (R) and sticky (STK) bit are generated capturing the LSB and subsequent bits of Q. The 3-bit MSB of the captured data represents the LSB, G, and R respectively. The STK is obtained by means of or-chain operations of the remaining bits of the captured data… In the rounding stage, special cases as overflow and underflow are tested again. In compliance with IEEE 754–2008 standard the mantissa is defined into [1, 2]. Therefore a left-shift operation by 1-bit on Fcr is executed if its respective MSB is zero” teaches obtaining the least significant bit value from the 3-bit MSB of the capture data from the two shifted mantissa).
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. in view of Lutz et al. in view of Guardia et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1, Lin-2, Kum et al., Rao et al., and Lutz et al. with Guardia et al., with motivation of extracting a least significant bit (LSB) value from the shifted first mantissa value. “Our proposal is able to be performed up to 149 Mhz over Virtex5. The hardware cost occupies 230 Slices and 12 Dsp48s taking a latency of 19 clock cycles” (Guardia et al., Abstract). The proposed teaching is beneficial in that it improves hardware cost and latency.
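For illustration only, one possible reading of the shift-and-round limitation discussed for claim 5 is sketched below for the single-precision case with an integer target (no fractional bits). The sketch assumes that the “second exponent value” is the unbiased exponent and that the “predetermined number” is 22. The function name round_magnitude_sp is hypothetical, and the sketch returns only the rounded magnitude; it is not taken from any cited reference.

```python
import struct

FRACTION_BITS = 23   # number of bits of the first mantissa value (single precision)
BIAS = 127           # bias constant (single precision)
PREDETERMINED = 22   # assumed predetermined number for single precision

def round_magnitude_sp(x: float) -> int:
    """Round |x| to the nearest integer (ties away from zero) using shifts and one LSB."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    biased = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if biased in (0, 0xFF):
        raise ValueError("zero, subnormal, infinity and NaN are outside this sketch")
    exponent = biased - BIAS              # assumed meaning of the second exponent value
    if exponent >= FRACTION_BITS:
        raise ValueError("only the exponent < 23 branch is sketched here")
    # Update the mantissa by adding a bit value of 1 in front of it.
    mantissa = (1 << FRACTION_BITS) | fraction
    # Shift right by (predetermined number - exponent); one fractional bit remains.
    shifted = mantissa >> (PREDETERMINED - exponent)
    # The remaining fractional bit is the LSB that decides whether to round up.
    lsb = shifted & 1
    # Shift right by 1 one more time and add the extracted LSB.
    return (shifted >> 1) + lsb
```

Under these assumptions 2.5 rounds to 3 and 2.25 rounds to 2, because the extracted LSB is the half-unit bit of the magnitude.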
Regarding Claim 6,
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. in view of Guardia et al. in view of Lutz et al. teaches the method of claim 5, wherein the quantizing comprises: 
Lutz et al. further teaches tuning a number of bits of the calculated second mantissa value to be equal to a number of bits of the fixed-point format (Lutz et al., Fig. 4 and Para [0049], “FIG. 4 shows different examples of second value formats that can be generated by the conversion circuitry. It will be appreciated that other formats could also be supported. The top two rows of FIG. 4 show examples where the second value is another floating-point value with a smaller significand than the first value (e.g. single or half precision compared to double or single precision for the first value). The last three examples show 64-bit, 32-bit and 16-bit fixed-point or integer values. If a fixed-point value is to be generated, a radix position parameter 24 is input to the first adder 10 as shown in FIG. 2, to indicate the number of fractional bits in the fixed-point value” teaches the second value (corresponds to second mantissa value) having the same bit as the fixed-point value).
quantizing the parameter in the floating-point format to the fixed-point format by applying the extracted sign to the tuned second mantissa value (Lutz et al., Para. [0021], “In this case, then the conversion circuitry may have shift control circuitry which determines the shift amount based on at least one control parameter which specifies one or both of the formats of the first and second values. For example, a conversion instruction which triggers the conversion circuitry to perform the conversion operation may specify the at least one control parameter for controlling the shift control circuitry to determine the appropriate shift amount” teaches the parameter specifying the format of the first and second value. Para. [0023], “The conversion circuitry may comprise inverting circuitry to invert the significand of the first floating-point value or the output of the shift circuitry if the first floating-point value represents a negative value and the second value is a fixed-point or integer value. Floating-point values are represented using sign-magnitude representation, while fixed-point or integer values are represented using two's complement representation. Therefore, when converting between floating-point values and fixed-point or integer values, an inversion may be applied to preserve the sign of the value” teaches converting the floating point value to fixed-point or integer value (corresponds to quantization) by applying the preserved sign of the value to the second value (corresponds to the tuned second mantissa value)).
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. in view of Lutz et al. in view of Guardia et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1, Lin-2, Kum et al., Rao et al., and Lutz et al. with Guardia et al., with motivation of tuning a number of bits of the calculated second mantissa value to be equal to a number of bits of the fixed-point format and quantizing the parameter in the floating-point format to the fixed-point format by applying the extracted sign to the tuned second mantissa value. “Also, a value may have an integer format, representing an integer value with no fractional bits, or a fixed-point format, representing a numeric value using a fixed number of integer-valued bits and a fixed number of fractional-valued bits. In an apparatus supporting more than one format, it may be desirable to convert between the different formats and so a conversion operation may be performed. The present technique seeks to provide an improved apparatus and method for converting from a floating-point value to a value of a different format” (Lutz et al., Abstract). The proposed teaching is beneficial in that it is capable of converting values to different formats so a conversion operation may be performed.
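For illustration only, the sign handling described in Lutz et al., Para. [0023] (floating-point values use sign-magnitude representation, fixed-point or integer values use two's complement) can be sketched as below. The function name to_twos_complement is hypothetical. The saturation on overflow is an assumption added here to keep the result within the target width; it is not stated in the quoted passages.

```python
def to_twos_complement(sign: int, magnitude: int, width: int) -> int:
    """Encode a (sign, magnitude) pair as a width-bit two's complement bit pattern.

    Assumption: out-of-range magnitudes saturate to the nearest representable value.
    """
    mask = (1 << width) - 1
    if sign:
        # The most negative representable value has magnitude 2**(width - 1).
        magnitude = min(magnitude, 1 << (width - 1))
        # Negate, then keep only the low `width` bits to obtain the bit pattern.
        return (-magnitude) & mask
    # The largest positive representable value is 2**(width - 1) - 1.
    return min(magnitude, (1 << (width - 1)) - 1)
```

For an 8-bit target, a magnitude of 3 with the sign bit set is encoded as 0xFD, the two's complement pattern for -3.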
Regarding Claim 8,
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. in view of Lutz et al. in view of Guardia et al. teaches the method of claim 5, wherein 

Lutz et al. further teaches when the floating-point format is a single-precision floating-point format, the bias constant is a decimal number of 127, the number of bits of the first mantissa value is a decimal number of 23, and the predetermined number is a decimal number of 22 (Lutz et al., Para. [0027]: 
    [Image: table of floating-point formats and exponent biases, Lutz et al., Para. [0027]]

teaches for the SP (corresponds to single-precision floating-point format), the bias constant is 127, number of bits of the first mantissa value is 23 bits, and the predetermined number is 22).
when the floating-point format is a double-precision floating-point format, the bias constant is a decimal number of 1023, the number of bits of the first mantissa value is a decimal number of 52, and the predetermined number is a decimal number of 51 (Lutz et al., Para. [0027], teaches for the DP (corresponds to double-precision floating-point format), the bias constant is 1023, number of bits of the first mantissa value is 52 bits, and the predetermined number is 51).
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. in view of Lutz et al. in view of Guardia et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1, Lin-2, Kum et al., Rao et al., and Lutz et al. with Guardia et al., with motivation of when the floating-point format is a single-precision floating-point format, the bias constant is a decimal number of 127, the number of bits of the first mantissa value is a decimal number of 23, and the predetermined number is a decimal number of 22 and when the floating-point format is a double-precision floating-point format, the bias constant is a decimal number of 1023, the number of bits of the first mantissa value is a decimal number of 52, and the predetermined number is a decimal number of 51. “Also, a value may have an integer format, representing an integer value with no fractional bits, or a fixed-point format, representing a numeric value using a fixed number of integer-valued bits and a fixed number of fractional-valued bits. In an apparatus supporting more than one format, it may be desirable to convert between the different formats and so a conversion operation may be performed. The present technique seeks to provide an improved apparatus and method for converting from a floating-point value to a value of a different format” (Lutz et al., Abstract). The proposed teaching is beneficial in that it is capable of converting values to different formats so a conversion operation may be performed.
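For illustration only, the constants recited for claim 8 can be tabulated as below. The class name FloatFormat and its fields are hypothetical. The exponent widths (8 and 11 bits) are the standard IEEE 754 widths. Deriving the predetermined number as one less than the mantissa width is an inference from the recited pairs (23 and 22, 52 and 51), not a statement found in the cited references.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FloatFormat:
    name: str
    exponent_bits: int   # width of the stored (biased) exponent field
    fraction_bits: int   # number of bits of the first mantissa value

    @property
    def bias(self) -> int:
        # IEEE 754 bias: 2**(exponent_bits - 1) - 1
        return (1 << (self.exponent_bits - 1)) - 1

    @property
    def predetermined(self) -> int:
        # Assumption: predetermined number = mantissa width - 1
        return self.fraction_bits - 1

SP = FloatFormat("single-precision", 8, 23)
DP = FloatFormat("double-precision", 11, 52)
```

With these definitions SP yields a bias of 127 and a predetermined number of 22, and DP yields 1023 and 51, matching the recited values.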
Regarding Claim 14,
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. teaches the neural network apparatus of claim 13
Lin-1 further teaches wherein the processor is further configured to (Lin-1, Col. 2 Lines 33-36, “An apparatus for quantizing a floating point machine learning network to obtain a fixed point machine learning network using a quantizer may include a memory unit and at least one processor coupled to the memory unit” teaches at least one processor).
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. does not appear to explicitly teach shift the updated first mantissa value to the right by a value obtained by subtracting the second exponent value from a predetermined number determined depending on the type of a floating-point format when it is determined that the second exponent value is less than the number of bits of the first mantissa value, in order to determine whether to round off the fixed point, calculate the second mantissa value by determining whether to round off the fixed point by shifting the shifted first mantissa value to the right by 1 one more time and adding the extracted LSB value and wherein the LSB value is a factor that determines whether to round off the fixed point.
However, Lutz et al., teaches shift the updated first mantissa value to the right by a value obtained by subtracting the second exponent value from a predetermined number determined depending on the type of a floating-point format when it is determined that the second exponent value is less than the number of bits of the first mantissa value, in order to determine whether to round off the fixed point (Lutz et al., Para. [0028], “The exponent is biased, which means that the true exponent differs from the one stored in the number. For example, biased SP exponents are 8-bits long and range from 0 to 255. Exponents 0 and 255 are special cases, but all other exponents have bias 127, meaning that the true exponent is 127 less than the biased exponent. The smallest biased exponent is 1, which corresponds to a true exponent of −126. The maximum biased exponent is 254, which corresponds to a true exponent of 127. HP and DP exponents work the same way, with the biases indicated in the table above” teaches determining when true exponent (corresponds to the second exponent value) being less than the number of bits of the biased exponent (corresponds to the first mantissa value)).
… calculate the second mantissa value by determining whether to round off the fixed point by shifting the shifted first mantissa value to the right by 1 one more time and adding the extracted LSB value (Lutz et al., Para [0045], “If we convert an FP number to integer or fixed-point we also have to round. The concept is basically the same as FP rounding” teaches rounding off the fixed-point number. Fig. 2-3 and Para. [0048], “The first floating-point value is placed in a register 22. A 3-input multiplexer selects the appropriate 64 bits to be input to the right shifter 12, according to one of the formats shown in FIG. 3” teaches a right shifter (corresponds to shifting value to the right by 1 or more) that shifts the shifted first floating point-value (corresponds to the shifted first mantissa value). Fig. 8 and Para [0076], “FIG. 8 is a flow diagram showing how to determine the rounding increment at step 130 of FIG. 6. At step 200 the control circuitry 14 obtains the least significant bit L0, guard bit G0 and sticky bit S0 from the shifter 12. At step 202 it is determined whether the conversion is to an integer or fixed-point format and the first value is negative. If the second value is a floating-point value or the value is positive, then the least, guard and sticky bits L, G, S used for the rounding are the same as the bits L0, G0, S0 generated by the shifter 12(step 204)” teaches obtaining the least significant bit (LSB) to determine conversion to an integer or fixed-point format (corresponds to adding the obtained LSB)).
wherein the LSB value is a factor that determines whether to round off the fixed point (Lutz et al., Fig. 8 and Para [0076], “FIG. 8 is a flow diagram showing how to determine the rounding increment at step 130 of FIG. 6. At step 200 the control circuitry 14 obtains the least significant bit L0, guard bit G0 and sticky bit S0 from the shifter 12. At step 202 it is determined whether the conversion is to an integer or fixed-point format and the first value is negative. If the second value is a floating-point value or the value is positive, then the least, guard and sticky bits L, G, S used for the rounding are the same as the bits L0, G0, S0 generated by the shifter 12(step 204). If the conversion is to an integer or fixed-point value and the first value is negative, then at step 206 the bits LGS used for the rounding and determination are set according to: S=S0, G=(G0 ^ S0), L=(L0 ^ (G0|S0)). At step 208, the control circuitry 214 determines the rounding increment from L, G, S based on the rules set out in Table 1 above” teaches determining round increment from obtaining the least significant bit).
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. in view of Lutz et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1, Lin-2, Kum et al., and Rao et al. with Lutz et al., with motivation to shift the updated first mantissa value to the right by a value obtained by subtracting the second exponent value from a predetermined number determined depending on the type of a floating-point format when it is determined that the second exponent value is less than the number of bits of the first mantissa value, in order to determine whether to round off the fixed point, calculate the second mantissa value by determining whether to round off the fixed point by shifting the shifted first mantissa value to the right by 1 one more time and adding the extracted LSB value and wherein the LSB value is a factor that determines whether to round off the fixed point. “Also, a value may have an integer format, representing an integer value with no fractional bits, or a fixed-point format, representing a numeric value using a fixed number of integer-valued bits and a fixed number of fractional-valued bits. In an apparatus supporting more than one format, it may be desirable to convert between the different formats and so a conversion operation may be performed. The present technique seeks to provide an improved apparatus and method for converting from a floating-point value to a value of a different format” (Lutz et al., Abstract). The proposed teaching is beneficial in that it is capable of converting values to different formats so a conversion operation may be performed.
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. in view of Lutz et al. does not appear to explicitly teach extract a least significant bit (LSB) value from the shifted first mantissa value
However, Guardia et al., teaches extract a least significant bit (LSB) value from the shifted first mantissa value (Guardia et al., pg. 3-4 Section V, “The least significant bit (LSB), guard (G), round (R) and sticky (STK) bit are generated capturing the LSB and subsequent bits of Q. The 3-bit MSB of the captured data represents the LSB, G, and R respectively. The STK is obtained by means of or-chain operations of the remaining bits of the captured data… In the rounding stage, special cases as overflow and underflow are tested again. In compliance with IEEE 754–2008 standard the mantissa is defined into [1, 2]. Therefore a left-shift operation by 1-bit on Fcr is executed if its respective MSB is zero” teaches obtaining the least significant bit value from the 3-bit MSB of the capture data from the two shifted mantissa).
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. in view of Lutz et al. in view of Guardia et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1, Lin-2, Kum et al., Rao et al., and Lutz et al. with Guardia et al., with motivation to extract a least significant bit (LSB) value from the shifted first mantissa value. “Our proposal is able to be performed up to 149 Mhz over Virtex5. The hardware cost occupies 230 Slices and 12 Dsp48s taking a latency of 19 clock cycles” (Guardia et al., Abstract). The proposed teaching is beneficial in that it improves hardware cost and latency.
Regarding Claim 15,
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. in view of Lutz et al. in view of Guardia et al. teaches the neural network apparatus of claim 14
Lin-1 further teaches wherein the processor is further configured to (Lin-1, Col. 2 Lines 33-36, “An apparatus for quantizing a floating point machine learning network to obtain a fixed point machine learning network using a quantizer may include a memory unit and at least one processor coupled to the memory unit” teaches at least one processor).
Lutz et al. further teaches tune a number of bits of the calculated second mantissa value to be equal to a number of bits of the fixed-point format (Lutz et al., Fig. 4 and Para [0049], “FIG. 4 shows different examples of second value formats that can be generated by the conversion circuitry. It will be appreciated that other formats could also be supported. The top two rows of FIG. 4 show examples where the second value is another floating-point value with a smaller significand than the first value (e.g. single or half precision compared to double or single precision for the first value). The last three examples show 64-bit, 32-bit and 16-bit fixed-point or integer values. If a fixed-point value is to be generated, a radix position parameter 24 is input to the first adder 10 as shown in FIG. 2, to indicate the number of fractional bits in the fixed-point value” teaches the second value (corresponds to second mantissa value) having the same bit as the fixed-point value).
quantize the parameter in the floating-point format to the fixed-point format by applying the extracted sign to the tuned second mantissa value (Lutz et al., Para. [0021], “In this case, then the conversion circuitry may have shift control circuitry which determines the shift amount based on at least one control parameter which specifies one or both of the formats of the first and second values. For example, a conversion instruction which triggers the conversion circuitry to perform the conversion operation may specify the at least one control parameter for controlling the shift control circuitry to determine the appropriate shift amount” teaches the parameter specifying the format of the first and second value. Para. [0023], “The conversion circuitry may comprise inverting circuitry to invert the significand of the first floating-point value or the output of the shift circuitry if the first floating-point value represents a negative value and the second value is a fixed-point or integer value. Floating-point values are represented using sign-magnitude representation, while fixed-point or integer values are represented using two's complement representation. Therefore, when converting between floating-point values and fixed-point or integer values, an inversion may be applied to preserve the sign of the value” teaches converting the floating point value to fixed-point or integer value (corresponds to quantization) by applying the preserved sign of the value to the second value (corresponds to the tuned second mantissa value)).
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. in view of Lutz et al. in view of Guardia et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1, Lin-2, Kum et al., Rao et al., and Lutz et al. with Guardia et al., with motivation to tune a number of bits of the calculated second mantissa value to be equal to a number of bits of the fixed-point format and quantize the parameter in the floating-point format to the fixed-point format by applying the extracted sign to the tuned second mantissa value. “Also, a value may have an integer format, representing an integer value with no fractional bits, or a fixed-point format, representing a numeric value using a fixed number of integer-valued bits and a fixed number of fractional-valued bits. In an apparatus supporting more than one format, it may be desirable to convert between the different formats and so a conversion operation may be performed. The present technique seeks to provide an improved apparatus and method for converting from a floating-point value to a value of a different format” (Lutz et al., Abstract). The proposed teaching is beneficial in that it is capable of converting values to different formats so a conversion operation may be performed.
Regarding Claim 17,
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. in view of Lutz et al. in view of Guardia et al. teaches the neural network apparatus of claim 14, wherein 
when the floating-point format is a single-precision floating-point format, the bias constant is a decimal number of 127, the number of bits of the first mantissa value is a decimal number of 23, and the predetermined number is a decimal number of 22 (Lutz et al., Para. [0027]: 
[Image media_image1.png: greyscale figure reproduced from Lutz et al., Para. [0027]]
teaches for the SP (corresponds to single-precision floating-point format), the bias constant is 127, number of bits of the first mantissa value is 23 bits, and the predetermined number is 22).
when the floating-point format is a double-precision floating-point format, the bias constant is a decimal number of 1023, the number of bits of the first mantissa value is a decimal number of 52, and the predetermined number is a decimal number of 51 (Lutz et al., Para. [0027], teaches for the DP (corresponds to double-precision floating-point format), the bias constant is 1023, number of bits of the first mantissa value is 52 bits, and the predetermined number is 51).
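For illustration only, the bias constants and mantissa widths recited above match the IEEE 754 single-precision and double-precision layouts. The following is a minimal sketch, not the claimed method or the Lutz et al. apparatus, that separates a value into its sign, unbiased exponent, and stored mantissa field. The names `FORMATS` and `decompose` are hypothetical, and the sketch assumes a normal (non-zero, non-subnormal, finite) input.

```python
import struct

# Field widths and exponent bias for the two formats named in the claim.
# The dictionary layout and all names here are illustrative assumptions.
FORMATS = {
    "single": {"pack": ">f", "unpack": ">I", "exp_bits": 8, "man_bits": 23, "bias": 127},
    "double": {"pack": ">d", "unpack": ">Q", "exp_bits": 11, "man_bits": 52, "bias": 1023},
}


def decompose(value, fmt="single"):
    """Return (sign, unbiased_exponent, mantissa_field) for a normal value."""
    f = FORMATS[fmt]
    # Reinterpret the floating-point encoding as an unsigned integer.
    (bits,) = struct.unpack(f["unpack"], struct.pack(f["pack"], value))
    mantissa = bits & ((1 << f["man_bits"]) - 1)
    biased_exp = (bits >> f["man_bits"]) & ((1 << f["exp_bits"]) - 1)
    sign = bits >> (f["man_bits"] + f["exp_bits"])
    # Subtracting the bias constant recovers the true power of two.
    return sign, biased_exp - f["bias"], mantissa
```

For example, -6.5 equals -1.625 x 2^2, so the sign is 1, the unbiased exponent is 2, and the stored mantissa field holds the fraction bits 101 followed by zeros in either format.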
Lin-1 in view of Lin-2 in view of Kum et al. in view of Rao et al. in view of Lutz et al. in view of Guardia et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network” and “quantization”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Lin-1, Lin-2, Kum et al., Rao et al., and Lutz et al. with Guardia et al., with the motivation that, when the floating-point format is a single-precision floating-point format, the bias constant is a decimal number of 127, the number of bits of the first mantissa value is a decimal number of 23, and the predetermined number is a decimal number of 22, and, when the floating-point format is a double-precision floating-point format, the bias constant is a decimal number of 1023, the number of bits of the first mantissa value is a decimal number of 52, and the predetermined number is a decimal number of 51. “Also, a value may have an integer format, representing an integer value with no fractional bits, or a fixed-point format, representing a numeric value using a fixed number of integer-valued bits and a fixed number of fractional-valued bits. In an apparatus supporting more than one format, it may be desirable to convert between the different formats and so a conversion operation may be performed. The present technique seeks to provide an improved apparatus and method for converting from a floating-point value to a value of a different format” (Lutz et al., Abstract). The proposed teaching is beneficial in that it is capable of converting values to different formats so a conversion operation may be performed.

Response to Arguments
Applicant's arguments filed 3/9/2022 with respect to the 35 U.S.C. 103 rejection of claims 1-6, 8-15, and 17-19 in the previous Office Action have been fully considered but they are not persuasive.

Applicant asserts that “Accordingly, as the Office has concluded that Lin1 fails to disclose the claimed "applying a fractional length of a fixed-point format to the parameter in the floating-point format," of independent claim 1, and Lin2 discloses no more that already disclosed in Lin1,  it is respectfully submitted that no modification of Lin1 in view of Lim2 would disclose or suggest the claimed "applying a fractional length of a fixed-point format to the parameter in the floating-point format."” (Remarks, Pg. 23).
Examiner’s Response:
The Examiner respectfully disagrees. Lin-2 teaches “applying a fractional length of a fixed-point format to the parameter in the floating-point format” (Lin-2, pg. 4 Section 3.3, “Note that determining the fixed point format is equivalent to determining the resolution, which in turn means identifying the number of fractional bits it requires to represent the number. The following equations can be used to compute the number of fractional bits: • Determine the effective standard deviation of the quantity being quantized: ξ. • Calculate step size via Table 1: s = ξ · Stepsize(β). • Compute number of fractional bits: n = −[log2 s]” teaches computing the number of fractional bits (corresponds to fractional length) of a fixed-point format. Pg. 3 Section 3.3, “Any floating point DCN model can be converted to fixed point by following these steps: • Run a forward pass in floating point using a large set of typical inputs and record the activations” teaches a parameter in the floating-point format and converting from floating point to fixed point).
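For illustration only, the three quoted steps can be sketched as follows. This is a hedged sketch rather than the Lin-2 implementation: the bracket in “n = −[log2 s]” is read here as a ceiling, which is an assumption, and the step-size factor from Table 1 of Lin-2 is not reproduced in this action, so it is passed in as the hypothetical parameter `stepsize_beta`. The helper names `fractional_bits` and `quantize` are likewise hypothetical.

```python
import math


def fractional_bits(xi, stepsize_beta):
    """Number of fractional bits n for effective standard deviation xi.

    stepsize_beta stands in for the Table 1 lookup Stepsize(beta); its value
    must be supplied by the caller (assumption: it is a positive float).
    """
    s = xi * stepsize_beta            # step size: s = xi * Stepsize(beta)
    return -math.ceil(math.log2(s))   # assumed reading of n = -[log2 s]


def quantize(x, n):
    """Round x to the nearest multiple of 2**-n (fixed point, n fractional bits)."""
    scaled = math.floor(x * (1 << n) + 0.5) if n >= 0 else math.floor(x / (1 << -n) + 0.5)
    return scaled * 2.0 ** -n
```

With xi = 1.0 and a step-size factor of 0.25, the step size is 0.25 and n is 2, so a floating-point value of 0.7 is represented on a grid of quarters as 0.75.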

Applicant asserts that “Furthermore, the Office Action acknowledged that Lin1 and Lin2, either alone or in combination, fails to describe or suggest, inter alia, of "performing, by an integer arithmetic logic unit (ALU), an operation for determining whether to round off a fixed point based on a most significant bit among bit values to be discarded after a quantization process; and quantizing the parameter in the floating-point format to a parameter in the fixed-point format, based on a determination result of the operation with the ALU, for generating the neural network in a low power device using the quantized fixed-point parameter," as recited in independent claim 1 (see Office Action, page 55).” (Remarks, Pg. 24).
Examiner’s Response:
The Examiner respectfully disagrees. Kum et al. teaches “performing, by an integer arithmetic logic unit (ALU), an operation for determining whether to round off a fixed point based on a most significant bit among bit values to be discarded after a quantization process” (Kum et al., Fig. 9 and pg. 927 Section IV. B, “For a right shift, the most significant bits (MSBs) are sign extended and the MSB of truncated bits is used as the carry-in signal of the adders for rounding. For a left shift, the least significant bits (LSBs) are filled with zeros and the MSBs are thrown away, but overflows do not occur because the IWLs are carefully determined throughout the range estimation” teaches the arithmetic logic unit (see Fig. 9) and further teaches rounding of the adders based on the most significant bits and the MSBs being thrown away (corresponds to discarded) after quantization).
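For illustration only, the right-shift rounding described in the quoted passage can be sketched with integer operations alone. This is a minimal sketch under stated assumptions, not the circuit of Kum et al., Fig. 9: the function name `shift_right_round` is hypothetical, and Python's arithmetic right shift on signed integers stands in for the sign extension of the MSBs.

```python
def shift_right_round(value: int, shift: int) -> int:
    """Right-shift a signed integer, rounding with the MSB of the discarded bits.

    The most significant of the `shift` truncated bits is added back as a
    carry-in, which rounds to nearest (ties toward positive infinity).
    """
    if shift <= 0:
        raise ValueError("shift must be a positive number of bits")
    # MSB among the bits that the shift is about to discard.
    carry_in = (value >> (shift - 1)) & 1
    # Python's >> on ints is arithmetic, so negative values are sign extended.
    return (value >> shift) + carry_in
```

For example, shifting 11 (binary 1011) right by two bits discards the bits 11, whose MSB is 1, so the truncated result 2 is incremented to 3; shifting 9 (binary 1001) discards 01, whose MSB is 0, so the result stays 2.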

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Henry T Nguyen whose telephone number is (571)272-8860. The examiner can normally be reached Monday-Friday 8:00am-4:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/HENRY TRONG NGUYEN/Examiner, Art Unit 2125
/BRIAN M SMITH/Primary Examiner, Art Unit 2122