Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendment filed 2021-09-01 has been entered.  Applicant’s amendments to the Specification and Claims have overcome each and every objection and 112(b) rejection previously set forth in the Non-Final office action mailed 2021-03-31.  The status of the claims is as follows:
Claims 18-20 are cancelled.
Claims 1-17 and 21-25 are pending in the application.
Claims 1-12, 14, and 17 have been amended.
Claims 21-25 are new.
Response to Arguments
Applicant's arguments with respect to the rejections under 35 U.S.C. 103 have been fully considered but are not persuasive. Applicant argues that the combination with Frantz does not teach that the saturated mantissa value is saved or included in a particular format.  Examiner respectfully disagrees, and points out that the claim language states:  “second activation values comprising: (1) values in a second block floating-point format for all of the first activation values, and (2) outlier values having higher precision than the first floating-point format for at least one but not all of the first activation values”.  See Examiner’s diagram of Frantz’s concept below:

[Figure: Examiner’s diagram of Frantz’s concept (media_image1.png)]

As shown above, Frantz discloses a second activation value, which is a representation of the outlier value.  It is an “outlier” because it “saturates the mantissa”, thus requiring a new shared exponent.  Frantz also gives no indication that these values are not “saved”.  In fact, Frantz, Para [0007] Last Sentence, discloses that floating point representations require large memory requirements:  “Floating-point pixel-level ADC image sensors require large memory to store the data, and also require a complex image reconstruction process.”  Frantz, Para [0013], then discloses:  “Therefore, a need exists for a signal and image processor to capture and store this wide dynamic range (WDR) signal in a standard format, as well as to perform the associated signal and image processing operations efficiently.”  Here, Frantz discloses that the data in the standard format they are using will be stored, and Fig. 2 discloses Memory 225.  The partitioning of the outlier values in Frantz Para [0062-0063] is part of the process of converting to the format that will be stored.  Examiner notes that since the second activation value is the 
	Applicant has amended the claim to state “outlier values having higher precision than the first floating-point format”.  Examiner argues that the outlier value indeed has “higher precision”, as it required more bits to store under the original exponent, and also, the fact that the blocks had to be split into smaller blocks, indicates a more targeted and “precise” representation of the outlier value.
Claim Objections
Claim 1 is objected to because of the following informalities:  “floating-poing” should be changed to read “floating-point”.  Appropriate correction is required.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: "an outlier selector configured to select a shared exponent for the second values" in claim 5.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) 
Pursuant to the above, the limitation "an outlier selector configured to select a shared exponent for the second values" in Claim 5 is being interpreted under 35 U.S.C. 112(f), as “selector” is simply a substitute for “means”, as “outlier selector” does not have any generally understood structural meaning in the art (see MPEP 2181(I)(A)) and “configured to” is a linking phrase in place of “for” that makes it clear that the claim is reciting a function (see MPEP 2181(I)(B) re: “configured to”).  Accordingly, it is being interpreted as per the Specification Para [0105] Lines 2-10, which provides acts for achieving the specified function:  “The outlier quantizer 765 can include an outlier selector that determines a shared exponent for the compressed activation values. For example, the selector can identify a shared exponent by determining that at least one of a mean (average), a median, and/or a mode for at least a portion of the activation values. In some examples, the selector can identify a shared exponent by identifying a group of largest outlying values in the set of activation values and for the remaining group of activation values not identified to be in the group of the largest outliers, determining a shared exponent by selecting the largest exponent of the remaining group. In some examples, the exponent used by the largest number of activation values is selected as the shared exponent.”  
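The selection strategies quoted from Para [0105] can be sketched as follows. This is a minimal illustration under stated assumptions, not the Specification's implementation: the function names, the use of base-2 exponents obtained from `math.frexp`, and the fixed outlier count are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def exponent_of(x):
    """Base-2 exponent e from math.frexp, where x = m * 2**e and 0.5 <= |m| < 1."""
    return math.frexp(x)[1]


def shared_exponent_excluding_outliers(values, num_outliers):
    """Set aside the num_outliers largest magnitudes, then pick the largest
    exponent among the remaining values as the shared exponent.
    (Hypothetical helper; the outlier count is an assumed parameter.)"""
    if not 0 <= num_outliers < len(values):
        raise ValueError("num_outliers must leave at least one remaining value")
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    outlier_indices = sorted(order[:num_outliers])
    remaining = [values[i] for i in order[num_outliers:]]
    shared = max(exponent_of(v) for v in remaining)
    return shared, outlier_indices


def most_common_exponent(values):
    """Pick the exponent used by the largest number of values (the mode)."""
    counts = Counter(exponent_of(v) for v in values)
    return counts.most_common(1)[0][0]


activations = [0.5, 0.75, 1.5, 96.0, 0.25]
shared, outliers = shared_exponent_excluding_outliers(activations, num_outliers=1)
mode_exp = most_common_exponent(activations)
```

In this example the value 96.0 is treated as the single outlier, so the shared exponent is taken from the remaining four values rather than being driven by the largest one.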
101 Remarks
Independent claim 22, and dependent claims 23-25, recite “computer-readable storage devices or media”.  While “non-transitory” is not explicitly stated in the claims, the specification 
  
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5 and 22-25 are rejected under 35 U.S.C. 103 as being unpatentable over Drumond et al. (“Training DNNs with Hybrid Block Floating Point”; hereinafter Drumond) in view of Ling et al. (“Harnessing Numerical Flexibility for Deep Learning on FPGAs”; hereinafter Ling) and Frantz (US 2012/0262597 A1; hereinafter Frantz).
As per Claim 1, Drumond teaches a neural network comprising (Drumond, Abstract, discloses “DNNs”:  “The wide adoption of DNNs has given birth to unrelenting computing requirements, forcing datacenter operators to adopt domain-specific accelerators to train them” and Drumond, Conclusion, Ends with “Higher throughput leads to faster and more energy-efficient DNN training/inference, while model compression leads to lower bandwidth requirements for off-chip memory, lower capacity requirements for on-chip memory and lower communication bandwidth requirements for distributed training”.  Here, Drumond discloses memory and a chip (i.e., processor)).
an outlier quantizer formed from one or more processors, the outlier quantizer being in communication with a memory (Drumond, Conclusion, ends with “Higher throughput leads to faster and more energy-efficient DNN training/inference, while model compression leads to lower bandwidth requirements for off-chip memory, lower capacity requirements for on-chip memory and lower communication bandwidth requirements for distributed training”.  Here, Drumond discloses both on-chip and off-chip memory and a chip (i.e., processor).  Drumond, Sec 5.3, discloses a piece of hardware, an “accelerator”:  “HBFP accelerators exhibit arithmetic density that is similar to their fixed-point counterparts. To further illustrate this point, we synthesized a proof-of-concept FPGA-based accelerator. Figure 2 shows the block diagram of the accelerator.”  Drumond, Figure 2, discloses an “External IO interface”.  IO refers to input and output of data, therefore the accelerator must be in communication with the memory, which is where data is stored.)
and the neural network being configured to: produce first activation values in a first floating-point format (Drumond, Sec 5.1 Line 6, discloses “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative”.  Here, Drumond discloses BFP, or block floating point (i.e., a floating point format).  Drumond also discloses activations, which are in both the forward and backward pass, comprising two activation values.  As no specific definition is given for “first” or “second”, examiner is considering the activations on the backward pass as “first activation values”).
	with the outlier quantizer, convert a plurality of the first activation values to a second block floating-point format [with outlier values having higher precision than the first floating-point format] and thereby produce second activation values comprising: (1) values in a second block floating-point format for all of the first activation values (Drumond, Sec 5.1 Line 6, discloses “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative”.  Examiner’s Note:  Each pass of a neural network updates the activation values (i.e., converts them).  As stated above, BFP is carried out on both the backward and forward passes.  So, after a backward pass, the activation is converted into a second BFP format on the next forward pass.  Note that Drumond converts them to FP in the interim, but the net result is the same, a conversion of the activation value from the first to the second BFP format)
However, Drumond does not explicitly teach store the second activation values in the memory; outlier values having higher precision than the first floating-point format
Ling teaches store the second activation values in the memory (Ling, Sec 2.2 Last Paragraph, discloses:  “In addition to implementing dot-products in block floating point form, we also can store the data in block floating point form.”  Here, Ling discloses calculating dot products with block floating point form. The dot product calculation in a neural network comprises activations, therefore Ling discloses converting activation values to block floating point form (i.e., second activation values).   Finally, Ling discloses storing the data, which includes activation values, in block floating point form (i.e., storing the second activation values in memory).  Ling, Table 3, discloses “Block size vs no blocking used to store weights and intermediate data (feature maps)”, wherein feature maps are also known in the art as “activation maps” and comprise activations.)  
Drumond and Ling are analogous art because they are in the field of endeavor of quantizing neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond with the storing of quantized activation values of Ling. The modification would have been obvious because one of ordinary skill in the art would be motivated to reduce the memory required to store the data. (Ling, Sec 2.2 Last Paragraph, discloses:  “In addition to implementing dot-products in block floating point form, we also can store the data in block floating point form. This can lead to a significant reduction in both memory bandwidth to fetch data, and memory capacity to store the data either on or off chip.”)
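The memory reduction referred to in the Ling passage can be illustrated with simple arithmetic. The sketch below is an assumption-laden example rather than Ling's method: the bit widths (8-bit mantissas, 8-bit shared exponents, 32-bit floating point as the baseline) and the block size are hypothetical values chosen only to show the calculation.

```python
import math


def fp32_bits(num_values):
    """Storage for num_values in 32-bit floating point."""
    return 32 * num_values


def bfp_bits(num_values, block_size, mantissa_bits=8, exponent_bits=8):
    """Storage for block floating point: one mantissa per value plus one
    shared exponent per block (all widths are assumed for illustration)."""
    num_blocks = math.ceil(num_values / block_size)
    return num_values * mantissa_bits + num_blocks * exponent_bits


baseline = fp32_bits(1024)
compressed = bfp_bits(1024, block_size=16)
ratio = compressed / baseline
```

Under these assumed widths, the shared exponent adds only a small per-block overhead, so the stored size is dominated by the narrower mantissas.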
Drumond and Ling thus far fail to explicitly teach with the outlier quantizer, convert a plurality of the first activation values to a second block floating-point format with outlier values having higher precision than the first floating-point format and thereby produce second activation values comprising (2) outlier values having higher precision than the first floating-point format for at least one but not all of the first activation values, and store the second activation values and the outlier values in the memory.
Frantz teaches with the outlier quantizer, convert a plurality of the first activation values to a second block floating-point format with outlier values having higher precision than the first floating-point format and thereby produce second activation values comprising (2) outlier values having higher precision than the first floating-point format for at least one but not all of the first activation values, and store the second activation values and the outlier values in the memory. (Frantz, Para [0062-0063], discloses “Each block is assigned an exponent value depending on the mean brightness of that block. Then each pixel is assigned a relative mantissa value that represents its brightness compared to the mean brightness. For example, if one were to take a picture of a person standing in front of a window, the blocks would be such that the person's body would be represented by a set of blocks with the same exponent value (or similar), and the window would be represented by a set of blocks with another exponent value. In FIG. 5 and FIG. 6, it is evident how the blocks might be created for the picture shown. The very bright areas would have high exponent values, and the dark areas would have low exponent values. In the example, square blocks of pixels are used, but any shape of the blocks would be appropriate--even non-rectangular. 
The ultimate solution is that the mantissas for all of the pixels in the scene create a normalized picture, and the exponent values indicate the variation of brightness over the whole scene.  If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.”  Note Examiner’s diagram of Frantz below:

[Figure: Examiner’s diagram of Frantz (media_image2.png)]

Here, Frantz discloses a method of conversion to block floating point format, wherein some of the values “saturate the mantissa” (i.e., are outliers).  These are stored in a second block floating point format with second shared exponents and activation values, which represent the outlier values.  The outlier value indeed has “higher precision”, as it required more bits to store under the original exponent, and also, the fact that the blocks had to be split into smaller blocks, indicates a more targeted and “precise” representation of the outlier value.)
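The block-splitting behavior described in Frantz, Para [0062-0063], can be sketched as follows. This is an illustrative reading only, not Frantz's implementation: the 4-bit mantissa width, the derivation of the exponent from the block mean via `math.frexp`, the square power-of-two block sizes, and all function names are assumptions made for the example.

```python
import math

MANTISSA_BITS = 4  # assumed width, chosen only for illustration


def encode_block(pixels):
    """Assign one exponent from the block mean and a mantissa per pixel.
    Returns (exponent, mantissas, ok); ok is False when any mantissa
    saturates or becomes zero."""
    mean = sum(pixels) / len(pixels)
    exponent = math.frexp(mean)[1]
    scale = 2.0 ** (exponent - (MANTISSA_BITS - 1))
    max_mantissa = 2 ** MANTISSA_BITS - 1
    raw = [round(p / scale) for p in pixels]
    ok = all(0 < m <= max_mantissa for m in raw)
    return exponent, [min(m, max_mantissa) for m in raw], ok


def split_blocks(image, top, left, size, min_size=1):
    """Recursively split a square block into four equal blocks while any
    pixel saturates the mantissa or maps to zero."""
    pixels = [image[r][c]
              for r in range(top, top + size)
              for c in range(left, left + size)]
    exponent, _, ok = encode_block(pixels)
    if ok or size <= min_size:
        return [(top, left, size, exponent)]
    half = size // 2
    blocks = []
    for dr in (0, half):
        for dc in (0, half):
            blocks += split_blocks(image, top + dr, left + dc, half, min_size)
    return blocks


# A bright 2x2 region (a "window") in an otherwise dim 4x4 scene.
image = [
    [200, 200, 3, 3],
    [200, 200, 3, 3],
    [3, 3, 3, 3],
    [3, 3, 3, 3],
]
blocks = split_blocks(image, 0, 0, 4)
```

With a single exponent for the whole scene, the bright pixels saturate and the dim pixels round to zero, so the block is split into four; the bright quadrant then receives a higher exponent than the three dim quadrants.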
store the outlier values in the memory (Frantz, Para [0007] Last Sentence, discloses that floating point representations require large memory requirements:  “Floating-point pixel-level ADC image sensors require large memory to store the data, and also require a complex image reconstruction process.”  Frantz, Para [0013], then discloses:  “Therefore, a need exists for a signal and image processor to capture and store this wide dynamic range (WDR) signal in a standard format, as well as to perform the associated signal and image processing operations efficiently.”  Here, Frantz discloses that the data in the standard format they are using will be stored, and Fig. 2 discloses Memory 225.  The partitioning of the outlier values in Frantz Para [0062-0063] is part of the process of converting to the format that will be stored.)
	Drumond and Frantz are analogous art because they are in the field of endeavor of efficient floating point operations. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, with the overflow storage of Frantz. The modification would have been obvious because one of ordinary skill in the art would be motivated to avoid losing data. (Drumond, Section 4, discloses: “In this example, BFP can only represent a accurately if the value distribution of a is not too wide to be captured by ma and the exponent ea is representative of said value distribution. If ea is too large then small values are lost and the most significant bits of the mantissas are wasted. If ea is too small, then the larger values in a will be saturated, leading to data loss.”)
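The trade-off described in the quoted passage of Drumond, Section 4, can be illustrated with a short sketch. It is a simplified model under stated assumptions, not Drumond's code: the signed 8-bit mantissa, the clamping behavior on overflow, and the helper names `to_bfp` and `from_bfp` are hypothetical.

```python
def to_bfp(values, shared_exp, mantissa_bits=8):
    """Quantize values to signed integer mantissas that share one exponent.
    Mantissas outside the representable range are clamped (saturated)."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    max_mantissa = 2 ** (mantissa_bits - 1) - 1
    return [max(-max_mantissa, min(max_mantissa, round(v / scale)))
            for v in values]


def from_bfp(mantissas, shared_exp, mantissa_bits=8):
    """Reconstruct approximate values from mantissas and the shared exponent."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]


values = [0.01, 1.0, 100.0]

# Exponent sized for the largest value: the smallest value is lost.
large_exp = from_bfp(to_bfp(values, shared_exp=7), shared_exp=7)

# Exponent sized for the mid-range value: the largest value saturates.
small_exp = from_bfp(to_bfp(values, shared_exp=1), shared_exp=1)
```

With the larger exponent, 0.01 rounds to zero; with the smaller exponent, 100.0 is clamped to the largest representable mantissa, which is the data loss the quoted passage describes.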

	As per Claim 2, the combination of Drumond, Ling, and Frantz teaches the neural network of claim 1 as shown above, as well as wherein the first floating-point format is a block floating point format. (Drumond, Section 3 Last Paragraph, discloses “Given these requirements, we identify block floating-point (BFP) as the ideal numeric representation for DNNs. BFP represents numbers with a mantissa and exponent, like floating-point, but exponents are shared across entire tensors, as shown in Figure 1, resulting in dot products that can be computed entirely in fixed-point logic.”  Drumond, Section 5.1, discloses that this format is used in both the backward (i.e., first) and forward pass:  “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. Then we execute the target operation in native floating-point arithmetic. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative.”)

	As per Claim 3, the combination of Drumond, Ling, and Frantz teaches the neural network of claim 1 as shown above, as well as wherein each of the second activation values comprises a mantissa having fewer bits than its respective mantissa in the first floating-point format. (Drumond, Section 2 Para 2, discloses:  “Quantization [8] is a widely used technique for DNN inference. BFP [9] has also been proposed for inference. These techniques quantize the weights of DNNs trained with full precision floating point to use fixed-point logic during inference.  We consider the more challenging task of training DNNs with arithmetic density that matches quantized inference.”  Here, Drumond discloses that their work is an improvement on the known technique of only using BFP during inference (the forward pass, i.e., the second activation values).  In this case, the backward pass (i.e., first activation values), would be in FP.  Drumond, Sec 3 Top of Page 4, discloses:  “FP32 representations are easy to use but inefficient. They represent numbers with a 24-bit mantissa and a 8-bit exponent. In terms of precision, the 24-bit mantissa is an overkill for DNNs. Table 1 shows the validation error obtained when training ResNet-20 models on CIFAR10 using floating-point representations with various mantissas and exponent widths. We observed convergence without loss of precision with 8-bit mantissas, convergence with a small loss of precision with 4-bit mantissas, and divergence only when using 2-bit mantissas.”  Here, Drumond discloses that standard floating point format, FP32, has a 24-bit mantissa.  Drumond then discloses that in their method, which uses BFP, they use an 8-bit mantissa.  Therefore, Drumond’s disclosure of only using BFP during inference, amounts to second activation values comprising a mantissa having fewer bits than its respective mantissa in the first floating-point format.    Examiner’s Note:  Drumond states “quantize the weights…to use fixed-point logic during inference”.  In order to “use fixed-point logic” for dot product operations, the activation values must also be quantized, not only the weights.)

As per Claim 4, the combination of Drumond, Ling, and Frantz teaches the neural network of claim 1 as shown above, as well as wherein each of the outlier values for the at least one but not all of the first activation values comprises a respective outlier exponent. (Frantz, Para [0062-0063], discloses “Each block is assigned an exponent value depending on the mean brightness of that block. Then each pixel is assigned a relative mantissa value that represents its brightness compared to the mean brightness. For example, if one were to take a picture of a person standing in front of a window, the blocks would be such that the person's body would be represented by a set of blocks with the same exponent value (or similar), and the window would be represented by a set of blocks with another exponent value. In FIG. 5 and FIG. 6, it is evident how the blocks might be created for the picture shown. The very bright areas would have high exponent values, and the dark areas would have low exponent values. In the example, square blocks of pixels are used, but any shape of the blocks would be appropriate--even non-rectangular. The ultimate solution is that the mantissas for all of the pixels in the scene create a normalized picture, and the exponent values indicate the variation of brightness over the whole scene.  If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.”  Here, Frantz discloses that some of the block floating point values, which comprise exponents, saturate the mantissa (I.e., are outliers).  Note Examiner’s diagram of Frantz below:

[Figure: Examiner’s diagram of Frantz (media_image2.png)]

Here, Frantz discloses a method of conversion to block floating point format, wherein some of the values “saturate the mantissa” (i.e., are outliers).  These are stored in a second block floating point format with second shared exponents and activation values, which represent the outlier values.  The outlier value indeed has “higher precision”, as it required more bits to store under the original exponent, and also, the fact that the blocks had to be split into smaller blocks, indicates a more targeted and “precise” representation of the outlier value.)

As per Claim 5, the combination of Drumond, Ling, and Frantz teaches the neural network of claim 1 as shown above, as well as wherein the outlier quantizer comprises: an outlier selector configured to select a shared exponent for the values in the second block floating-point format.  (Drumond, Section 3 Last Paragraph Lines 3-4, discloses a shared exponent:  “However, BFP logic is denser because exponents are shared across entire tensors, resulting in dot products that can be computed entirely in fixed-point logic.”  Drumond, Section 2 Last Paragraph Lines 8-10, discloses “Our approach computes exponents more frequently and it does so in-device, without requiring any additional stat collection, and accommodating dynamic dataflows naturally.”  Here, Drumond discloses, an in-device (i.e., in the hardware, an outlier selector in the outlier quantizer) selector that selects a shared exponent.  Note that in the 112(f) analysis above, examiner is interpreting “outlier selector” as a device that, per [0105], “For example, the selector can identify a shared exponent by determining that at least one of a mean (average), a median, and/or a mode for at least a portion of the activation values.”  Frantz, Para [0062], discloses using the mean (average):  “Each block is assigned an exponent value depending on the mean brightness of that block.”)

As per Claim 22, Drumond teaches one or more computer-readable storage devices or media storing computer-executable instructions, which when executed by a computer, cause the computer to perform a method, the method comprising: 
 (Drumond, Abstract, discloses “computing requirements”:  “The wide adoption of DNNs has given birth to unrelenting computing requirements, forcing datacenter operators to adopt domain-specific accelerators to train them” and Drumond, Conclusion, Ends with “Higher throughput leads to faster and more energy-efficient DNN training/inference, while model compression leads to lower bandwidth requirements for off-chip memory, lower capacity requirements for on-chip memory and lower communication bandwidth requirements for distributed training”.  Here, Drumond discloses memory (i.e. computer-readable storage device)).
producing first activation values for a neural network in a first block floating-point format (Drumond, Sec 5.1 Line 6, discloses “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative”.  Here, Drumond discloses BFP, or block floating point (i.e., a floating point format).  Drumond also discloses activations, which are in both the forward and backward pass, comprising two activation values.  As no specific definition is given for “first” or “second”, examiner is considering the activations on the backward pass as “first activation values”).
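However, Drumond does not explicitly teach storing activation values in block floating point format; converting at least one but not all of the first activation values into a block floating-point format for outlier values, the converting resulting in outlier activation values in the block floating-point format for outlier values, the block floating-point format for outlier values being different than the first block floating-point format; and storing the outlier activation values in the block floating-point format for outlier values.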
Ling teaches storing activation values in block floating point format (Ling, Sec 2.2 Last Paragraph, discloses:  “In addition to implementing dot-products in block floating point form, we also can store the data in block floating point form.”  Here, Ling discloses calculating dot products with block floating point form. The dot product calculation in a neural network comprises activations, therefore Ling discloses converting activation values to block floating point form (i.e., second activation values).   Finally, Ling discloses storing the data, which includes activation values, in block floating point form (i.e., storing the second activation values in memory).  Ling, Table 3, discloses “Block size vs no blocking used to store weights and intermediate data (feature maps)”, wherein feature maps are also known in the art as “activation maps” and comprise activations.)   
Drumond and Ling thus far fail to teach converting at least one but not all of the first activation values into a block floating-point format for outlier values, the converting resulting in outlier activation values in the block floating-point format for outlier values, the block floating-point format for outlier values being different than the first block floating-point format; and storing the outlier activation values in the block floating-point format for outlier values.
Frantz teaches converting at least one but not all of the first activation values into a block floating-point format for outlier values, the converting resulting in outlier activation values in the block floating-point format for outlier values, the block floating-point format for outlier values being different than the first block floating-point format (Frantz, Para [0062-0063], discloses “Each block is assigned an exponent value depending on the mean brightness of that block. Then each pixel is assigned a relative mantissa value that represents its brightness compared to the mean brightness. For example, if one were to take a picture of a person standing in front of a window, the blocks would be such that the person's body would be represented by a set of blocks with the same exponent value (or similar), and the window would be represented by a set of blocks with another exponent value. In FIG. 5 and FIG. 6, it is evident how the blocks might be created for the picture shown. The very bright areas would have high exponent values, and the dark areas would have low exponent values. In the example, square blocks of pixels are used, but any shape of the blocks would be appropriate--even non-rectangular. The ultimate solution is that the mantissas for all of the pixels in the scene create a normalized picture, and the exponent values indicate the variation of brightness over the whole scene.  If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.”  Here, Frantz discloses that some values, when converted to BFP, will be outliers and need to be stored in pieces.  
They will have at least a first mantissa with a shared exponent, and a second mantissa with an outlier exponent, which, in Frantz’s case, is the same as the shared exponent, because Frantz has started a new block with its own shared exponent.  Note Examiner’s diagram of Frantz below:

[Figure: Examiner’s diagram of Frantz (media_image2.png, 616 x 1109, greyscale)]

Here, Frantz discloses a method of conversion to block floating point format, wherein some of the values “saturate the mantissa” (i.e., are outliers).  These are stored in a second block floating point format with second shared exponents and activation values, which represent the outlier values.  The outlier value indeed has “higher precision”, as it required more bits to store under the original exponent; in addition, the fact that the blocks had to be split into smaller blocks indicates a more targeted and “precise” representation of the outlier value.  This block floating-point format for outlier values is different from the first block floating-point format, as it relies on a different exponent and is based on a smaller group of values.)
and storing the outlier activation values in the block floating-point format for outlier values (Frantz, Para [0007] Last Sentence, discloses that floating point representations have large memory requirements:  “Floating-point pixel-level ADC image sensors require large memory to store the data, and also require a complex image reconstruction process.”  Frantz, Para [0013], then discloses:  “Therefore, a need exists for a signal and image processor to capture and store this wide dynamic range (WDR) signal in a standard format, as well as to perform the associated signal and image processing operations efficiently.”  Here, Frantz discloses that the data in the standard format they are using will be stored, and Fig. 2 discloses Memory 225.  The partitioning of the outlier values in Frantz Para [0062-0063] is part of the process of converting to the format that will be stored.)
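
For illustration only, the block-splitting behavior quoted from Frantz Para [0062-0063] can be sketched as follows.  This is a minimal sketch under stated assumptions, not Frantz’s implementation: the function name `encode_block`, the 4-bit mantissa width, the rule for deriving the exponent from the mean brightness, and the power-of-two block size are all hypothetical choices made for the example.

```python
import math

def encode_block(pixels, x0, y0, size, mant_bits=4, min_size=1):
    """Return a list of (x0, y0, size, exponent, mantissas) blocks.

    Assumption: the block exponent is derived from the mean brightness so
    that the mean maps to roughly the middle of the mantissa range.
    Assumption: `size` is a power of two so blocks split evenly.
    """
    vals = [pixels[y][x]
            for y in range(y0, y0 + size)
            for x in range(x0, x0 + size)]
    mean = sum(vals) / len(vals)
    exp = round(math.log2(mean)) - (mant_bits - 1) if mean > 0 else 0
    max_m = 2 ** mant_bits - 1
    mants = [round(v / 2 ** exp) for v in vals]

    # "If any pixel saturates the mantissa, or if the mantissa becomes zero"
    needs_split = any(m > max_m or m == 0 for m in mants)
    if needs_split and size > min_size:
        half = size // 2
        blocks = []
        for dy in (0, half):          # split into four equal sub-blocks
            for dx in (0, half):
                blocks += encode_block(pixels, x0 + dx, y0 + dy, half,
                                       mant_bits, min_size)
        return blocks

    # Smallest allowed block: clamp whatever still saturates.
    return [(x0, y0, size, exp, [min(m, max_m) for m in mants])]

# A 4x4 image with one bright 2x2 corner (the "window") and a dark remainder.
image = [[1000, 1000, 8, 8],
         [1000, 1000, 8, 8],
         [8, 8, 8, 8],
         [8, 8, 8, 8]]
blocks = encode_block(image, 0, 0, 4)
```

Under these assumptions the single 4x4 block both saturates (bright pixels) and underflows to zero (dark pixels), so it is split into four 2x2 blocks, and the bright block receives its own, larger exponent than the three dark blocks.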

As per Claim 23, the combination of Drumond, Ling, and Frantz teaches the computer-readable storage devices or media of claim 22.  Frantz teaches wherein each of the outlier activation values in the block floating-point format for outlier values comprises a first mantissa associated with a shared exponent shared by all of the outlier values (Frantz, Para [0063], discloses “If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.” Here, the value comprises a first mantissa associated with a shared exponent shared by all of the outlier values.  This first mantissa, however, is incomplete, as it saturates the bits.  The value also comprises a second (and third, possibly fourth, etc) mantissa associated with a different exponent, as Frantz calculates a new exponent on the deconstructed data.  While Frantz is splitting up the data the value is representing, the overall value of this data still “comprises” the first incomplete value and the smaller constituent values, as an overall property of the data.)

As per Claim 24, the combination of Drumond, Ling, and Frantz teaches the computer-readable storage devices or media of claim 23.  Frantz teaches wherein each of the outlier activation values in the block floating-point format for outlier values comprises a second mantissa associated with a different exponent than the shared exponent (Frantz, Para [0063], discloses “If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.” Here, the value comprises a first mantissa associated with a shared exponent shared by all of the outlier values.  This first mantissa, however, is incomplete, as it saturates the bits.  The value also comprises a second (and third, possibly fourth, etc) mantissa associated with a different exponent, as Frantz calculates a new exponent on the deconstructed data.  While Frantz is splitting up the data the value is representing, the overall value of this data still “comprises” the first incomplete value and the smaller constituent values, as an overall property of the data.)

As per Claim 25, the combination of Drumond, Ling, and Frantz teaches the computer-readable storage devices or media of claim 24.  Frantz teaches wherein the first mantissa and the second mantissa each comprise the same number of bits. (Frantz, Abstract, discloses that all values will be in the same format:  “Embodiments of the invention provide a 16 bit floating point signal processor”, and Frantz Fig 4 shows an 11 bit mantissa.  Frantz does not change the number of bits in the mantissa when splitting up an outlier.)

Claims 6 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Drumond, Ling, and Frantz further in view of Hongxiang et al. (“Reconfigurable Acceleration of 3D-CNNs for Human Action Recognition with Block Floating-Point Representation”; hereinafter Hongxiang).
As per Claim 6, the combination of Drumond, Ling, and Frantz teaches the computing system of claim 5 as shown above.  However, the combination of Drumond, Ling, and Frantz does not teach wherein the outlier quantizer further comprises: a shift controller coupled to a shifter, the shifter being configured to, based on the selected shared exponent, shift the first activation values to produce mantissas for the second values and mantissas for the outlier values for the at least one but not all of the first activation values. 
Hongxiang teaches wherein the outlier quantizer further comprises: a shift controller coupled to a shifter, the shifter being configured to, based on the selected shared exponent, shift the first activation values to produce mantissas for the second values and mantissas for the outlier values for the at least one but not all of the first activation values. (Hongxiang, Sec II B I, discloses “Similar to floating-point (FP), BFP representation utilizes a mantissa and an exponent to represent a wide range of value. However, BFP separates the data into different blocks. The numbers in the same block have a joint scaling factor that corresponds to the largest exponent value within that block.”  Here, Hongxiang discloses that a shared exponent is selected as the largest exponent value.  Hongxiang, Sec III D “Accumulator” Lines 5-10, discloses “Essentially, reordering is first performed on the number to find the maximal exponent. Then the accumulator calculates the discrepancies between the maximum and the two other smaller exponents. The results are fed into shift module to complete the mantissa alignment.”  Here, Hongxiang discloses a shift module (i.e., a shift controller coupled to a shifter) that shifts values to produce mantissas based on a selected shared exponent.  Hongxiang, Figure 8, shows this shift module as a hardware component integrated into the circuit (i.e., outlier quantizer).)
Drumond and Hongxiang are analogous art because they are in the field of endeavor of quantizing neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, with the shifter of Hongxiang. The modification would have been obvious because it amounts to combining prior art elements according to known methods to yield predictable results.  The prior art includes each element claimed, but not combined in a single art reference.  One of ordinary skill in the art could have combined the elements as claimed by known methods, and that in combination, each element merely performs the same function as it does separately.  Drumond’s accelerator and Hongxiang’s shifter perform the same functions apart as they do together.  As a shifter is a well-known basic computer component that performs a well-known function, which is simply to shift a numerical value by a number of bits, one of ordinary skill in the art would have recognized that the results of the combination were predictable (see MPEP 2143 KSR A).

As per Claim 9, the combination of Drumond, Ling, Frantz, and Hongxiang teaches the neural network of claim 1 as shown above.  Hongxiang teaches the outlier quantizer comprises a shifter configured to shift mantissas of the first activation values according to a shared (Hongxiang, Sec II B I, discloses “Similar to floating-point (FP), BFP representation utilizes a mantissa and an exponent to represent a wide range of value. However, BFP separates the data into different blocks. The numbers in the same block have a joint scaling factor that corresponds to the largest exponent value within that block.”  Here, Hongxiang discloses that a shared exponent is selected as the largest exponent value.  Hongxiang, Sec III D “Accumulator” Lines 5-10, discloses “Essentially, reordering is first performed on the number to find the maximal exponent. Then the accumulator calculates the discrepancies between the maximum and the two other smaller exponents. The results are fed into shift module to complete the mantissa alignment.”  Here, Hongxiang discloses a shift module (i.e., a shift controller coupled to a shifter) that shifts values to produce mantissas based on a selected shared exponent, a portion (which may be the entirety) of the result thereby producing a mantissa for second values.)
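
For illustration only, the mantissa alignment quoted from Hongxiang (find the maximal exponent, compute each value’s discrepancy from it, and shift the mantissa by that amount) can be sketched as follows.  This is a minimal sketch, not Hongxiang’s circuit: the function name `align_to_shared_exponent` and the representation of each value as a non-negative integer (mantissa, exponent) pair are assumptions made for the example.

```python
def align_to_shared_exponent(values):
    """Align integer mantissas to the largest exponent in the block.

    values: list of (mantissa, exponent) pairs; mantissas are assumed to be
    non-negative integers.  Returns (shared_exponent, aligned_mantissas).
    """
    shared = max(exp for _, exp in values)          # maximal exponent
    aligned = []
    for mant, exp in values:
        shift = shared - exp                        # discrepancy from the maximum
        aligned.append(mant >> shift)               # right shift drops low-order bits
    return shared, aligned

shared_exp, mantissas = align_to_shared_exponent(
    [(0b1011, 3), (0b1100, 1), (0b0110, 0)]
)
```

In this example the shared exponent is 3, the first mantissa is unchanged, and the other two are shifted right by 2 and 3 bits respectively, which shows how smaller values lose low-order bits once a block-wide exponent is imposed.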
	 
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Drumond, Ling, and Frantz further in view of Koster et al. (“Flexpoint: An adaptive numerical format for efficient training of deep neural networks”; hereinafter Koster).
As per Claim 7, the combination of Drumond, Ling and Frantz teaches the neural network of claim 1 as shown above.  However, the combination of Drumond, Ling, and Frantz does not teach wherein the outlier quantizer comprises a comparator to identify whether a particular one of the first activation values is selected as one of the outlier values for the at least one but not all of the first activation values.
Koster teaches wherein the outlier quantizer comprises a comparator to identify whether a particular one of the first activation values is selected as one of the outlier values for the at least one but not all of the first activation values.  (Koster, Sec 3.2 Last Paragraph, discloses implementation of their system in hardware “Finally, to implement Flexpoint efficiently in hardware, the output exponent has to be determined before the operation is actually performed. Otherwise the intermediate result needs to be stored in high precision, before reading the new exponent and quantizing the result, which would negate much of the potential savings in hardware. Therefore, intelligent management of the exponents is required.”  Koster, Section 3.3 Para 3, defines Gamma as a maximum absolute value based on the activations: “The Autoflex algorithm tracks the maximum absolute value Gamma, of the mantissa of every tensor, by using a dequeue to store a bounded history of these values.”   Koster, Section 3.4, discloses identifying which values are outliers: “At the beginning of training, the statistics queue is empty, so we use a simple trial-and-error scheme described in Algorithm 1 to initialize the exponents. We perform each operation in a loop, inspecting the output value of Gamma for overflows or underutilization, and repeat until the target exponent is found.”  Algorithm 1 discloses “If Gamma >= 2^(N-1) – 1 then overflow”.  Here, Koster is using a hardware component to identify outliers, and the >= operation in the algorithm is indicative of the use of a comparator.)
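
For illustration only, the trial-and-error exponent search quoted from Koster Section 3.4 can be sketched as follows.  Only the overflow test (Gamma >= 2^(N-1) - 1) is taken from the quoted Algorithm 1; the function name `init_exponent`, the underutilization threshold of 2^(N-3), the exponent step of one, and the iteration cap are assumptions made for the example and should not be read as Koster’s method.

```python
def init_exponent(values, n_bits=16, start_exp=0, max_iters=64):
    """Search for a shared exponent whose largest mantissa neither
    overflows nor badly underuses an n_bits signed mantissa."""
    if not any(values):
        return start_exp                       # nothing to scale
    limit = 2 ** (n_bits - 1) - 1              # largest representable mantissa
    exp = start_exp
    for _ in range(max_iters):
        # Gamma: maximum absolute mantissa at the current exponent (saturating).
        gamma = min(max(abs(round(v / 2 ** exp)) for v in values), limit)
        if gamma >= limit:                     # comparator: overflow test from Algorithm 1
            exp += 1
        elif gamma < 2 ** (n_bits - 3):        # assumed underutilization threshold
            exp -= 1
        else:
            return exp
    return exp

chosen = init_exponent([1000.0, -3.0], n_bits=8)
```

With an 8-bit mantissa the value 1000.0 overflows at exponents 0, 1, and 2, and the comparison first fails at exponent 3, where the largest mantissa is 125.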
Drumond and Koster are analogous art because they are in the field of endeavor of quantizing neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, with the comparator 

Claims 8 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Drumond, Ling, and Frantz further in view of Nurvitadhi et al. (US 2019/0205746 A1; hereinafter Nurvitadhi).
As per Claim 8, the combination of Drumond, Ling, and Frantz teaches the neural network of claim 1 as shown above.  However, the combination of Drumond, Ling, and Frantz does not teach wherein the outlier quantizer comprises an address register, and wherein the outlier quantizer is configured to store an index in the memory indicating an address for at least one of the outlier values.
Nurvitadhi teaches wherein the outlier quantizer comprises an address register, and wherein the outlier quantizer is configured to store an index in the memory indicating an address for at least one of the outlier values (Nurvitadhi, Para [0128], discloses the use of an address register:  “When indirect register addressing mode is used, the register address of one or more operands may be computed based on an address register value and an address immediate field in the instruction.”  Here, Nurvitadhi’s indirect addressing mode means that when the processor is processing an instruction that accesses an operand, the address immediate field of the instruction is not directly the address of the operand.  Rather, the address immediate field of the instruction is actually the address of the address register, which in turn contains the actual address of the operand.  Therefore, the address register functions as a lookup index in the memory which indicates an address of at least one operand. As for the “operands”, Nurvitadhi discloses that “operands” are used in dot product calculations in [0129]:  “The vector math group performs arithmetic such as dot product calculations on vector operands”, and that these calculations are for a neural network in [0060]:  “In embodiments, mechanisms for performing sparse matrix processing for arbitrary neural networks are disclosed.”  Dot product calculations in neural networks are matrix operations in which weights and activations are the operands.  Therefore, in combination with the outlier activation values established by the combination of Drumond and Frantz, the operands comprise outlier values, and the address register therefore stores an index in memory indicating an address for at least one of the outlier values.)
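
For illustration only, the indirect register addressing quoted from Nurvitadhi Para [0128] can be sketched as follows.  The quoted passage states only that the operand address is “computed based on an address register value and an address immediate field”; treating that computation as a simple sum, and the names `load_indirect` and `register_file`, are assumptions made for the example.

```python
def load_indirect(register_file, address_register, immediate):
    """Fetch an operand whose location is not given directly by the instruction.

    Assumption: effective address = address register value + immediate field.
    Returns (effective_address, operand).
    """
    effective = address_register + immediate
    return effective, register_file[effective]

# Hypothetical register file in which entry 2 holds an outlier activation value.
register_file = [0.5, 0.25, 96.0, 0.125]
address, operand = load_indirect(register_file, address_register=1, immediate=1)
```

Here the instruction never names entry 2 directly; the address register supplies the base location and the operand (96.0) is reached through it.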
Drumond and Nurvitadhi are analogous art because they are in the field of endeavor of quantizing neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, with the address register for operands of Nurvitadhi. The modification would have been obvious because it amounts to combining prior art elements according to known methods to yield predictable results.  The prior art includes each element claimed, but not combined in a single art reference.  One of ordinary skill in the art could have combined the elements as claimed by known methods, and that in combination, each element merely performs the same function as 

As per Claim 10, the combination of Drumond, Ling, Frantz, and Nurvitadhi teaches the neural network of claim 1.  Nurvitadhi teaches the processors comprise at least one of the following: a tensor processing unit, a neural network accelerator, a graphics processing unit, or a processor implemented in a reconfigurable logic array (Nurvitadhi, Para [0159], discloses the processor comprises a graphics processing unit:  “FIG. 10 illustrates exemplary graphics software architecture for a data processing system 1000 according to some embodiments. In some embodiments, software architecture includes a 3D graphics application 1010, an operating system 1020, and at least one processor 1030. In some embodiments, processor 1030 includes a graphics processor 1032 and one or more general-purpose processor core(s) 1034. The graphics application 1010 and operating system 1020 each execute in the system memory 1050 of the data processing system.”)
and the memory is situated on a different integrated circuit than the processors, the memory includes dynamic random access memory (DRAM) or embedded DRAM (Nurvitadhi, Para [0069], discloses that the memory includes DRAM:  “The memory device 120 can be a dynamic random access memory (DRAM) device”.  Nurvitadhi, Figure 1, shows the memory device 120 on a different integrated circuit than the processors 102)
the hardware accelerator memory including static RAM (SRAM) or a register file (Nurvitadhi, Para [0172], discloses “Memory interface may be provided via a memory controller 1265 for access to SDRAM or SRAM memory devices.”  Nurvitadhi, Figure 12, shows memory controller 1265 as being on the integrated circuit (on-chip memory)).
However, Nurvitadhi does not explicitly teach and the system further comprises a hardware accelerator including a memory temporarily storing the first activation values for at least a portion of a layer of the neural network.
Drumond teaches and the system further comprises a hardware accelerator including a memory temporarily storing the first activation values for at least a portion of a layer of the neural network. (Drumond, Sec 5.3, discloses a hardware accelerator:  “To further illustrate this point, we synthesized a proof-of-concept FPGA-based accelerator.”   Drumond, Sec 5.3 Last Line, discloses storing activation values in on-chip memory:  “The proof-of-concept accelerator operates with both weights and activations stored on-chip.”  This will be temporary, as the activation values will change during the forward and backward passes.  Drumond, Conclusion, specifically uses the term “on-chip memory”:  “Higher throughput leads to faster and more energy-efficient DNN training/inference, while model compression leads to lower bandwidth requirements for off-chip memory, lower capacity requirements for on-chip memory and lower communication bandwidth requirements for distributed training”.)

Claims 11-13, 15, 17, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Mellempudi et al. (US 2018/0322607 A1; hereinafter Mellempudi) in view of Frantz.
As per Claim 11, Mellempudi teaches a method of implementing a neural network, the method comprising (Mellempudi, Abstract, discloses “One embodiment provides for a graphics processing unit to perform computations associated with a neural network”)
with the computing system: producing first activation values in a first block floating-point format  (Mellempudi, Para [0238], discloses “The activations at each layer can be quantized to a low-precision format, such as a dynamic fixed-point or blocked flow-precision floating-point format.”)
converting at least one but not all of the first activation values to a block floating-point format (Mellempudi, Para [0231], discloses “FIG. 21A-21D illustrate blocked dynamic multi-precision data operations, according to embodiments described herein. Blocked dynamic multi-precision data operations can be performed for dynamic fixed-point data and can also be generalized to enable block level scaling for any low precision data type. In a training scenario, some tensors can be blocked, while other tensors can be non-blocked. For example, back propagation may require a larger dynamic range, so the computational logic can be configured to block the tensor data using smaller block sizes. For forward propagation computations, blocking may not be required.”  With each pass of a neural network, activation values are updated (i.e., converted).  Here, Mellempudi discloses that the activation values may be converted to a different format on the backward pass.  In BFP, a shared exponent is selected for each block.  Mellempudi discloses that the block sizes may be smaller for the back propagation, and in fact, blocking may not even be required in the forward propagation (i.e., one shared exponent for the whole tensor).  This means that in the forward and backward passes, a given activation value may be in a different block, and therefore may have a different shared exponent.  This can be considered a different “format”.)
and storing the [outlier] activation values and the outlier values in the block floating-point format having outlier values in a computer-readable memory or storage device. (Mellempudi, Para [0189], discloses storing the activation values: “Embodiments described herein provide for a dynamic fixed-point representation that can be used to store quantized floating-point data”.  Mellempudi Para [0249] discloses computer-readable memory:  “Memory device 2220 can be a dynamic random-access memory (DRAM) device, a static random-access memory (SRAM) device, flash memory device, phase-change memory device, or some other memory device having suitable performance to serve as process memory”)
However, Mellempudi does not teach thereby generating outlier activation values.
Frantz teaches having outlier values, the converting resulting in outlier activation values and outlier values in the block floating-point format having outlier values, the block floating-point format having outlier values being different than the first block floating-point format (Frantz, Para [0062-0063], discloses that conversion to BFP can lead to outliers: “Each block is assigned an exponent value depending on the mean brightness of that block. Then each pixel is assigned a relative mantissa value that represents its brightness compared to the mean brightness. For example, if one were to take a picture of a person standing in front of a window, the blocks would be such that the person's body would be represented by a set of blocks with the same exponent value (or similar), and the window would be represented by a set of blocks with another exponent value. In FIG. 5 and FIG. 6, it is evident how the blocks might be created for the picture shown. The very bright areas would have high exponent values, and the dark areas would have low exponent values. In the example, square blocks of pixels are used, but any shape of the blocks would be appropriate--even non-rectangular. The ultimate solution is that the mantissas for all of the pixels in the scene create a normalized picture, and the exponent values indicate the variation of brightness over the whole scene.  If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.”  Here, Frantz discloses a method of conversion to block floating point format, wherein some of the values “saturate the mantissa” (i.e., are outliers)).  Note Examiner’s diagram of Frantz below:

[Figure: Examiner’s diagram of Frantz (media_image2.png, 616 x 1109, greyscale)]

Here, Frantz discloses a method of conversion to block floating point format, wherein some of the values “saturate the mantissa” (i.e., are outliers).  These are stored in a second block floating point format with second shared exponents and activation values, which represent the outlier values.  The outlier value indeed has “higher precision”, as it required more bits to store under the original exponent; in addition, the fact that the blocks had to be split into smaller blocks indicates a more targeted and “precise” representation of the outlier value.  This block floating-point format for outlier values is different from the first block floating-point format, as it relies on a different exponent and is based on a smaller group of values.
Mellempudi and Frantz are analogous art because they are in the field of endeavor of efficient floating point operations. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the neural network quantization of Mellempudi, with the overflow storage of Frantz. The modification would have been obvious because one of ordinary skill in the art would be motivated to avoid losing data. (Mellempudi, Para [0218], discloses: “Overflow and/or saturation of accumulator during integer arithmetic operations introduces significant computational errors while performing longer accumulation chains (such as GEMM or Convolution”)).

As per Claim 12, the combination of Mellempudi and Frantz teaches the method of claim 11 as shown above, as well as wherein each of the outlier activation values in the block floating-point format having outlier values comprises a first mantissa associated with a shared exponent shared by all of the outlier values and a second mantissa associated with a different exponent than the shared exponent. (Frantz, Para [0063], discloses “If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.” Here, the value comprises a first mantissa associated with a shared exponent shared by all of the outlier values.  This first mantissa, however, is incomplete, as it saturates the bits.  The value also comprises a second (and third, possibly fourth, etc) mantissa associated with a different exponent, as Frantz calculates a new exponent on the deconstructed data.  While Frantz is splitting up the data the value is representing, the overall value of this data still “comprises” the first incomplete value and the smaller constituent values, as an overall property of the data.)

As per Claim 13, the combination of Mellempudi and Frantz teaches the method of claim 12 as shown above, as well as wherein the first mantissa and the second mantissa each comprise the same number of bits. (Frantz, Abstract, discloses that all values will be in the same format:  “Embodiments of the invention provide a 16 bit floating point signal processor”, and Frantz Fig 4 shows an 11 bit mantissa.  Frantz does not change the number of bits in the mantissa when splitting up an outlier.)

As per Claim 15, the combination of Mellempudi and Frantz teaches the method of claim 11.  Mellempudi teaches generating uncompressed activation values by reading the stored [outlier] activation values from the computer-readable memory or storage device (Mellempudi, Para [0238], discloses activation values: “The activations at each layer can be quantized to a low-precision format, such as a dynamic fixed-point or blocked flow-precision floating-point format” and Mellempudi Para [0189] discloses storing quantized values (i.e. activation values) in memory:  “Embodiments described herein provide for a dynamic fixed-point representation that can be used to store quantized floating-point data”.  In order for the activation values to be of any use in calculations like a dot product in Mellempudi Para [0292] “The vector math group performs arithmetic such as dot product calculations on vector operands”, the values must be retrieved from memory.  It has not been stated that the values have been “compressed”, and “uncompressed” has been given no meaning in this limitation; therefore, these activation values can be considered uncompressed.)
However, Mellempudi does not teach outlier values;  wherein for each of the uncompressed activation values: when the uncompressed activation value is associated with an outlier activation value, generating a respective one of the uncompressed activation values by: combining a first value defined by a first mantissa for the uncompressed activation value and a shared exponent with a second value defined by a second mantissa associated with an outlier exponent associated with the uncompressed activation value, and when the uncompressed activation value is not associated with an outlier activation value, generating a respective one of the uncompressed activation values by: producing a first value defined by a first mantissa for the uncompressed activation value and a shared exponent.
Frantz teaches outlier [activation] values; wherein for each of the uncompressed [activation] values: when the uncompressed [activation] value is associated with an outlier [activation] value, generating a respective one of the uncompressed [activation] values by: combining a first value defined by a first mantissa for the uncompressed [activation] value and a shared exponent with a second value defined by a second mantissa associated with an outlier exponent associated with the uncompressed [activation] value, and when the uncompressed [activation] value is not associated with an outlier [activation] value, generating a respective one of the uncompressed [activation] values by: producing a first value defined by a first mantissa for the uncompressed [activation] value and a shared exponent. (Frantz, Para [0063], discloses outlier values, and how they are partitioned and stored: “If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.”  Mellempudi, as shown above, discloses retrieving activation values from memory.  When combined with Frantz, this necessitates retrieving non-outlier BFP representations with a first mantissa and shared exponent, and retrieving outlier BFP representations by combining a first mantissa with a shared exponent and a second mantissa with an outlier exponent, which, in Frantz’s case, is the same exponent as the shared exponent used with the first mantissa, as Frantz has split the block off into another BFP block with a shared exponent.  Combining the values in this block reconstructs the original data.)
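
For illustration only, the two reconstruction cases recited in the Claim 15 limitation above can be sketched as follows.  This is a minimal sketch of the claim language as quoted, not of any reference’s implementation: the function name `decompress`, the use of integer mantissas with base-2 exponents, and the choice to combine the two values by addition are assumptions made for the example.

```python
def decompress(first_mantissa, shared_exp, outlier=None):
    """Reconstruct one uncompressed activation value.

    outlier: None for a non-outlier value, otherwise a
    (second_mantissa, outlier_exp) pair stored alongside the value.
    """
    # First value: first mantissa scaled by the block's shared exponent.
    value = first_mantissa * 2.0 ** shared_exp
    if outlier is not None:
        second_mantissa, outlier_exp = outlier
        # Second value: second mantissa scaled by the outlier exponent,
        # combined (here, by addition) with the first value.
        value += second_mantissa * 2.0 ** outlier_exp
    return value

normal_value = decompress(3, -2)
outlier_value = decompress(7, -2, outlier=(5, 1))
```

With a shared exponent of -2, the non-outlier value reconstructs to 0.75 from its first mantissa alone, while the outlier reconstructs to 11.75 by adding the second value (5 x 2^1) to the first (7 x 2^-2).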

As per Claim 17, the combination of Mellempudi and Frantz teaches the method of claim 11 as shown above, as well as producing the first activation values by performing forward propagation for at least one layer of the neural network (Mellempudi, Para [0238], discloses activation values: “The activations at each layer can be quantized to a low-precision format, such as a dynamic fixed-point or blocked flow-precision floating-point format” and in [0231] performing forward propagation:  “For forward propagation computations, blocking may not be required.”)
(a) converting the stored, second activation values into uncompressed activation values in the first block floating-point format (Note that “stored, second activation values” lacks antecedent basis in the claim.  Mellempudi, Para [0231], discloses backward propagation:  “For example, back propagation may require a larger dynamic range, so the computational logic can be configured to block the tensor data using smaller block sizes.”  Here, Mellempudi discloses second activation values (those constructed in a different block format in the back propagation as opposed to the forward propagation).  The backward propagation will necessarily lead to the subsequent forward propagation, resulting in conversion back to the first floating point format.  Compression has been given no definition, so values in both block floating point formats are “uncompressed”)
(b) performing a gradient operation with the uncompressed activation values (Mellempudi, Para [0158], discloses “The error values are then propagated backwards until each neuron has an associated error value which roughly represents its contribution to the original output. The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the neural network.”)
(c) and updating weights for at least one node of the neural network based on the uncompressed activation values (Mellempudi, Para [0140] discloses “Training a neural network involves selecting a network topology, using a set of training data representing a problem being modeled by the network, and adjusting the weights until the network model performs with a minimal error for all instances of the training data set”)

As per Claim 21, the combination of Mellempudi and Frantz teaches the method of claim 11 as shown above, as well as wherein the at least one but not all of the first activation values that are converted are larger than unconverted activation values. (Frantz, Para [0063], discloses “If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.” Here, Frantz discloses a value that “saturates the mantissa” and thus uses up all the bits of the mantissa, and thus a higher exponent is needed to fit the entire value.  This means that the first activation values that are converted are larger than unconverted activation values.)

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Mellempudi and Frantz further in view of Laerd Statistics (“Measures of Central Tendency”; hereinafter Laerd).
As per Claim 14, the combination of Mellempudi and Frantz teaches the method of claim 12 as shown above.  Frantz teaches further comprising identifying the shared exponent by determining [at least one of a median, and/or a mode] an average for at least a portion of the first activation values. (Frantz, Para [0062], discloses “Each block is assigned an exponent value depending on the mean brightness of that block.”  Here, Frantz discloses the mean.)
However, the combination of Mellempudi and Frantz does not teach determining at least one of a median, and/or a mode.
Laerd teaches determining at least one of a median, and/or a mode (Recall that Frantz discloses the mean.  Laerd, Pg 2, discloses: “The mean has one main disadvantage: it is particularly susceptible to the influence of outliers.”, and later discloses:  “The median is less affected by outliers and skewed data.”  Here, Laerd discloses using the median.)
Frantz and Laerd are analogous art because they are both in the field of endeavor of statistics.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the mean shared exponent of Frantz, with the median of Laerd. The modification would have been obvious because one of ordinary skill in the art would be motivated to minimize the effects of skewed data.  For example, if the mean was 0 for a data set of -4000000, 1000000, 1000000, 1000000, 1000000, 4 of those values might saturate the mantissa and be outliers.  The median (and, in fact, also mode) here would be 1000000 and the exponent would be proper for most of the data, with only 1 outlier.  (Laerd Pg 2:  “The mean has one main disadvantage: it is particularly susceptible to the influence of outliers.”…“The median is less affected by outliers and skewed data.”)
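The arithmetic of the example data set above can be checked with a short sketch. It uses only Python's standard statistics module. The variable names, and the choice to count values that differ from the median as a rough stand-in for "outliers", are assumptions made for illustration and do not come from Frantz or Laerd.

```python
import statistics

# Example data set taken from the rationale above.
data = [-4000000, 1000000, 1000000, 1000000, 1000000]

mean = statistics.mean(data)      # arithmetic mean of the five values
median = statistics.median(data)  # middle value after sorting
mode = statistics.mode(data)      # most frequent value

# Assumption: treat any value that differs from the median as an "outlier".
outliers_vs_median = [v for v in data if v != median]
```

With this data the mean is 0 while the median and mode are both 1000000, so a reference chosen from the median or mode matches four of the five values and leaves a single outlier.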

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Mellempudi and Frantz further in view of Chong et al. (US 2019/0373264 A1; hereinafter Chong).
As per Claim 16, the combination of Mellempudi and Frantz teaches the method of claim 11 as shown above.  However, the combination of Mellempudi and Frantz does not teach prior to the storing, compressing the outlier activation values stored in the computer-readable memory or storage device by one or more of the following techniques: entropy compression, zero compression, run length encoding, compressed sparse row compression, or compressed sparse column compression.
Chong teaches prior to the storing, compressing the outlier activation values stored in the computer-readable memory or storage device by one or more of the following techniques: entropy compression, zero compression, run length encoding, compressed sparse row compression, or compressed sparse column compression. (Chong, Para [0102], discloses compressing activation data of a neural network using entropy compression, and storing it:  “To reduce memory access bandwidth requirements for neural network data, the neural network device or neural network component can perform a method to compress data from intermediate nodes in the neural network in a lossless manner. For example, a neural network coding engine of the neural network device or neural network component can be used after each hidden layer of a neural network to compress the activation data output from each hidden layer. The activation data output from a hidden layer can include a 3D volume of data having a width, height, and depth, with the depth corresponding to multiple layers of filters for that hidden layer, and each depth layer having a width and height. For instance, a feature map (with activation or feature data) having a width and height is provided for each depth layer. The compressed data can be stored in a storage device or memory. The storage device or memory can be internal to the neural network device or neural network hardware component, or can be external to the device or hardware component. The neural network coding engine can retrieve the compressed activation data (e.g., read the compressed data and load the data in a local cache), and can decompress the compressed activation data before providing the decompressed activation data as input to a next layer of the neural network. In some examples, a prediction scheme can be applied to the activation data, and residual data can be determined based on the prediction scheme. 
In one illustrative example, given a block of neural network data (e.g., activation data from a hidden layer), the neural network coding engine can apply a prediction scheme to each sample in the block of neural network data, and residual data can be determined based on the prediction scheme. The residual data can then be coded using a coding technique. Any suitable coding technique can be used, such as variable-length coding (VLC), arithmetic coding, other type of entropy coding, or other suitable technique.”)
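The predict-then-code flow described in the quoted passage can be sketched as follows. This is only an illustrative sketch under stated assumptions and is not Chong's coding engine: the previous-sample predictor, the 8-bit sample range, and the function names compress_activations and decompress_activations are hypothetical, and Python's zlib (DEFLATE, which includes Huffman coding) is swapped in as a stand-in for the variable-length or arithmetic coder named in the quote.

```python
import zlib

def compress_activations(samples):
    """Losslessly compress 8-bit activation samples (ints in 0..255, an assumption).
    Each sample is predicted from the previous one; only the residual is coded."""
    prev = 0
    residuals = bytearray()
    for s in samples:
        residuals.append((s - prev) % 256)   # residual = sample - prediction (mod 256)
        prev = s
    return zlib.compress(bytes(residuals))   # stand-in entropy-style coder

def decompress_activations(blob):
    """Invert compress_activations: decode residuals, then undo the prediction."""
    prev = 0
    samples = []
    for r in zlib.decompress(blob):
        prev = (prev + r) % 256
        samples.append(prev)
    return samples

activations = [10, 11, 12, 12, 13, 200, 0, 0, 0, 5]
stored = compress_activations(activations)     # what would be written to memory
restored = decompress_activations(stored)      # what the next layer would read
```

The round trip returns the original samples exactly, which corresponds to the "lossless manner" recited in the quoted paragraph. Whether the coded form is actually smaller depends on the data and is not asserted here.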
Mellempudi and Chong are analogous art because they are in the field of endeavor of neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the neural network quantization of Mellempudi, with the entropy compression of Chong. The modification would have been obvious because one of ordinary skill in the art would be motivated to reduce bandwidth and power consumption. (Chong Para [0003]:  “In some cases, for the intermediate layers of a neural network, either 8 bit or 16 bit fixed or 

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710.  The examiner can normally be reached on M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/L.A.S./Examiner, Art Unit 2126 
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126