DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 2022-03-17 has been entered.  The status of the claims is as follows:
Claims 1-17 and 21-25 are pending in the application.
Claims 18-20 are cancelled.
Claims 1, 11, 17, and 22 are amended.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 2021-09-15 and 2022-03-17 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.
Response to Arguments
Applicant's argument with respect to rejections under 35 U.S.C. 103 for Claim 1 has been fully considered.  While Drumond does teach “convert a plurality of the first activation values to a second block floating-point format having a different, second precision than the first precision” (Drumond, Page 5 Sec 5.1 Line 6, discloses converting from FP32 to a less precise BFP format), Applicant correctly argues, regarding newly amended matter, on Page 9 that “Frantz does not disclose ‘outlier values having additional bits of precision’”, as Frantz’s outliers are stored with the same number of mantissa bits and only a different exponent.  This argument is now rendered moot by the addition of Retter, who discloses, in Col 2 Lines 37-50 and Col 3 Lines 58-65, storing up to two bits of overflow in a 2-bit latch in a block floating-point arithmetic system.
Applicant’s argument with respect to rejections under 35 U.S.C. 103 for Claim 11 has been fully considered, but is moot because the rejection of this claim no longer relies on Mellempudi and Frantz and instead relies on Drumond, Ling, and Retter, consistent with the rejections of independent claims 1 and 22.
Applicant's argument with respect to rejections under 35 U.S.C. 103 for Claim 22 has been fully considered, but is moot in light of the newly applied art Retter.
Drawings
Drawings 14-17 are objected to because color photographs and color drawings are not accepted in utility applications unless a petition filed under 37 CFR 1.84(a)(2) is granted. Any such petition must be accompanied by the appropriate fee set forth in 37 CFR 1.17(h), one set of color drawings or color photographs, as appropriate, if submitted via EFS-Web or three sets of color drawings or color photographs, as appropriate, if not submitted via EFS-Web, and, unless already present, an amendment to include the following language as the first paragraph of the brief description of the drawings section of the specification:
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Color photographs will be accepted if the conditions for accepting color drawings and black and white photographs have been satisfied. See 37 CFR 1.84(b)(2).

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: "an outlier selector configured to select a shared exponent for the second values" in claim 5.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
Pursuant to the above, the limitation "an outlier selector configured to select a shared exponent for the second values" in Claim 5 is being interpreted under 35 U.S.C. 112(f), as “selector” is simply a substitute for “means”, as “outlier selector” does not have any generally understood structural meaning in the art (see MPEP 2181(I)(A)) and “configured to” is a linking phrase in place of “for” that makes it clear that the claim is reciting a function (see MPEP 2181(I)(B) re: “configured to”).  Accordingly, it is being interpreted as per the Specification Para [0105] Lines 2-10, which provides acts for achieving the specified function:  “The outlier quantizer 765 can include an outlier selector that determines a shared exponent for the compressed activation values. For example, the selector can identify a shared exponent by determining that at least one of a mean (average), a median, and/or a mode for at least a portion of the activation values. In some examples, the selector can identify a shared exponent by identifying a group of largest outlying values in the set of activation values and for the remaining group of activation values not identified to be in the group of the largest outliers, determining a shared exponent by selecting the largest exponent of the remaining group. In some examples, the exponent used by the largest number of activation values is selected as the shared exponent.”  
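For illustration only, two of the selection strategies quoted from Para [0105] can be sketched in Python as follows.  This is a minimal sketch under stated assumptions, not the Specification's implementation: the function names and the num_outliers parameter are hypothetical, zero values are ignored when computing exponents, and exponents follow the convention of Python's math.frexp (value = m * 2**e with 0.5 <= |m| < 1).

```python
import math
from collections import Counter


def exponent_of(x):
    """Binary exponent e of x, where x = m * 2**e and 0.5 <= |m| < 1."""
    return math.frexp(x)[1]


def shared_exponent_excluding_outliers(values, num_outliers):
    """Largest exponent among the values left after setting aside the
    num_outliers largest-magnitude values (hypothetical parameter)."""
    ranked = sorted((v for v in values if v != 0.0), key=abs, reverse=True)
    remaining = ranked[num_outliers:]
    # Assumption: fall back to exponent 0 if nothing remains.
    return max((exponent_of(v) for v in remaining), default=0)


def shared_exponent_by_mode(values):
    """Exponent used by the largest number of (nonzero) values."""
    exps = [exponent_of(v) for v in values if v != 0.0]
    return Counter(exps).most_common(1)[0][0] if exps else 0
```

With the values 0.5, 0.75, 1.5, 100.0, and 0.6, setting aside the single largest value as an outlier gives a shared exponent of 1 (set by 1.5), whereas keeping every value gives 7 (set by 100.0); the most common exponent is 0.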
101 Remarks
Independent claim 22, and dependent claims 23-25, recite “computer-readable storage devices or media”.  While “non-transitory” is not explicitly stated in the claims, the specification para [0146] properly excludes signals:  “Computer-readable media are any available media that can be accessed within a computing environment 1300. By way of example, and not limitation, with the computing environment 1300, computer-readable media include memory 1320 and/or storage 1340. As should be readily understood, the term computer-readable storage media includes the media for data storage such as memory 1320 and storage 1340, and not transmission media such as modulated data signals.”
  
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 5, 11, 15, 17, and 21-22 are rejected under 35 U.S.C. 103 as being unpatentable over Drumond et al. (“Training DNNs with Hybrid Block Floating Point”; hereinafter Drumond) in view of Ling et al. (“Harnessing Numerical Flexibility for Deep Learning on FPGAs”; hereinafter Ling) and Retter (US 4,872,132 A; hereinafter Retter).
As per Claim 1, Drumond teaches a neural network comprising (Drumond, Page 1 Abstract, discloses “DNNs”:  “The wide adoption of DNNs has given birth to unrelenting computing requirements, forcing datacenter operators to adopt domain-specific accelerators to train them” and Drumond, Page 9 Conclusion, ends with “Higher throughput leads to faster and more energy-efficient DNN training/inference, while model compression leads to lower bandwidth requirements for off-chip memory, lower capacity requirements for on-chip memory and lower communication bandwidth requirements for distributed training”.  Here, Drumond discloses memory and a chip (i.e., processor)).
an outlier quantizer formed from one or more processors, the outlier quantizer being in communication with a memory (Drumond, Page 6 Sec 5.3, discloses a piece of hardware, an “accelerator”:  “HBFP accelerators exhibit arithmetic density that is similar to their fixed-point counterparts. To further illustrate this point, we synthesized a proof-of-concept FPGA-based accelerator. Figure 2 shows the block diagram of the accelerator.”  Drumond, Page 6 Figure 2, discloses an “External IO interface”.  IO refers to input and output of data, therefore the accelerator must be in communication with the memory, which is where data is stored.  Drumond, Page 9 Conclusion, also discloses on-chip and off-chip memory:  “Higher throughput leads to faster and more energy-efficient DNN training/inference, while model compression leads to lower bandwidth requirements for off-chip memory, lower capacity requirements for on-chip memory and lower communication bandwidth requirements for distributed training”.  Here, Drumond discloses memory and a chip (i.e., processor).)
and the neural network being configured to: produce first activation values in a first floating-point format having a first precision (Drumond, Page 5 Sec 5.1 Line 6, discloses “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative”.  Here, Drumond discloses that activations are converted from normal floating point (which has a precision of 32 bits), to block floating point (BFP).  Thus, the neural network has produced activation values in FP32, a first precision.)
with the outlier quantizer, convert a plurality of the first activation values to a second block floating-point format having a different, second precision than the first precision and thereby produce second activation values comprising: (1) values in the second block floating-point format for all of the first activation values (Drumond, Page 5 Sec 5.1 Line 6, discloses “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative”. Here, Drumond converts the activations from normal floating point to BFP.  This has a different, lower precision, as disclosed by Drumond at the bottom of Page 2:  “an exploration of the HBFP design space showing that DNNs trained on BFP with 12- and 8-bit mantissas match FP32 accuracy, serving as a drop-in replacement for this representation.”  Here the BFP precision is 12 or 8 bits, rather than the full 32, and is thus a different, second precision than the first precision.)
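For illustration only, a conversion of the kind described above (floating-point activations to a block floating-point format with one shared exponent and narrower mantissas) can be sketched as follows.  This is a minimal sketch under stated assumptions and is not Drumond's code: the names to_bfp and from_bfp are hypothetical, the shared exponent is assumed to be the largest exponent in the block, and mantissas are assumed to be signed integers that saturate at the ends of their range.

```python
import math


def to_bfp(values, mantissa_bits):
    """Quantize a block to (shared_exp, signed integer mantissas)."""
    # Assumption: the shared exponent is the largest exponent in the block.
    shared_exp = max((math.frexp(v)[1] for v in values if v != 0.0), default=0)
    # One mantissa step corresponds to this real-valued increment.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    lo = -(1 << (mantissa_bits - 1))
    hi = (1 << (mantissa_bits - 1)) - 1
    # Round to the nearest step and saturate to the signed mantissa range.
    mantissas = [max(lo, min(hi, round(v / scale))) for v in values]
    return shared_exp, mantissas


def from_bfp(shared_exp, mantissas, mantissa_bits):
    """Recover approximate real values from a quantized block."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]
```

Under these assumptions, the block [8.0, 0.3] quantized with 4-bit mantissas yields a shared exponent of 4 and mantissas [4, 0]: the small value is rounded to zero because the shared exponent is dictated by the large one, which is the loss of precision that a narrower format entails.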
However, Drumond does not explicitly teach store the second activation values in the memory; or outlier values having additional bits of precision used to represent at least one but not all of the second activation values in the second block floating-point format.
Ling teaches store the second activation values in the memory (Ling, Page 2 Sec 2.2 Last Paragraph, discloses:  “In addition to implementing dot-products in block floating point form, we also can store the data in block floating point form.”  Here, Ling discloses calculating dot products with block floating point form. The dot product calculation in a neural network comprises activations, therefore Ling discloses converting activation values to block floating point form (i.e., second activation values).   Finally, Ling discloses storing the data, which includes activation values, in block floating point form (i.e., storing the second activation values in memory).  Ling, Page 2 Table 3, discloses “Block size vs no blocking used to store weights and intermediate data (feature maps)”, wherein feature maps are also known in the art as “activation maps” and comprise activations.)  
Drumond and Ling are analogous art because they are in the field of endeavor of quantizing neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, with storing of quantized activation values of Ling. The modification would have been obvious because one of ordinary skill in the art would be motivated to reduce the memory required to store the data. (Ling, Page 2 Sec 2.2 Last Paragraph, discloses:  “In addition to implementing dot-products in block floating point form, we also can store the data in block floating point form. This can lead to a significant reduction in both memory bandwidth to fetch data, and memory capacity to store the data either on or off chip.”)
Drumond and Ling thus far fail to explicitly teach with the outlier quantizer, convert a plurality of the first activation values to a second block floating-point format with outlier values having higher precision than the first floating-point format and thereby produce second activation values comprising (2) outlier values having additional bits of precision used to represent at least one but not all of the second activation values in the second block floating-point format.
Retter teaches outlier values having additional bits of precision used to represent at least one but not all of the second activation values in the second block floating-point format. (Retter, Col 2 Lines 37-50, discloses:  “An object of the present invention is improved circuitry for expeditiously implementing block floating point arithmetic such as used in executing fast Fourier transforms. A feature of the invention is the use of a maximum scale register for the largest scale of data in a block, and the scaling of data based on the difference in the maximum scale register content and the individual data scale factor. Another feature of the invention is latch means for storing overflows during the processing of block data and the adjustment of data in subsequent data to conform with the overflow N-scale.”  Retter provides further detail in Col 3 Lines 58-65:  “The overflow of up to two bits is stored in a 2-bit latch 28, and at the end of each FFT pass the count stored in latch 28 is transferred to a 4-bit counter 30 and to a second 2-bit latch 32. The 4-bit counter 30 can store up to a 16 value overflow for each complete FFT operation, and at the end of each FFT operation the count stored in counter 30 is transferred to a scale register 34 and updates a maximum scale register 26. Maximum scale register 26 is then loaded into the old max scale register 24 prior to the beginning of subsequent operations on the data.”  As seen above, Retter discloses a block floating-point format, wherein outlier values (“overflow”) have additional bits of precision (“The overflow of up to two bits is stored in a 2-bit latch 28”)).
store the outlier values in the memory (Retter, as shown above, discloses storing the outlier bits in a “2-bit latch”.)
	Retter and the combination of Drumond and Ling are analogous art because they are in the field of endeavor of efficient floating point operations. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond and Ling, with the overflow storage of Retter. The modification would have been obvious because one of ordinary skill in the art would be motivated to avoid losing data (Retter, Col 2 Lines 9-14, discloses: “This way, the scaling is done only when necessary, retaining maximum precision throughout the whole operation, and the number of shifts is according to the actual number of overflow bits--giving complete protection against losing information on account of the overflow.”)
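For illustration only, one way a block floating-point format could carry additional bits for some but not all of its values is sketched below.  This is a sketch under stated assumptions, not Retter's circuit and not the claimed implementation: the names quantize_with_outliers and dequantize_with_outliers, the extra_bits parameter, and the sign-magnitude split are hypothetical choices.  The shared exponent is supplied by the caller, so a value too large for the regular mantissa spills its high-order bits into a separate entry kept only for that element.

```python
def quantize_with_outliers(values, mantissa_bits, shared_exp, extra_bits):
    """Return (mantissas, outliers); outliers maps index -> overflow bits."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 1 << (mantissa_bits - 1)            # regular mantissa magnitudes stay below this
    wide_max = (1 << (mantissa_bits - 1 + extra_bits)) - 1
    mantissas, outliers = [], {}
    for i, v in enumerate(values):
        # Round, then saturate to the widened (mantissa + extra bits) range.
        q = max(-wide_max, min(wide_max, round(v / scale)))
        sign = -1 if q < 0 else 1
        high, low = divmod(abs(q), limit)       # overflow bits, regular bits
        mantissas.append(sign * low)
        if high:                                # only overflowing values get an entry
            outliers[i] = sign * high
    return mantissas, outliers


def dequantize_with_outliers(mantissas, outliers, mantissa_bits, shared_exp):
    """Recombine regular mantissas with any stored overflow bits."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 1 << (mantissa_bits - 1)
    return [(m + outliers.get(i, 0) * limit) * scale
            for i, m in enumerate(mantissas)]
```

Under these assumptions, quantizing [0.5, -0.25, 3.5] with 4-bit mantissas, a shared exponent of 0, and 2 extra bits gives mantissas [4, -2, 4] and a single outlier entry {2: 3}; only the third value consumes additional bits, and all three values are recovered exactly on dequantization.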

	As per Claim 2, the combination of Drumond, Ling, and Retter teaches the neural network of claim 1. Drumond teaches wherein the first floating-point format is a block floating point format. (As shown above in Claim 1, Drumond discloses converting activations from FP32 to block floating point format.  However, Drumond also suggests converting from one precision of BFP to another during different operations of the neural network.  Drumond, top of Page 6, discloses:  “We handle the weights in the optimizer. We created a shell optimizer that takes the original optimizer, performs its update function in FP32 and converts the weights to two BFP formats: one with wide and another with narrow mantissas. The former is used in future weight updates while the latter is used in forward and backward passes.”  Here, Drumond discloses BFP formats with “wide” and “narrow” mantissas.  Thus, both the first and second floating point formats for the activation values can be BFP.)

	As per Claim 3, the combination of Drumond, Ling, and Retter teaches the neural network of claim 1. Drumond teaches wherein each of the second activation values comprises a mantissa having fewer bits than its respective mantissa in the first floating-point format. (Drumond, Page 5 Sec 5.1 Line 6, discloses “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative”. Here, Drumond converts the activations from normal floating point to BFP.  This has a different, lower precision, as disclosed by Drumond at the bottom of Page 2:  “an exploration of the HBFP design space showing that DNNs trained on BFP with 12- and 8-bit mantissas match FP32 accuracy, serving as a drop-in replacement for this representation.”  Here the BFP mantissa is 12 or 8 bits, rather than the 23 explicitly stored mantissa bits of FP32 (IEEE 754 single precision), and thus the mantissa has fewer bits than in the first FP32 format.)

As per Claim 5, the combination of Drumond, Ling, and Retter teaches the neural network of claim 1. Drumond teaches wherein the outlier quantizer comprises: an outlier selector configured to select a shared exponent for the values in the second block floating-point format.  (Drumond, Page 4 Section 3 Last Paragraph Lines 3-4, discloses a shared exponent:  “However, BFP logic is denser because exponents are shared across entire tensors, resulting in dot products that can be computed entirely in fixed-point logic.”  Drumond, Section 2 Last Paragraph Lines 8-10, discloses “Our approach computes exponents more frequently and it does so in-device, without requiring any additional stat collection, and accommodating dynamic dataflows naturally.”  Here, Drumond discloses an in-device selector (i.e., in the hardware, an outlier selector in the outlier quantizer) that selects a shared exponent.  Note that, per the 112(f) analysis above, examiner is interpreting “outlier selector” in light of Para [0105]:  “For example, the selector can identify a shared exponent by determining that at least one of a mean (average), a median, and/or a mode for at least a portion of the activation values.”)

As per Claim 11, Drumond teaches a method of implementing a neural network, the method comprising (Drumond, Bottom of Page 2, discloses “a hybrid BFP-FP (HBFP) DNN training framework”)
with a computing system implementing the neural network (Drumond, Bottom of Page 2, discloses a computing system:  “we show, with an FPGA prototype”)
with the computing system: producing first activation values for a tensor in a first block floating-point format; converting the first activation values into [outlier] activation values in a block floating-point format (Drumond, Sec 5.1 Line 6, discloses “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative”.  Here, Drumond discloses that activations are converted from normal floating point (which has a precision of 32 bits), to block floating point (BFP).  Thus, the neural network has produced activation values in FP32, a first precision.  The BFP format has a different, lower precision, as disclosed by Drumond at the bottom of Page 2:  “an exploration of the HBFP design space showing that DNNs trained on BFP with 12- and 8-bit mantissas match FP32 accuracy, serving as a drop-in replacement for this representation.”  Here the BFP precision is 12 or 8 bits, rather than the full 32, and is thus a different, second precision than the first precision. Drumond, Page 4 Section 4, discloses tensors:  “Equation (1) computes the real value ai of an element i of a BFP tensor a with mantissa mai and exponent ea.”
Examiner notes that the first activation values described above are FP32, which is normal floating point, and not block floating point format.  However, Drumond also suggests converting from one precision of BFP to another during different operations of the neural network.  Drumond, top of Page 6, discloses:  “We handle the weights in the optimizer. We created a shell optimizer that takes the original optimizer, performs its update function in FP32 and converts the weights to two BFP formats: one with wide and another with narrow mantissas. The former is used in future weight updates while the latter is used in forward and backward passes.”  Here, Drumond discloses BFP formats with “wide” and “narrow” mantissas.  Thus, both the first and second floating point formats for the activation values can be BFP.)
However, Drumond does not teach block floating-point format having outlier values representing one or more additional bits of mantissa, exponent, or mantissa and exponent for at least one but not all of the outlier activation values; and storing the outlier activation values including the outlier values in the block floating-point format having outlier values in a computer-readable memory or storage device.
Ling teaches storing the [outlier] activation values in the block floating-point format [having outlier values] in a computer-readable memory or storage device. (Ling, Sec 2.2 Last Paragraph, discloses:  “In addition to implementing dot-products in block floating point form, we also can store the data in block floating point form.”  Here, Ling discloses calculating dot products with block floating point form. The dot product calculation in a neural network comprises activations, therefore Ling discloses converting activation values to block floating point form (i.e., second activation values).   Finally, Ling discloses storing the data, which includes activation values, in block floating point form (i.e., storing the second activation values in memory).  Ling, Table 3, discloses “Block size vs no blocking used to store weights and intermediate data (feature maps)”, wherein feature maps are also known in the art as “activation maps” and comprise activations.)   
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Drumond and Ling for at least the reasons recited in Claim 1.
However, the combination of Drumond and Ling thus far fails to teach block floating-point format having outlier values representing one or more additional bits of mantissa, exponent, or mantissa and exponent for at least one but not all of the outlier activation values; and storing the outlier values in the block floating-point format having outlier values in a computer-readable memory or storage device.
Retter teaches block floating-point format having outlier values representing one or more additional bits of mantissa, exponent, or mantissa and exponent for at least one but not all of the outlier activation values; (Retter, Col 2 Lines 37-50, discloses:  “An object of the present invention is improved circuitry for expeditiously implementing block floating point arithmetic such as used in executing fast Fourier transforms. A feature of the invention is the use of a maximum scale register for the largest scale of data in a block, and the scaling of data based on the difference in the maximum scale register content and the individual data scale factor. Another feature of the invention is latch means for storing overflows during the processing of block data and the adjustment of data in subsequent data to conform with the overflow N-scale.”  Retter provides further detail in Col 3 Lines 58-65:  “The overflow of up to two bits is stored in a 2-bit latch 28, and at the end of each FFT pass the count stored in latch 28 is transferred to a 4-bit counter 30 and to a second 2-bit latch 32. The 4-bit counter 30 can store up to a 16 value overflow for each complete FFT operation, and at the end of each FFT operation the count stored in counter 30 is transferred to a scale register 34 and updates a maximum scale register 26. Maximum scale register 26 is then loaded into the old max scale register 24 prior to the beginning of subsequent operations on the data.”  As seen above, Retter discloses a block floating-point format, wherein outlier values (“overflow”) have additional bits of precision (“The overflow of up to two bits is stored in a 2-bit latch 28”).  Drumond and Ling established activation values, and while some of them may be “outliers” compared to the other activation values, it is possible that “at least one but not all” of them would be far enough from the others that they would overflow their registers and require the overflow latch of Retter.)
storing the outlier values in the block floating-point format having outlier values in a computer-readable memory or storage device (Retter, as shown above, discloses storing the outlier bits in a “2-bit latch”.)

As per Claim 15, the combination of Drumond, Ling, and Retter teaches the method of claim 11 as well as outlier activation values (see Rejection to Claim 11).  Ling teaches generating uncompressed activation values by reading the stored [outlier] activation values from the computer-readable memory or storage device (Ling discloses storing (and thus retrieving) activation values from memory.  Ling, Sec 2.2 Last Paragraph, discloses:  “In addition to implementing dot-products in block floating point form, we also can store the data in block floating point form.”  Here, Ling discloses calculating dot products with block floating point form. The dot product calculation in a neural network comprises activations, therefore Ling discloses converting activation values to block floating point form (i.e., second activation values).   Finally, Ling discloses storing the data, which includes activation values, in block floating point form (i.e., storing the second activation values in memory).  Ling, Table 3, discloses “Block size vs no blocking used to store weights and intermediate data (feature maps)”, wherein feature maps are also known in the art as “activation maps” and comprise activations.   It has not been stated that the values have been “compressed”, and “uncompressed” has been given no meaning in this limitation, therefore, these activation values can be considered uncompressed.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Ling with Drumond for at least the reasons recited in Claim 1.
However, the combination of Drumond, Ling, and Retter does not yet teach wherein for each of the uncompressed activation values: when the uncompressed activation value is associated with an outlier activation value, generating a respective one of the uncompressed activation values by: combining a first value defined by a first mantissa for the uncompressed activation value and a shared exponent with a second value defined by a second mantissa associated with an outlier exponent associated with the uncompressed activation value, and when the uncompressed activation value is not associated with an outlier activation value, generating a respective one of the uncompressed activation values by: producing a first value defined by a first mantissa for the uncompressed activation value and a shared exponent.
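For illustration only, the two-branch reconstruction recited in this limitation can be sketched as follows.  This is a minimal sketch under stated assumptions, not the claimed implementation: the name decompress and the dictionary layout (index mapped to a pair of second mantissa and outlier exponent) are hypothetical, and each value is taken to be mantissa * 2**exponent, computed with Python's math.ldexp.

```python
import math


def decompress(first_mantissas, shared_exp, outliers):
    """Rebuild values; outliers maps index -> (second_mantissa, outlier_exp)."""
    result = []
    for i, m in enumerate(first_mantissas):
        # First value: first mantissa scaled by the shared exponent.
        value = math.ldexp(m, shared_exp)
        if i in outliers:
            # Outlier case: add the second value, scaled by its own exponent.
            second_mantissa, outlier_exp = outliers[i]
            value += math.ldexp(second_mantissa, outlier_exp)
        result.append(value)
    return result
```

For example, decompress([4, -2, 4], -3, {2: (3, 0)}) returns [0.5, -0.25, 3.5]: the first two values use only the shared exponent, while the third combines 4 * 2**-3 with the outlier contribution 3 * 2**0.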
Recall that Retter teaches storing outlier values in an extra 2 bit overflow register.  However, Retter does not suggest an outlier exponent with this.
Frantz teaches second value defined by a second mantissa associated with an outlier exponent (Frantz, Para [0062-0063], discloses “Each block is assigned an exponent value depending on the mean brightness of that block. Then each pixel is assigned a relative mantissa value that represents its brightness compared to the mean brightness. For example, if one were to take a picture of a person standing in front of a window, the blocks would be such that the person's body would be represented by a set of blocks with the same exponent value (or similar), and the window would be represented by a set of blocks with another exponent value. In FIG. 5 and FIG. 6, it is evident how the blocks might be created for the picture shown. The very bright areas would have high exponent values, and the dark areas would have low exponent values. In the example, square blocks of pixels are used, but any shape of the blocks would be appropriate--even non-rectangular. The ultimate solution is that the mantissas for all of the pixels in the scene create a normalized picture, and the exponent values indicate the variation of brightness over the whole scene.  If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.”  Here, Frantz discloses storing outliers with their own exponent.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Frantz with Drumond, Ling, and Retter for at least the reasons recited in Claim 12.
	The combination of Drumond, Ling, Retter, and Frantz further teaches wherein for each of the uncompressed activation values: when the uncompressed activation value is associated with an outlier activation value, generating a respective one of the uncompressed activation values by: combining a first value defined by a first mantissa for the uncompressed activation value and a shared exponent with a second value defined by a second mantissa associated with an outlier exponent associated with the uncompressed activation value, and when the uncompressed activation value is not associated with an outlier activation value, generating a respective one of the uncompressed activation values by: producing a first value defined by a first mantissa for the uncompressed activation value and a shared exponent. (Ling teaches storing and retrieving activation values from memory. Frantz discloses storing outlier values with their own outlier exponent, thus their own BFP format.   Recall that Retter stores values in 2 pieces (a first register and a 2-bit overflow register), and thus one of ordinary skill in the art will appreciate that in order to use the full outlier value in calculations, one would have to combine the values of these 2 registers.  Thus, when combined with Frantz’s storing of outlier values in the same block floating point format, with an outlier exponent, the combination would result in having to combine the values of the first mantissa and first exponent, and outlier mantissa and outlier exponent, in order to retrieve the full value.)
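For illustration only, the two-branch reconstruction recited in Claim 15 can be sketched in Python as follows. This sketch is not taken from Drumond, Ling, Retter, or Frantz. The function name `decompress`, the argument layout, and the choice of addition as the "combining" step are assumptions made solely to show the arithmetic.

```python
def decompress(first_mantissa, shared_exponent, outlier=None):
    """Rebuild one uncompressed activation value (hypothetical helper).

    outlier is None for an ordinary value, or a pair
    (second_mantissa, outlier_exponent) for an outlier value.
    """
    # First value: first mantissa scaled by the exponent shared by the block.
    value = first_mantissa * 2.0 ** shared_exponent
    if outlier is not None:
        # Second value: second mantissa scaled by the outlier's own exponent.
        second_mantissa, outlier_exponent = outlier
        # Assumption: "combining" is modeled as simple addition.
        value += second_mantissa * 2.0 ** outlier_exponent
    return value


ordinary = decompress(5, -3)             # 5 * 2**-3
with_outlier = decompress(5, -3, (3, 0))  # 5 * 2**-3 + 3 * 2**0
```

In this sketch an ordinary value uses only the first mantissa and the shared exponent, while an outlier value adds a second term scaled by its own exponent.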
	
As per Claim 17, the combination of Drumond, Ling, and Retter teaches the method of claim 11.  Drumond teaches producing the first activation values by performing forward propagation for at least one layer of the neural network (Drumond, Page 5 Sec 5.1 Line 6, discloses “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative”.  Here, Drumond discloses producing first activation values in the forward pass, which are then converted to BFP.)
(a) converting the stored, second activation values into uncompressed activation values in the first block floating-point format (Drumond, top of Page 6, discloses changing between block floating point formats for various operations, including weight updates:  “We created a shell optimizer that takes the original optimizer, performs its update function in FP32 and converts the weights to two BFP formats: one with wide and another with narrow mantissas. The former is used in future weight updates while the latter is used in forward and backward passes.”)
(b) performing a gradient operation with the uncompressed activation values (Drumond Page 6 Section 5.3 discloses performing training operations, including calculating training loss in a “loss unit”:  “We implemented the basic operations needed for neural network training (i.e., matrix multiplication, transpose, convolutions, outer product, weight update, and data movement operations) using a dataflow similar to [28]. We employ a matrix multiplication (MatMul) unit followed by an activation/loss unit, sized to maximize resource utilization in the FPGA.”  In the art, weight update is done using gradient descent, and this is mentioned in Page 4 Para 2:  “DNN training requires numeric representations with wide range because, as the loss value and the learning rates decrease, the gradient values also decrease, often by several orders of magnitude.”)
and (c) updating weights for at least one node of the neural network based on the uncompressed activation values (Drumond, top of Page 6, discloses performing weight updates with the more precise values with a “wide” mantissa:  “We created a shell optimizer that takes the original optimizer, performs its update function in FP32 and converts the weights to two BFP formats: one with wide and another with narrow mantissas. The former is used in future weight updates while the latter is used in forward and backward passes.”)

As per Claim 21, the combination of Drumond, Ling, and Retter teaches the method of claim 11 as well as activation values (see Rejection to Claim 11). Retter teaches wherein the at least one but not all of the first activation values that are converted are larger than unconverted activation values. (Retter, Col 3 Lines 58-65, discloses:  “The overflow of up to two bits is stored in a 2-bit latch 28, and at the end of each FFT pass the count stored in latch 28 is transferred to a 4-bit counter 30 and to a second 2-bit latch 32. The 4-bit counter 30 can store up to a 16 value overflow for each complete FFT operation, and at the end of each FFT operation the count stored in counter 30 is transferred to a scale register 34 and updates a maximum scale register 26. Maximum scale register 26 is then loaded into the old max scale register 24 prior to the beginning of subsequent operations on the data.”  As seen above, Retter discloses a block floating-point format, where larger values (those that overflow) are converted to a format where they have overflow bits stored in a 2-bit register.  These values are larger than those that do not cause overflow.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Retter with Drumond and Ling for at least the reasons recited in Claim 1.

As per Claim 22, Drumond teaches one or more computer-readable storage devices or media storing computer-executable instructions, which when executed by a computer, cause the computer to perform a method, the method comprising: 
(Drumond, Page 1 Abstract, discloses “computing requirements”:  “The wide adoption of DNNs has given birth to unrelenting computing requirements, forcing datacenter operators to adopt domain-specific accelerators to train them” and Drumond, Conclusion, ends with “Higher throughput leads to faster and more energy-efficient DNN training/inference, while model compression leads to lower bandwidth requirements for off-chip memory, lower capacity requirements for on-chip memory and lower communication bandwidth requirements for distributed training”.  Here, Drumond discloses memory (i.e., a computer-readable storage device)).
producing first activation values for a neural network in a first block floating-point format; converting the first activation values to second activation values in a second block floating-point format (Drumond, Sec 5.1 Line 6, discloses “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative”.  Here, Drumond discloses that activations are converted from normal floating point (which has a precision of 32 bits) to block floating point (BFP).  Thus, the neural network has produced activation values in FP32, a first precision.  The BFP format has a different, lower precision, as disclosed by Drumond at the bottom of Page 2:  “an exploration of the HBFP design space showing that DNNs trained on BFP with 12- and 8-bit mantissas match FP32 accuracy, serving as a drop-in replacement for this representation.”  Here the BFP precision is 12 or 8 bits, rather than the full 32, and is thus a different, second precision than the first precision.
Examiner notes that the first activation values described above are FP32, which is normal floating point, and not block floating point format.  However, Drumond also suggests converting from one precision of BFP to another during different operations of the neural network.  Drumond, top of Page 6, discloses:  “We handle the weights in the optimizer. We created a shell optimizer that takes the original optimizer, performs its update function in FP32 and converts the weights to two BFP formats: one with wide and another with narrow mantissas. The former is used in future weight updates while the latter is used in forward and backward passes.”  Here, Drumond discloses BFP formats with “wide” and “narrow” mantissas.  Thus, both the first and second floating point formats for the activation values can be BFP.)
However, Drumond does not explicitly teach storing activation values in block floating point format; generating additional bits for at least one but not all of the second activation values into a block floating-point format for outlier values, the converting resulting in outlier activation values in the block floating-point format for outlier values, the block floating-point format for outlier values having at least one additional bit of precision than the second block floating-point format; and storing the outlier activation values in the block floating-point format for outlier values.
Ling teaches storing activation values in block floating point format (Ling, Sec 2.2 Last Paragraph, discloses:  “In addition to implementing dot-products in block floating point form, we also can store the data in block floating point form.”  Here, Ling discloses calculating dot products with block floating point form. The dot product calculation in a neural network comprises activations, therefore Ling discloses converting activation values to block floating point form (i.e., second activation values).   Finally, Ling discloses storing the data, which includes activation values, in block floating point form (i.e., storing the second activation values in memory).  Ling, Table 3, discloses “Block size vs no blocking used to store weights and intermediate data (feature maps)”, wherein feature maps are also known in the art as “activation maps” and comprise activations.)   
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Drumond and Ling for at least the reasons recited in Claim 1.
Drumond and Ling thus far fail to teach generating additional bits for at least one but not all of the second activation values into a block floating-point format for outlier values, the converting resulting in outlier activation values in the block floating-point format for outlier values, the block floating-point format for outlier values having at least one additional bit of precision than the second block floating-point format; and storing the outlier activation values in the block floating-point format for outlier values.
Retter teaches generating additional bits for at least one but not all of the second activation values into a block floating-point format for outlier values, the converting resulting in outlier activation values in the block floating-point format for outlier values, the block floating-point format for outlier values having at least one additional bit of precision than the second block floating-point format  (Retter, Col 2 Lines 37-50, discloses:  “An object of the present invention is improved circuitry for expeditiously implementing block floating point arithmetic such as used in executing fast Fourier transforms. A feature of the invention is the use of a maximum scale register for the largest scale of data in a block, and the scaling of data based on the difference in the maximum scale register content and the individual data scale factor. Another feature of the invention is latch means for storing overflows during the processing of block data and the adjustment of data in subsequent data to conform with the overflow N-scale.”  Retter provides further detail in Col 3 Lines 58-65:  “The overflow of up to two bits is stored in a 2-bit latch 28, and at the end of each FFT pass the count stored in latch 28 is transferred to a 4-bit counter 30 and to a second 2-bit latch 32. The 4-bit counter 30 can store up to a 16 value overflow for each complete FFT operation, and at the end of each FFT operation the count stored in counter 30 is transferred to a scale register 34 and updates a maximum scale register 26. Maximum scale register 26 is then loaded into the old max scale register 24 prior to the beginning of subsequent operations on the data.”  As seen above, Retter discloses a block floating-point format, wherein outlier values (“overflow”) have additional bits of precision (“The overflow of up to two bits is stored in a 2-bit latch 28”)).
and storing the outlier activation values in the block floating-point format for outlier values. (Retter, as shown above, discloses “block floating point” and storing the outlier bits in a “2-bit latch”.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Retter with Drumond and Ling for at least the reasons recited in Claim 1.
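As a purely illustrative aid to the limitations discussed for Claim 22, the following Python sketch shows one way a block of values could be quantized to a shared exponent while at least one but not all of the values receive additional mantissa bits. It is not drawn from Drumond, Ling, or Retter. The function name `quantize_block`, its parameters, and the rule that a value becomes an outlier when its mantissa exceeds the narrow signed range are all assumptions.

```python
def quantize_block(values, shared_exponent, mantissa_bits, extra_bits):
    """Quantize values to integer mantissas under one shared exponent.

    Returns a list of (mantissa, is_outlier) pairs. Hypothetical sketch:
    outliers are stored with mantissa_bits + extra_bits of signed range.
    """
    scale = 2.0 ** shared_exponent
    narrow_limit = 2 ** (mantissa_bits - 1) - 1              # ordinary range
    wide_limit = 2 ** (mantissa_bits + extra_bits - 1) - 1   # outlier range
    result = []
    for v in values:
        m = round(v / scale)  # integer mantissa relative to the shared exponent
        if abs(m) <= narrow_limit:
            result.append((m, False))
        else:
            # Assumption: outliers saturate at the wider range, not the narrow one.
            m = max(-wide_limit, min(wide_limit, m))
            result.append((m, True))
    return result


block = quantize_block([0.5, 1.0, -0.75, 6.0], shared_exponent=-2,
                       mantissa_bits=4, extra_bits=2)
```

With a 4-bit signed mantissa the ordinary range is limited to magnitudes of 7 or less, so in this example only the value 6.0 (mantissa 24 at exponent -2) is flagged as an outlier and kept with the two additional bits.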

Claims 4, 12-13, and 23-25 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Drumond, Ling, and Retter further in view of Frantz (US 2012/0262597 A1).
As per Claim 4, the combination of Drumond, Ling, and Retter teaches the neural network of claim 1. However, the combination of Drumond, Ling, and Retter does not teach wherein each of the outlier values for the at least one but not all of the first activation values comprises a respective outlier exponent.
Frantz teaches wherein each of the outlier values for the at least one but not all of the first activation values comprises a respective outlier exponent. (Frantz, Para [0062-0063], discloses “Each block is assigned an exponent value depending on the mean brightness of that block. Then each pixel is assigned a relative mantissa value that represents its brightness compared to the mean brightness. For example, if one were to take a picture of a person standing in front of a window, the blocks would be such that the person's body would be represented by a set of blocks with the same exponent value (or similar), and the window would be represented by a set of blocks with another exponent value. In FIG. 5 and FIG. 6, it is evident how the blocks might be created for the picture shown. The very bright areas would have high exponent values, and the dark areas would have low exponent values. In the example, square blocks of pixels are used, but any shape of the blocks would be appropriate--even non-rectangular. The ultimate solution is that the mantissas for all of the pixels in the scene create a normalized picture, and the exponent values indicate the variation of brightness over the whole scene.  If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.”  Here, Frantz discloses that some of the block floating point values, which comprise exponents, saturate the mantissa (I.e., are outliers).  Note Examiner’s diagram of Frantz below:

[Examiner’s diagram of Frantz: media_image1.png (greyscale image, not reproduced here)]

Here, Frantz discloses a method of conversion to block floating point format, wherein some of the values “saturate the mantissa” (i.e., are outliers).  These are stored in a second block floating point format with second shared exponents and activation values, which represent the outlier values.  The outlier value indeed has “higher precision”, as it required more bits to store under the original exponent, and the fact that the blocks had to be split into smaller blocks also indicates a more targeted and “precise” representation of the outlier value.)
	Frantz and the combination of Drumond, Ling, and Retter are analogous art because they are both in the field of endeavor of efficient floating point operations. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, Ling, and Retter, with the overflow storage of Frantz. The modification would have been obvious because one of ordinary skill in the art would be motivated to avoid losing data. (Drumond, Page 4 Section 4, discloses: “In this example, BFP can only represent a accurately if the value distribution of a is not too wide to be captured by m_a and the exponent e_a is representative of said value distribution. If e_a is too large then small values are lost and the most significant bits of the mantissas are wasted. If e_a is too small, then the larger values in a will be saturated, leading to data loss.”)

As per Claim 12, the combination of Drumond, Ling, and Retter teaches the method of claim 11. However, the combination of Drumond, Ling, and Retter does not teach wherein each of the outlier activation values in the block floating-point format having outlier values comprises a first mantissa associated with a shared exponent shared by all of the outlier values and a second mantissa associated with a different exponent than the shared exponent.
Frantz teaches wherein each of the outlier activation values in the block floating-point format having outlier values comprises a first mantissa associated with a shared exponent shared by all of the outlier values and a second mantissa associated with a different exponent than the shared exponent. (Frantz, Para [0063], discloses “If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.” Here, the value comprises a first mantissa associated with a shared exponent shared by all of the outlier values.  This first mantissa, however, is incomplete, as it saturates the bits.  The value also comprises a second (and third, possibly fourth, etc) mantissa associated with a different exponent, as Frantz calculates a new exponent on the deconstructed data.  While Frantz is splitting up the data the value is representing, the overall value of this data still “comprises” the first incomplete value and the smaller constituent values, as an overall property of the data.)
Frantz and the combination of Drumond, Ling, and Retter are analogous art because they are both in the field of endeavor of efficient floating point operations. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, Ling, and Retter, with the overflow storage of Frantz. The modification would have been obvious because one of ordinary skill in the art would be motivated to avoid losing data. (Drumond, Page 4 Section 4, discloses: “In this example, BFP can only represent a accurately if the value distribution of a is not too wide to be captured by m_a and the exponent e_a is representative of said value distribution. If e_a is too large then small values are lost and the most significant bits of the mantissas are wasted. If e_a is too small, then the larger values in a will be saturated, leading to data loss.”)

As per Claim 13, the combination of Drumond, Ling, Retter, and Frantz teaches the method of claim 12.  Frantz teaches wherein the first mantissa and the second mantissa each comprise the same number of bits. (Frantz, Abstract, discloses that all values will be in the same format:  “Embodiments of the invention provide a 16 bit floating point signal processor”, and Frantz Fig 4 shows an 11 bit mantissa.  Frantz does not change the number of bits in the mantissa when splitting up an outlier.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Frantz with Drumond, Ling, and Retter for at least the reasons recited in Claim 12.

As per Claim 23, the combination of Drumond, Ling, and Retter teaches the computer-readable storage devices or media of claim 22.  Frantz teaches wherein each of the outlier activation values in the block floating-point format for outlier values comprises a first mantissa associated with a shared exponent shared by all of the outlier values (Frantz, Para [0063], discloses “If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.” Here, the value comprises a first mantissa associated with a shared exponent shared by all of the outlier values.  This first mantissa, however, is incomplete, as it saturates the bits.  The value also comprises a second (and third, possibly fourth, etc) mantissa associated with a different exponent, as Frantz calculates a new exponent on the deconstructed data.  While Frantz is splitting up the data the value is representing, the overall value of this data still “comprises” the first incomplete value and the smaller constituent values, as an overall property of the data.)
Frantz and the combination of Drumond, Ling, and Retter are analogous art because they are both in the field of endeavor of efficient floating point operations. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, Ling, and Retter, with the overflow storage of Frantz. The modification would have been obvious because one of ordinary skill in the art would be motivated to avoid losing data. (Drumond, Page 4 Section 4, discloses: “In this example, BFP can only represent a accurately if the value distribution of a is not too wide to be captured by m_a and the exponent e_a is representative of said value distribution. If e_a is too large then small values are lost and the most significant bits of the mantissas are wasted. If e_a is too small, then the larger values in a will be saturated, leading to data loss.”)

As per Claim 24, the combination of Drumond, Ling, Retter, and Frantz teaches the computer-readable storage devices or media of claim 23.  Frantz teaches wherein each of the outlier activation values in the block floating-point format for outlier values comprises a second mantissa associated with a different exponent than the shared exponent (Frantz, Para [0063], discloses “If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.” Here, the value comprises a first mantissa associated with a shared exponent shared by all of the outlier values.  This first mantissa, however, is incomplete, as it saturates the bits.  The value also comprises a second (and third, possibly fourth, etc) mantissa associated with a different exponent, as Frantz calculates a new exponent on the deconstructed data.  While Frantz is splitting up the data the value is representing, the overall value of this data still “comprises” the first incomplete value and the smaller constituent values, as an overall property of the data.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Frantz with Drumond, Ling, and Retter for at least the reasons recited in Claim 23.

As per Claim 25, the combination of Drumond, Ling, Retter, and Frantz teaches the computer-readable storage devices or media of claim 24.  Frantz teaches wherein the first mantissa and the second mantissa each comprise the same number of bits. (Frantz, Abstract, discloses that all values will be in the same format:  “Embodiments of the invention provide a 16 bit floating point signal processor”, and Frantz Fig 4 shows an 11 bit mantissa.  Frantz does not change the number of bits in the mantissa when splitting up an outlier.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Frantz with Drumond, Ling, and Retter for at least the reasons recited in Claim 23.

Claims 6 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Drumond, Ling, and Retter further in view of Hongxiang et al. (“Reconfigurable Acceleration of 3D-CNNs for Human Action Recognition with Block Floating-Point Representation”; hereinafter Hongxiang).
As per Claim 6, the combination of Drumond, Ling, and Retter teaches the computing system of claim 5 as shown above.  However, the combination of Drumond, Ling, and Retter does not teach wherein the outlier quantizer further comprises: a shift controller coupled to a shifter, the shifter being configured to, based on the selected shared exponent, shift the first activation values to produce mantissas for the second values and mantissas for the outlier values for the at least one but not all of the first activation values. 
Hongxiang teaches wherein the outlier quantizer further comprises: a shift controller coupled to a shifter, the shifter being configured to, based on the selected shared exponent, shift the first activation values to produce mantissas for the second values and mantissas for the outlier values for the at least one but not all of the first activation values. (Hongxiang, Page 288 Sec II B I, discloses “Similar to floating-point (FP), BFP representation utilizes a mantissa and an exponent to represent a wide range of value. However, BFP separates the data into different blocks. The numbers in the same block have a joint scaling factor that corresponds to the largest exponent value within that block.”  Here, Hongxiang discloses that a shared exponent is selected as the largest exponent value.  Hongxiang, Page 291 Sec III D “Accumulator” Lines 5-10, discloses “Essentially, reordering is first performed on the number to find the maximal exponent. Then the accumulator calculates the discrepancies between the maximum and the two other smaller exponents. The results are fed into shift module to complete the mantissa alignment.”  Here, Hongxiang discloses a shift module (i.e., a shift controller coupled to a shifter) that shifts values to produce mantissas based on a selected shared exponent.  Hongxiang, Page 291 Figure 8, shows this shift module as a hardware component integrated into the circuit (i.e., outlier quantizer).)
Hongxiang and the combination of Drumond, Ling, and Retter are analogous art because they are in the field of endeavor of quantizing neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, Ling, and Retter, with the shifter of Hongxiang. One of ordinary skill in the art would be motivated to do so in order to maintain accuracy during BFP calculations (Hongxiang, Page 288 under Figure 2:  “The block size and the number distribution are the two major factors that affect the precision loss of BFP. As numbers belonging to one block share the same exponent value, the mantissa of each number needs to be shifted to align.”)

As per Claim 9, the combination of Drumond, Ling, Retter, and Hongxiang teaches the neural network of claim 1 as shown above.  Hongxiang teaches the outlier quantizer comprises a shifter configured to shift mantissas of the first activation values according to a shared exponent (Hongxiang, Sec II B I, discloses “Similar to floating-point (FP), BFP representation utilizes a mantissa and an exponent to represent a wide range of value. However, BFP separates the data into different blocks. The numbers in the same block have a joint scaling factor that corresponds to the largest exponent value within that block.”  Here, Hongxiang discloses that a shared exponent is selected as the largest exponent value.  Hongxiang, Sec III D “Accumulator” Lines 5-10, discloses “Essentially, reordering is first performed on the number to find the maximal exponent. Then the accumulator calculates the discrepancies between the maximum and the two other smaller exponents. The results are fed into shift module to complete the mantissa alignment.”  Here, Hongxiang discloses a shift module (i.e., a shift controller coupled to a shifter) that shifts values to produce mantissas based on a selected shared exponent, a portion (which may be the entirety) of the result thereby producing a mantissa for second values.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Hongxiang with Drumond, Ling, and Retter for at least the reasons recited in Claim 6.
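The mantissa alignment described in the quoted Hongxiang passages (find the maximal exponent, compute each discrepancy, shift) can be sketched in Python as follows. This is an illustrative software model only, not Hongxiang's hardware shift module. The function name `align_to_shared_exponent`, the use of integer mantissas, and the right-shift rounding behavior are assumptions.

```python
def align_to_shared_exponent(pairs):
    """Align (mantissa, exponent) pairs to the largest exponent in the block.

    Hypothetical model: mantissas are non-negative integers, and alignment
    is a right shift by the exponent discrepancy (low bits are discarded).
    """
    # Shared exponent: the largest exponent value within the block.
    shared = max(exponent for _, exponent in pairs)
    # Shift each mantissa right by its distance from the shared exponent.
    aligned = [mantissa >> (shared - exponent) for mantissa, exponent in pairs]
    return shared, aligned


shared, aligned = align_to_shared_exponent([(12, 3), (10, 1), (7, 0)])
```

In this example the shared exponent is 3, the second mantissa is shifted right by 2 and the third by 3, so the smaller values lose low-order bits. This mirrors the precision loss that the quoted passage attributes to sharing one exponent across a block.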
	 
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Drumond, Ling, and Retter further in view of Koster et al. (“Flexpoint: An adaptive numerical format for efficient training of deep neural networks”; hereinafter Koster).
As per Claim 7, the combination of Drumond, Ling and Retter teaches the neural network of claim 1 as shown above.  However, the combination of Drumond, Ling, and Retter does not teach wherein the outlier quantizer comprises a comparator to identify whether a particular one of the first activation values is selected as one of the outlier values for the at least one but not all of the first activation values.
Koster teaches wherein the outlier quantizer comprises a comparator to identify whether a particular one of the first activation values is selected as one of the outlier values for the at least one but not all of the first activation values.  (Koster, Page 4 Sec 3.2 Last Paragraph, discloses implementation of their system in hardware “Finally, to implement Flexpoint efficiently in hardware, the output exponent has to be determined before the operation is actually performed. Otherwise the intermediate result needs to be stored in high precision, before reading the new exponent and quantizing the result, which would negate much of the potential savings in hardware. Therefore, intelligent management of the exponents is required.”  Koster, Page 4 Section 3.3 Para 3, defines Gamma as a maximum absolute value based on the activations: “The Autoflex algorithm tracks the maximum absolute value Gamma, of the mantissa of every tensor, by using a dequeue to store a bounded history of these values.”   Koster, Page 5 Section 3.4, discloses identifying which values are outliers: “At the beginning of training, the statistics queue is empty, so we use a simple trial-and-error scheme described in Algorithm 1 to initialize the exponents. We perform each operation in a loop, inspecting the output value of Gamma for overflows or underutilization, and repeat until the target exponent is found.”  Page 5 Algorithm 1 discloses “If Gamma >= 2^(N-1) – 1 then overflow”.  Here, Koster is using a hardware component to identify outliers, and the >= operation in the algorithm is indicative of the use of a comparator.)
Koster and the combination of Drumond, Ling, and Retter are analogous art because they are in the field of endeavor of quantizing neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, Ling, and Retter, with the comparator to identify overflows of Koster. The modification would have been obvious because one of ordinary skill in the art would be motivated to not lose precision due to overflow so as to not impede convergence of a neural network during training (Koster, Page 2 Section 2 Para 3 Last sentence: “The main drawback is that this update mechanism only passively reacts to overflows rather than anticipating and preemptively avoiding overflows; this turns out to be catastrophic for maintaining convergence of the training.”)
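For illustration only, the comparison quoted from Koster's Algorithm 1 (Gamma >= 2^(N-1) – 1) can be sketched as follows. The function name and the list-of-integers input are assumptions made for this sketch; it shows only the threshold comparison and is not an implementation of the full Autoflex algorithm.

```python
def overflows(mantissas, n_bits):
    """Return True when the largest absolute mantissa reaches the overflow threshold.

    mantissas: iterable of signed integer mantissas (hypothetical input form).
    n_bits: mantissa width N used in the threshold 2**(N-1) - 1.
    """
    # Gamma is the maximum absolute mantissa value of the tensor.
    gamma = max(abs(m) for m in mantissas)
    # The >= test is the comparator operation relied upon in the rejection.
    return gamma >= 2 ** (n_bits - 1) - 1


print(overflows([100, -32767], 16))
print(overflows([100, -200], 16))
```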

Claims 8 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Drumond, Ling, and Retter further in view of Nurvitadhi et al. (US 2019/0205746 A1; hereinafter Nurvitadhi).
As per Claim 8, the combination of Drumond, Ling, and Retter teaches the neural network of claim 1 as shown above.  However, the combination of Drumond, Ling, and Retter does not teach wherein the outlier quantizer comprises an address register, and wherein the outlier quantizer is configured to store an index in the memory indicating an address for at least one of the outlier values.
Nurvitadhi teaches wherein the outlier quantizer comprises an address register, and wherein the outlier quantizer is configured to store an index in the memory indicating an address for at least one of the outlier values (Nurvitadhi, Para [0128], discloses the use of an address register:  “When indirect register addressing mode is used, the register address of one or more operands may be computed based on an address register value and an address immediate field in the instruction.”  Here, Nurvitadhi’s indirect addressing mode means that when the processor is processing an instruction that accesses an operand, the address immediate field of the instruction is not directly the address of the operand.  Rather, the address immediate field of the instruction is actually the address of the address register, which in turn contains the actual address of the operand.  Therefore, the address register functions as a lookup index in the memory which indicates an address of at least one operand. As for the “operands”, Nurvitadhi discloses that “operands” are used in dot product calculations in [0129]:  “The vector math group performs arithmetic such as dot product calculations on vector operands”, and that these calculations are for a neural network in [0060]:  “In embodiments, mechanisms for performing sparse matrix processing for arbitrary neural networks are disclosed.”  Dot product calculations in neural networks are matrix operations in which weights and activations are the operands.  Therefore, in combination with the outlier activation values established by the combination of Drumond, Ling, and Retter, the operands comprise outlier values, and the address register therefore stores an index in memory indicating an address for at least one of the outlier values.)
Nurvitadhi and the combination of Drumond, Ling, and Retter are analogous art because they are in the field of endeavor of quantizing neural networks. 
It would have been obvious before the effective filing date to combine the quantizing neural network accelerator of Drumond, Ling, and Retter, with the address register for operands of Nurvitadhi. One of ordinary skill in the art would be motivated to do so in order to be able to keep track of the outlier values, without having to store copies of them, but rather to point to where they are already stored, thus saving on space and complexity (Nurvitadhi [0128]:  “When indirect register addressing mode is used, the register address of one or more operands may be computed based on an address register value and an address immediate field in the instruction.”)

As per Claim 10, the combination of Drumond, Ling, Retter, and Nurvitadhi teaches the neural network of claim 1.  Drumond teaches the system further comprises a hardware accelerator including a memory temporarily storing the first activation values for at least a portion of a layer of the neural network. (Drumond, Sec 5.3, discloses a hardware accelerator:  “To further illustrate this point, we synthesized a proof-of-concept FPGA-based accelerator.”   Drumond, Sec 5.3 Last Line, discloses storing activation values in on-chip memory:  “The proof-of-concept accelerator operates with both weights and activations stored on-chip.”  This will be temporary, as the activation values will change during the forward and backward passes.  Drumond, Conclusion, specifically uses the term “on-chip memory”:  “Higher throughput leads to faster and more energy-efficient DNN training/inference, while model compression leads to lower bandwidth requirements for off-chip memory, lower capacity requirements for on-chip memory and lower communication bandwidth requirements for distributed training”.)
However, Drumond does not explicitly teach the processors comprise at least one of the following: a tensor processing unit, a neural network accelerator, a graphics processing unit, or a processor implemented in a reconfigurable logic array; and the memory is situated on a different integrated circuit than the processors, the memory includes dynamic random access memory (DRAM) or embedded DRAM; the hardware accelerator memory including static RAM (SRAM) or a register file.
Nurvitadhi teaches the processors comprise at least one of the following: a tensor processing unit, a neural network accelerator, a graphics processing unit, or a processor implemented in a reconfigurable logic array (Nurvitadhi, Para [0159], discloses the processor comprises a graphics processing unit:  “FIG. 10 illustrates exemplary graphics software architecture for a data processing system 1000 according to some embodiments. In some embodiments, software architecture includes a 3D graphics application 1010, an operating system 1020, and at least one processor 1030. In some embodiments, processor 1030 includes a graphics processor 1032 and one or more general-purpose processor core(s) 1034. The graphics application 1010 and operating system 1020 each execute in the system memory 1050 of the data processing system.”)
and the memory is situated on a different integrated circuit than the processors, the memory includes dynamic random access memory (DRAM) or embedded DRAM (Nurvitadhi, Para [0069], discloses that the memory includes DRAM:  “The memory device 120 can be a dynamic random access memory (DRAM) device”.  Nurvitadhi, Figure 1, shows the memory device 120 on a different integrated circuit than the processors 102)
the hardware accelerator memory including static RAM (SRAM) or a register file (Nurvitadhi, Para [0172], discloses “Memory interface may be provided via a memory controller 1265 for access to SDRAM or SRAM memory devices.”  Nurvitadhi, Figure 12, shows memory controller 1265 as being on the integrated circuit (on-chip memory)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Nurvitadhi with Drumond, Ling, and Retter for at least the reasons recited in Claim 8.

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Drumond, Ling, Retter, and Frantz further in view of Laerd Statistics (“Measures of Central Tendency”; hereinafter Laerd).
As per Claim 14, the combination of Drumond, Ling, Retter, and Frantz teaches the method of claim 12 as shown above.  Frantz teaches further comprising identifying the shared exponent by determining [at least one of a median, and/or a mode] for at least a portion of the first activation values (Frantz, Para [0062], discloses “Each block is assigned an exponent value depending on the mean brightness of that block.”  Here, Frantz discloses the mean).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Frantz with Drumond, Ling, and Retter for at least the reasons recited in Claim 12.
However, the combination of Drumond, Ling, Retter, and Frantz does not teach determining at least one of a median, and/or a mode.
Laerd teaches determining at least one of a median, and/or a mode (Recall that Frantz discloses the mean.  Laerd, Pg 2, discloses: “The mean has one main disadvantage: it is particularly susceptible to the influence of outliers.”, and later discloses:  “The median is less affected by outliers and skewed data.”  Here, Laerd discloses using the median.)
Laerd and the combination of Drumond, Ling, Retter, and Frantz are analogous art because they are both in the field of endeavor of statistics.
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the mean shared exponent of Frantz, with the median of Laerd. The modification would have been obvious because one of ordinary skill in the art would be motivated to minimize the effects of skewed data.  For example, if the mean was 0 for a data set of -4000000, 1000000, 1000000, 1000000, 1000000, 4 of those values might saturate the mantissa and be outliers.  The median (and, in fact, also mode) here would be 1000000 and the exponent would be proper for most of the data, with only 1 outlier.  (Laerd Pg 2:  “The mean has one main disadvantage: it is particularly susceptible to the influence of outliers.”…“The median is less affected by outliers and skewed data.”)
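For illustration only, the arithmetic of the example data set above can be checked with the Python standard library. The variable names are chosen for this sketch; the data values are those recited in the example.

```python
from statistics import mean, median, mode

# Data set recited in the example above: one large negative value, four equal values.
data = [-4000000, 1000000, 1000000, 1000000, 1000000]

data_mean = mean(data)      # pulled to 0 by the single large negative value
data_median = median(data)  # middle value of the sorted data
data_mode = mode(data)      # most frequent value

print(data_mean, data_median, data_mode)
```

The mean is dominated by the single skewed value, while the median and mode both land on the value shared by four of the five entries.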

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Drumond, Ling, and Retter further in view of Chong et al. (US 2019/0373264 A1; hereinafter Chong).
As per Claim 16, the combination of Drumond, Ling, and Retter teaches the method of claim 11.  However, the combination of Drumond, Ling, and Retter does not teach prior to the storing, compressing the outlier activation values stored in the computer-readable memory or storage device by one or more of the following techniques: entropy compression, zero compression, run length encoding, compressed sparse row compression, or compressed sparse column compression.
Chong teaches prior to the storing, compressing the outlier activation values stored in the computer-readable memory or storage device by one or more of the following techniques: entropy compression, zero compression, run length encoding, compressed sparse row compression, or compressed sparse column compression. (Chong, Para [0102], discloses compressing activation data of a neural network using entropy compression, and storing it:  “To reduce memory access bandwidth requirements for neural network data, the neural network device or neural network component can perform a method to compress data from intermediate nodes in the neural network in a lossless manner. For example, a neural network coding engine of the neural network device or neural network component can be used after each hidden layer of a neural network to compress the activation data output from each hidden layer. The activation data output from a hidden layer can include a 3D volume of data having a width, height, and depth, with the depth corresponding to multiple layers of filters for that hidden layer, and each depth layer having a width and height. For instance, a feature map (with activation or feature data) having a width and height is provided for each depth layer. The compressed data can be stored in a storage device or memory ... Any suitable coding technique can be used, such as variable-length coding (VLC), arithmetic coding, other type of entropy coding, or other suitable technique.”)
Chong and the combination of Drumond, Ling, and Retter are analogous art because they are in the field of endeavor of neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the neural network quantization of Drumond, Ling, and Retter, with the entropy compression of Chong. The modification would have been obvious because one of ordinary skill in the art would be motivated to reduce bandwidth and power consumption. (Chong Para [0003]:  “In some cases, for the intermediate layers of a neural network, either 8 bit or 16 bit fixed or floating point operations are performed, which requires a large memory access burden (for both internal memory and external memory). Such data access requires high bandwidth usage, which leads to largely complex processing requirements and high power consumption.”)
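For illustration only, one of the lossless techniques listed in Claim 16, run length encoding, can be sketched as follows. This is not the entropy coder disclosed by Chong; the function names and the list-of-integers input are assumptions made for this sketch.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical sparse activation data, where long runs of zeros are common.
activations = [0, 0, 0, 7, 0, 0]
encoded = rle_encode(activations)
print(encoded)
print(rle_decode(encoded) == activations)
```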

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Wegener et al. (US 2014/0208068 A1), in Para [0132], discloses “Output registers 2000a and 2000b implement a double-buffering technique, where samples are written to the first ("active") register until the register contains more than Nout=128 bits, whereupon remaining bits are stored in the second ("overflow") register… Any overflow bits remaining after the "active" register has been packed are stored in the "overflow" register.”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710. The examiner can normally be reached M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo can be reached on (571) 272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/L.A.S./Examiner, Art Unit 2126
/NICHOLAS KLICOS/Primary Examiner, Art Unit 2145