Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 04/25/2019, 05/08/2019, 05/06/2020, 05/22/2020, 08/25/2020, and 09/14/2020 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.
Specification
The disclosure is objected to because of the following informalities:
Para [051] Line 2:  “static read only memory (SRAM)” should be changed to read either “static read only memory (ROM)” or “static random access memory (SRAM)”.
Para [093] Pg 31 Line 8:  “outlier values vci” should be changed to read “outlier values oci”
[094] Line 7:  “and at least one of the following ways” should be changed to read “in at least one of the following ways”
[0105] Line 4:  “by determining that at least one of” should be changed to read “by determining at least one of”
[0119] Pg 41 Lines 4-6:  “These normal precision floating-point values 910 are converted to a set of values Q2(yi) in the first block floating-point format 930” should be changed to read “These normal precision floating-point values 920 are converted to a set of values Q2(yi) in the first block floating-point format 910”
[0120] Line 4:  “second block floating-point format 920” should be changed to read “second block floating-point format 930”
[0105]  Lines 2-3:  “exponent selector” should be changed to read “outlier selector”, in order to resolve 112a/b rejections arising from 112f in below analysis 
Appropriate correction is required.
Claim Objections
Claim 18 is objected to because of the following informalities:  "the instruction comprising" should be changed to read "the instructions comprising" in Line 4.  Appropriate correction is required.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:

(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: "an outlier selector configured to select a shared exponent for the second values" in claim 5.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
Pursuant to the above, the limitation "an outlier selector configured to select a shared exponent for the second values" in Claim 5 is being interpreted under 35 U.S.C. 112(f), as “selector” is simply a substitute for “means”, as “outlier selector” does not have any generally understood structural meaning in the art (see MPEP 2181(I)(A)) and “configured to” is a linking phrase in place of “for” that makes it clear that the claim is reciting a function (see MPEP 2181(I)(B) re: “configured to”).  Accordingly, it is being interpreted as per the Specification Para [0105] Lines 2-10, which provides acts for achieving the specified function:  “The outlier quantizer 765 can include an exponent selector that determines a shared exponent for the compressed activation values. For example, the selector can identify a shared exponent by determining that at least one of a mean (average), a median, and/or a mode for at least a portion of the activation values. In some examples, the selector can identify a shared exponent by identifying a group of largest outlying values in the set of activation values and for the remaining group of activation values not identified to be in the group of the largest outliers, determining a shared exponent by selecting the largest exponent of the remaining group. In some examples, the exponent used by the largest number of activation values is selected as the shared exponent.”  Examiner notes that applicant must resolve the objection to the specification and replace “an exponent selector” with “an outlier selector” in order to prevent 112a/b rejections for failure to provide a written description and distinctly point out the claim.
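For illustration only, two of the selection strategies quoted above from Para [0105] can be sketched as follows. This is a minimal sketch and not a characterization of applicant's implementation: the function names, the use of Python's math.frexp to obtain a base-2 exponent, and the sample activation values are all assumptions introduced here.

```python
import math
from collections import Counter


def exponent_of(x):
    """Base-2 exponent e with 2**e <= |x| < 2**(e + 1); None for zero."""
    if x == 0:
        return None
    # math.frexp returns (m, e) with 0.5 <= |m| < 1 and x == m * 2**e.
    return math.frexp(abs(x))[1] - 1


def shared_exponent_by_mode(values):
    """Hypothetical helper: exponent used by the largest number of values."""
    exps = [e for e in map(exponent_of, values) if e is not None]
    return Counter(exps).most_common(1)[0][0]


def shared_exponent_excluding_outliers(values, num_outliers):
    """Hypothetical helper: set aside the num_outliers largest exponents,
    then return the largest exponent among the remaining values."""
    exps = sorted(e for e in map(exponent_of, values) if e is not None)
    remaining = exps[:len(exps) - num_outliers]
    if not remaining:
        raise ValueError("no values remain after removing outliers")
    return max(remaining)


# Assumed sample data: four small activations and one large outlier.
activations = [0.5, 1.5, 1.25, 1.75, 96.0]
print(shared_exponent_by_mode(activations))                # 0
print(shared_exponent_excluding_outliers(activations, 1))  # 0
print(shared_exponent_excluding_outliers(activations, 0))  # 6
```

With no outliers set aside, the single large value forces the shared exponent up to 6; setting it aside lets the remaining values share exponent 0.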
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-9 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis:
In the instant case, Claims 1-10 are directed to a computing system. Thus, each of the claims falls within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
Step 2 Analysis:
Based on the claims being determined to be within one of the four categories (Step 1), it must be determined if the claims are directed to a judicial exception (i.e., law of nature, natural phenomenon, and abstract idea). In this case the claims fall within the judicial exception of an abstract idea, specifically, “Mental Processes (processes that can be performed in the human mind, or by a human using a pen and paper)”.
Step 2A: Prong 1 analysis:
The claim(s) recite(s):
Claim 1:
“convert one or more of the first activation values to an outlier block floating-point format” (mental process)
Step 2A: Prong 2 analysis:
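This judicial exception is not integrated into a practical application because the additional elements in claim 1 “memory”, “one or more processors”, and “outlier quantizer formed from at least one of the processors” correspond to mere instructions to implement an abstract idea or other exception on a computer.  Considering the further limitation “storing the second activation values and the outlier values in the memory”, the claim as a whole merely describes how to generally “apply” the concept in a computer environment (necessary data gathering and outputting, see MPEP 2106.05(g)(3)). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.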
Step 2B analysis:
Claim 1 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional limitations of claim 1 “memory”, “one or more processors”, and “outlier quantizer formed from at least one of the processors” correspond to mere instructions to implement an abstract idea or other exception on a computer. The additional limitations amount to well understood, routine, and conventional activity: “produce first activation values in a first floating-point format” (Receiving or transmitting data over a network, see MPEP 2106.05(d)(II)(i)) and “storing the second activation values and the outlier values in the memory” (Storing and retrieving information in memory, see MPEP 2106.05(d)(II)(iv)). The claims are directed to a judicial exception.
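Dependent claim(s) 2-9 when analyzed as a whole are held to be patent ineligible under 35 U.S.C. 101 because the additional recited limitation(s) fail(s) to establish that the claim(s) is/are not directed to an abstract idea, as they recite further embellishment of the judicial exception.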
Claim 2 recites the same limitations as Claim 1, merely specifying the first floating-point format is a block floating point format. The claims are still directed to the judicial exception (mental process).
Claim 3 recites the same limitations as Claim 1, merely specifying more details about the format of values. The claims are still directed to the judicial exception (mental process).
Claim 4 recites the same limitations as Claim 1, merely specifying more details about the format of values. The claims are still directed to the judicial exception (mental process).
Claim 5 recites the same limitations as Claim 1, further performing selection of a shared exponent for the second values (mental process).  Additional element “outlier selector” is merely a subcomponent of the “outlier quantizer”, and amounts to mere instructions to implement an abstract idea or other exception on a computer.  
Claim 6 recites the same limitations as Claim 5, further performing shifting of the first activation values to produce mantissas for the second values and mantissas for the outlier values (mental process).  Additional element “shifter” is merely a subcomponent of the “outlier quantizer”, and amounts to mere instructions to implement an abstract idea or other exception on a computer.  
Claim 7 recites the same limitations as Claim 1, further performing identifying whether a particular one of the first activation values is selected as one of the outlier values (mental process).  Additional element “comparator” is merely a subcomponent of the “outlier quantizer”, and amounts to mere instructions to implement an abstract idea or other exception on a computer.
Claim 8 recites the same limitations as Claim 1, further performing, using an “address register”, which is merely a subcomponent of the “outlier quantizer”, storing an index in the memory indicating an address for at least one of the outlier values which amounts to merely “applying” the concept in a computer environment; the additional limitation does not amount to significantly more, as it is a well-understood, routine, and conventional activity (Storing and retrieving information in memory, see MPEP 2106.05(d)(II)(iv)).
Claim 9 recites the same limitations as Claim 1, further performing shifting mantissas of the first activation values according to a shared exponent selected for second values (mental process).  Additional element “shifter” is merely a subcomponent of the “outlier quantizer”, and amounts to mere instructions to implement an abstract idea or other exception on a computer.  
Claim 10 recites “temporarily storing the first activation values for at least a portion of a layer of the neural network”, and here the judicial exception is integrated into a practical application (implementation of a neural network).

Claims 11-16 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis:
In the instant case, Claims 11-17 are directed to a method. Thus, each of the claims falls within one of the four statutory categories (i.e., process, machine, manufacture, or composition of matter).
Step 2 Analysis:

Based on the claims being determined to be within one of the four categories (Step 1), it must be determined if the claims are directed to a judicial exception (i.e., law of nature, natural phenomenon, and abstract idea). In this case the claims fall within the judicial exception of an abstract idea, specifically, “Mental Processes (processes that can be performed in the human mind, or by a human using a pen and paper)”.  Examiner points out that in the below analysis, “implementing a neural network” in the preamble is not considered a limitation and is of no significance to claim construction, as the body of the claim fully and intrinsically sets forth all of the limitations of the claimed invention, and the preamble merely states, the purpose or intended use of the invention, rather than any distinct definition of any of the claimed invention’s limitations (see MPEP 2111.02(II)).
Step 2A: Prong 1 analysis:
The claim(s) recite(s):
Claim 11:
“converting at least one but not all of the first activation values to an outlier block floating-point format” (mental process)
Step 2A: Prong 2 analysis:
This judicial exception is not integrated into a practical application because the additional element in claim 11 “computing system” corresponds to mere instructions to implement an abstract idea or other exception on a computer.  Considering the further limitation “storing the outlier activation values in a computer-readable memory or storage device”, the claim as a whole merely describes how to generally “apply” the concept in a computer environment (necessary data gathering and outputting, see MPEP 2106.05(g)(3)). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.
Step 2B analysis:
Claim 11 does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional limitation of claim 11 “computing system” corresponds to mere instructions to implement an abstract idea or other exception on a computer. The additional limitations amount to well understood, routine, and conventional activity: “producing first activation values in a first block floating-point format” (Receiving or transmitting data over a network, see MPEP 2106.05(d)(II)(i)) and “storing the outlier activation values in a computer-readable memory or storage device” (Storing and retrieving information in memory, see MPEP 2106.05(d)(II)(iv)). The claims are directed to a judicial exception.
Dependent claim(s) 12-16 when analyzed as a whole are held to be patent ineligible under 35 U.S.C. 101 because the additional recited limitation(s) fail(s) to establish that the claim(s) is/are not directed to an abstract idea, as they recite further embellishment of the judicial exception.
Claim 12 recites the same limitations as Claim 11, merely specifying more details about the format of values. The claims are still directed to the judicial exception (mental process).
mental process).
Claim 14 recites the same limitations as Claim 12, further performing identifying the shared exponent by determining at least one of a mean, a median, and/or a mode (mental process).
Claim 15 recites the same limitations as Claim 1, further performing “generating…by combining” (mental process); and additional limitations “generating…by reading…from the computer-readable memory or storage device” and “generating…by producing a first value” (necessary data gathering and outputting, see MPEP 2106.05(g)(3)) amount to merely “applying” the concept in a computer environment; the additional limitations do not amount to significantly more, as they are each well-understood, routine, and conventional activity (Storing and retrieving information in memory, see MPEP 2106.05(d)(II)(iv)).
Claim 16 recites the same limitations as Claim 11, further performing compressing the outlier activation values (mental process); and additional limitations “stored in the computer-readable memory or storage device” (necessary data gathering and outputting, see MPEP 2106.05(g)(3)) amount to merely “applying” the concept in a computer environment; the additional limitations do not amount to significantly more, as they are each well-understood, routine, and conventional activity (Storing and retrieving information in memory, see MPEP 2106.05(d)(II)(iv)).
Claim 17 recites “performing forward propagation for at least one layer of the neural network”, and here the judicial exception is integrated into a practical application (implementation of a neural network).

Independent claim 18, and dependent claims 19-20, recite limitations directed to the implementation of a neural network, and are thus not directed to an abstract idea.  While these claims recite “computer-readable storage devices or media”, the specification para [0146] properly excludes signals:  “Computer-readable media are any available media that can be accessed within a computing environment 1300. By way of example, and not limitation, with the computing environment 1300, computer-readable media include memory 1320 and/or storage 1340. As should be readily understood, the term computer-readable storage media includes the media for data storage such as memory 1320 and storage 1340, and not transmission media such as modulated data signals.”

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-10 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.  In claim 1, it is unclear what is meant by an “outlier block floating point format”, as an “outlier block floating point format” is not a known term in the art.
Claims 11-17 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.  In claim 11, it is unclear what is meant by “outlier block floating point format” as this is not a term known in the art, and also unclear how the outcome of a conversion to such format results in “thereby generating outlier activation values”.
Claim 12 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.  Claim 12 recites the limitation "the outlier values" in Line 5.  There is insufficient antecedent basis for this limitation in the claim.
Claim 17 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.  Claim 17 recites the limitation "the stored, second activation values" in Line 5.  There is insufficient antecedent basis for this limitation in the claim.
Claim 17 also recites the limitation “performing backward propagation for the at least one layer of the neural network by converting the stored, second activation values to activation values in the first block floating-point format, producing uncompressed activation values”.  The meaning of this limitation is not clear to the examiner, as typically a phrase “achieving A by action B” means that action B results in A.  However, converting values from one format to another does not accomplish backward propagation of a neural network.
Claims 18-20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.  Claim 18 recites the limitation “generate second activation values for each of the first activation values by: determining that some of the first activation values will be stored as outlier values having a first mantissa associated with a shared exponent and a respective second mantissa associated with a respective outlier exponent, and determining that the remaining values of the first activation values will be stored as a first mantissa associated with the shared exponent, without an associated second mantissa.”  It is unclear to the examiner how the two instances of “determining” necessarily lead to “generating”.  It is also unclear how a “second activation value” can comprise a “first mantissa”, when the “first activation values” would presumably comprise a “first mantissa”.
Claim 20 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.  Claim 20 recites the limitation “perform backward propagation for the artificial neural network by converting at least some of the stored, second activation values to the first floating point format by: for the outlier values, adding a first value described by the outlier value's respective first mantissa and the shared exponent to a second value described by the outlier value's respective second mantissa and respective outlier exponent, and for the remaining values, producing a value described by the remaining value's respective first mantissa and the shared exponent.”  The meaning of this limitation is not clear to the examiner, as typically a phrase “achieving A by action B” means that action B results in A.  However, converting values from one format to another does not accomplish backward propagation of a neural network.
  
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Drumond et al. (“Training DNNs with Hybrid Block Floating Point”, hereinafter Drumond) in view of Ling et al. (“Harnessing Numerical Flexibility for Deep Learning on FPGAs”; hereinafter Ling) and Frantz (US 2012/0262597 A1; hereinafter Frantz).
As per Claim 1, Drumond teaches a computing system comprising: one or more processors; memory comprising computer-readable storage devices and/or memory (Drumond, Abstract, discloses “computing requirements”:  “The wide adoption of DNNs has given birth to unrelenting computing requirements, forcing datacenter operators to adopt domain-specific accelerators to train them” and Drumond, Conclusion, Ends with “Higher throughput leads to faster and more energy-efficient DNN training/inference, while model compression leads to lower bandwidth requirements for off-chip memory, lower capacity requirements for on-chip memory and lower communication bandwidth requirements for distributed training”.  Here, Drumond discloses memory and a chip (i.e., processor)).
an outlier quantizer formed from at least one of the processors, the outlier quantizer being in communication with the memory (Drumond, Sec 5.3, discloses a piece of hardware, an “accelerator”:  “HBFP accelerators exhibit arithmetic density that is similar to their fixed-point counterparts. To further illustrate this point, we synthesized a proof-of-concept FPGA-based accelerator. Figure 2 shows the block diagram of the accelerator.”  Drumond, Figure 2, discloses an “External IO interface”.  IO refers to input and output of data, therefore the accelerator must be in communication with the memory, which is where data is stored.  Drumond, Conclusion, also discloses on-chip and off-chip memory:  “Higher throughput leads to faster and more energy-efficient DNN training/inference, while model compression leads to lower bandwidth requirements for off-chip memory, lower capacity requirements for on-chip memory and lower communication bandwidth requirements for distributed training”)
produce first activation values in a first floating-point format (Drumond, Sec 5.1 Line 6, discloses “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative”.  Here, Drumond discloses BFP, or block floating point (i.e., a floating point format).  Drumond also discloses activations, which are in both the forward and backward pass, comprising two activation values.  As no specific definition is given for “first” or “second”, examiner is considering the activations on the backward pass as “first activation values”).
	with the outlier quantizer, convert one or more of the first activation values to an outlier block floating-point format, producing second activation values comprising: (1) second values in a second, block floating-point format for all of the first activation values (Drumond, Sec 5.1 Line 6, discloses “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative”.  Examiner’s Note:  Each pass of a neural network updates the activation values (i.e., converts them).  As stated above, BFP is carried out on both the backward and forward passes.  So, after a backward pass, the activation is converted into a second BFP format on the next forward pass.  Note that Drumond converts them to FP in the interim, but the net result is the same, a conversion of the activation value from the first to the second BFP format)
However, Drumond does not explicitly teach and with at least one of the processors, storing the second activation values in the memory.
Ling teaches and with at least one of the processors, storing the second activation values in the memory (Ling, Sec 2.2 Last Paragraph, discloses:  “In addition to implementing dot-products in block floating point form, we also can store the data in block floating point form.”  Here, Ling discloses calculating dot products with block floating point form. The dot product calculation in a neural network comprises activations, therefore Ling discloses converting activation values to block floating point form (i.e., second activation values).   Finally, Ling discloses storing the data, which includes activation values, in block floating point form (i.e., storing the second activation values in memory).  Ling, Table 3, discloses “Block size vs no blocking used to store weights and intermediate data (feature maps)”, wherein feature maps are also known in the art as “activation maps” and comprise activations.)  
Drumond and Ling are analogous art because they are in the field of endeavor of quantizing neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, with storing of quantized activation values of Ling. The modification would have been obvious because one of ordinary skill in the art would be motivated to reduce the memory required to store the data. (Ling, Sec 2.2 Last Paragraph, discloses:  “In addition to implementing dot-products in block floating point form, we also can store the data in block floating point form. This can lead to a significant reduction in both memory bandwidth to fetch data, and memory capacity to store the data either on or off chip.”)
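For illustration only, the storage reduction referred to in the Ling quotation can be estimated with simple bit counting. This is a minimal sketch under stated assumptions: the function name and the chosen sizes (1024 values, blocks of 8, 5-bit mantissas, 5-bit shared exponents, a 32-bit floating-point baseline) are hypothetical and are not taken from Ling.

```python
def bfp_storage_bits(num_values, block_size, mantissa_bits, exponent_bits):
    """Bits needed when each value keeps its own mantissa and each block
    of block_size values shares a single exponent."""
    num_blocks = -(-num_values // block_size)   # ceiling division
    return num_values * mantissa_bits + num_blocks * exponent_bits


# Assumed example sizes, for comparison against 32 bits per value.
num_values = 1024
bfp_bits = bfp_storage_bits(num_values, block_size=8,
                            mantissa_bits=5, exponent_bits=5)
fp32_bits = num_values * 32
print(bfp_bits)    # 5760
print(fp32_bits)   # 32768
```

Under these assumed sizes the shared exponent adds 5/8 of a bit per value, so the per-value cost is dominated by the mantissa width.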
Drumond also does not explicitly teach with the outlier quantizer, convert one or more of the first activation values to an outlier block floating-point format, producing second activation values comprising (2) outlier values for at least one but not all of the first activation values; and with at least one of the processors, storing the outlier values in the memory.
Frantz teaches with the outlier quantizer, convert one or more of the first activation values to an outlier block floating-point format, producing second activation values comprising (2) outlier values for at least one but not all of the first activation values (Frantz, Para [0062-0063], discloses “Each block is assigned an exponent value depending on the mean brightness of that block. Then each pixel is assigned a relative mantissa value that represents its brightness compared to the mean brightness. For example, if one were to take a picture of a person standing in front of a window, the blocks would be such that the person's body would be represented by a set of blocks with the same exponent value (or similar), and the window would be represented by a set of blocks with another exponent value. In FIG. 5 and FIG. 6, it is evident how the blocks might be created for the picture shown. The very bright areas would have high exponent values, and the dark areas would have low exponent values. In the example, square blocks of pixels are used, but any shape of the blocks would be appropriate--even non-rectangular. The ultimate solution is that the mantissas for all of the pixels in the scene create a normalized picture, and the exponent values indicate the variation of brightness over the whole scene.  If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.”  Here, Frantz discloses a method of conversion to block floating point format, wherein some of the values “saturate the mantissa” (i.e., are outliers)).
and with at least one of the processors, storing the outlier values in the memory (Frantz, Para [0007] Last Sentence, discloses that floating point representations require large memory:  “Floating-point pixel-level ADC image sensors require large memory to store the data, and also require a complex image reconstruction process.”  Frantz, Para [0013], then discloses:  “Therefore, a need exists for a signal and image processor to capture and store this wide dynamic range (WDR) signal in a standard format, as well as to perform the associated signal and image processing operations efficiently.”  Here, Frantz discloses that the data in the standard format they are using will be stored, and Fig. 2 discloses Memory 225.  The partitioning of the outlier values in Frantz Para [0062-0063] is part of the process of converting to the format that will be stored.)
	Drumond and Frantz are analogous art because they are in the field of endeavor of efficient floating point operations. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, with the overflow storage of Frantz. The modification would have been obvious because one of ordinary skill in the art would be motivated to avoid losing data. (Drumond, Section 4, discloses: “In this example, BFP can only represent a accurately if the value distribution of a is not too wide to be captured by ma and the exponent ea is representative of said value distribution. If ea is too large a is too small, then the larger values in a will be saturated, leading to data loss.”)
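For illustration only, the data loss described in the quoted Drumond passage can be sketched in Python. This is a minimal sketch and is not code from Drumond or Frantz; the function names bfp_quantize and bfp_dequantize, the 8-bit signed mantissa width, and the example values are assumptions introduced here.

```python
def bfp_quantize(values, shared_exp, mantissa_bits=8):
    """Approximate each value as mantissa * 2**shared_exp (hypothetical helper).

    Mantissas that fall outside the signed range are clamped, i.e. saturated.
    Returns the integer mantissas and a per-value saturation flag.
    """
    hi = 2 ** (mantissa_bits - 1) - 1
    lo = -hi
    mantissas, saturated = [], []
    for v in values:
        m = round(v / 2.0 ** shared_exp)
        clipped = max(lo, min(hi, m))
        mantissas.append(clipped)
        saturated.append(clipped != m)
    return mantissas, saturated


def bfp_dequantize(mantissas, shared_exp):
    """Convert shared-exponent mantissas back to ordinary floats."""
    return [m * 2.0 ** shared_exp for m in mantissas]


if __name__ == "__main__":
    vals = [0.5, 1.0, 100.0]
    m, s = bfp_quantize(vals, shared_exp=-6)
    print(m, s)
    print(bfp_dequantize(m, -6))
```

In this sketch the shared exponent suits the two small values, so the large value saturates and cannot be recovered on conversion back to floating point, which is the kind of loss the stated motivation refers to.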

	As per Claim 2, the combination of Drumond, Ling, and Frantz teaches the computing system of claim 1 as shown above, as well as wherein the first floating-point format is a block floating point format. (Drumond, Section 3 Last Paragraph, discloses “Given these requirements, we identify block floating-point (BFP) as the ideal numeric representation for DNNs. BFP represents numbers with a mantissa and exponent, like floating-point, but exponents are shared across entire tensors, as shown in Figure 1, resulting in dot products that can be computed entirely in fixed-point logic.”  Drumond, Section 5.1, discloses that this format is used in both the backward (i.e., first) and forward pass:  “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. Then we execute the target operation in native floating-point arithmetic. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative.”)

	As per Claim 3, the combination of Drumond, Ling, and Frantz teaches the computing system of claim 1 as shown above, as well as wherein each of the second activation values comprises a mantissa having fewer bits than its respective mantissa in the first floating-point format. (Drumond, Section 2 Para 2, discloses:  “Quantization [8] is a widely used technique for DNN inference. BFP [9] has also been proposed for inference. These techniques quantize the weights of DNNs trained with full precision floating point to use fixed-point logic during inference.  We consider the more challenging task of training DNNs with arithmetic density that matches quantized inference.”  Here, Drumond discloses that their work is an improvement on the known technique of only using BFP during inference (the forward pass, i.e., the second activation values).  In this case, the backward pass (i.e., first activation values), would be in FP.  Drumond, Sec 3 Top of Page 4, discloses:  “FP32 representations are easy to use but inefficient. They represent numbers with a 24-bit mantissa and a 8-bit exponent. In terms of precision, the 24-bit mantissa is an overkill for DNNs. Table 1 shows the validation error obtained when training ResNet-20 models on CIFAR10 using floating-point representations with various mantissas and exponent widths. We observed convergence without loss of precision with 8-bit mantissas, convergence with a small loss of precision with 4-bit mantissas, and divergence only when using 2-bit mantissas.”  Here, Drumond discloses that standard floating point format, FP32, has a 24-bit mantissa.  Drumond then discloses that in their method, which uses BFP, they use an 8-bit mantissa.  Therefore, Drumond’s disclosure of only using BFP during inference, amounts to second activation values comprising a mantissa having fewer bits than its respective mantissa in the first floating-point format.  Examiner’s Note:  Drumond states “quantize the weights…to use fixed-point logic during inference”.  In order to “use fixed-point logic” for dot product operations, the activation values must also be quantized, not only the weights.)

As per Claim 4, the combination of Drumond, Ling, and Frantz teaches the computing system of claim 1 as shown above, as well as wherein each of the outlier values comprises a respective outlier exponent. (Frantz, Para [0062-0063], discloses “Each block is assigned an exponent value depending on the mean brightness of that block. Then each pixel is assigned a relative mantissa value that represents its brightness compared to the mean brightness. For example, if one were to take a picture of a person standing in front of a window, the blocks would be such that the person's body would be represented by a set of blocks with the same exponent value (or similar), and the window would be represented by a set of blocks with another exponent value. In FIG. 5 and FIG. 6, it is evident how the blocks might be created for the picture shown. The very bright areas would have high exponent values, and the dark areas would have low exponent values. In the example, square blocks of pixels are used, but any shape of the blocks would be appropriate--even non-rectangular. The ultimate solution is that the mantissas for all of the pixels in the scene create a normalized picture, and the exponent values indicate the variation of brightness over the whole scene.  If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.”  Here, Frantz discloses that some of the block floating point values, which comprise exponents, saturate the mantissa (i.e., are outliers).)

As per Claim 5, the combination of Drumond, Ling, and Frantz teaches the computing system of claim 1 as shown above, as well as wherein the outlier quantizer comprises: an outlier selector configured to select a shared exponent for the second values.  (Drumond, Section 3 Last Paragraph Lines 3-4, discloses a shared exponent:  “However, BFP logic is denser because exponents are shared across entire tensors, resulting in dot products that can be computed entirely in fixed-point logic.”  Drumond, Section 2 Last Paragraph Lines 8-10, discloses “Our approach computes exponents more frequently and it does so in-device, without requiring any additional stat collection, and accommodating dynamic dataflows naturally.”  Here, Drumond discloses an in-device selector (i.e., in the hardware, an outlier selector in the outlier quantizer) that selects a shared exponent.  Note that in the 112(f) analysis above, examiner is interpreting “outlier selector” as the device described in [0105]:  “For example, the selector can identify a shared exponent by determining that at least one of a mean (average), a median, and/or a mode for at least a portion of the activation values.”  Frantz, Para [0062], discloses using the mean (average):  “Each block is assigned an exponent value depending on the mean brightness of that block.”)

As per Claim 18, Drumond teaches One or more computer-readable storage devices or media storing computer-executable instructions, which when executed by a computer, cause the computer to perform a method of configuring a computer system to implement an artificial neural network, the instruction comprising (Drumond, Abstract, discloses “computing requirements”:  “The wide adoption of DNNs has given birth to unrelenting computing requirements, forcing datacenter operators to adopt domain-specific accelerators to train them” and Drumond, Conclusion, ends with “Higher throughput leads to faster and more energy-efficient DNN training/inference, while model compression leads to lower bandwidth requirements for off-chip memory, lower capacity requirements for on-chip memory and lower communication bandwidth requirements for distributed training”.  Here, Drumond discloses memory (i.e., computer-readable storage device)).
instructions that cause the computer system to produce first activation values in a first floating-point format (Drumond, Sec 5.1 Line 6, discloses “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative”.  Here, Drumond discloses BFP, or block floating point (i.e., a floating point format).  Drumond also discloses activations, which are in both the forward and backward pass, comprising two activation values.  As no specific definition is given for “first” or “second”, examiner is considering the activations on the backward pass as “first activation values”).
However, Drumond does not explicitly teach instructions that cause the computer system to generate second activation values for each of the first activation values by: determining that some of the first activation values will be stored as outlier values having a first mantissa associated with a shared exponent and a respective second mantissa associated with a respective outlier exponent; and determining that the remaining values of the first activation values will be stored as a first mantissa associated with the shared exponent, without an associated second mantissa.
Frantz teaches instructions that cause the computer system to generate second activation values for each of the first activation values by: determining that some of the first activation values will be stored as outlier values having a first mantissa associated with a shared exponent and a respective second mantissa associated with a respective outlier exponent; and determining that the remaining values of the first activation values will be stored as a first mantissa associated with the shared exponent, without an associated second mantissa (Frantz, Para [0062-0063], discloses “Each block is assigned an exponent value depending on the mean brightness of that block. Then each pixel is assigned a relative mantissa value that represents its brightness compared to the mean brightness. For example, if one were to take a picture of a person standing in front of a window, the blocks would be such that the person's body would be represented by a set of blocks with the same exponent value (or similar), and the window would be represented by a set of blocks with another exponent value. In FIG. 5 and FIG. 6, it is evident how the blocks might be created for the picture shown. The very bright areas would have high exponent values, and the dark areas would have low exponent values. In the example, square blocks of pixels are used, but any shape of the blocks would be appropriate--even non-rectangular. The ultimate solution is that the mantissas for all of the pixels in the scene create a normalized picture, and the exponent values indicate the variation of brightness over the whole scene.  If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks.  If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.”  Here, Frantz discloses that some values, when converted to BFP, will be outliers and need to be stored in pieces.  They will have at least a first mantissa with a shared exponent, and a second mantissa with an outlier exponent, which in Frantz’s case, is the same as the shared exponent, as Frantz has started a new block with a shared exponent.)
Drumond also does not explicitly teach and storing the second activation values in a computer-readable storage device or memory.
Ling teaches and storing the second activation values in a computer-readable storage device or memory (Ling, Sec 2.2 Last Paragraph, discloses:  “In addition to implementing dot-products in block floating point form, we also can store the data in block floating point form.”  Here, Ling discloses calculating dot products with block floating point form. The dot product calculation in a neural network comprises activations, therefore Ling discloses converting activation values to block floating point form (i.e., second activation values).   Finally, Ling discloses storing the data, which includes activation values, in block floating point form (i.e., storing the second activation values in memory).  Ling, Table 3, discloses “Block size vs no blocking used to store weights and intermediate data (feature maps)”, wherein feature maps are also known in the art as “activation maps” and comprise activations.)   

As per Claim 19, the combination of Drumond, Ling, and Frantz teaches the computer-readable storage devices or media of claim 18.  Frantz teaches further comprising: instructions that cause the computer system to determine the shared exponent by: identifying a group of largest outliers in the first activation values; and for a remaining group of the first activation values not identified to be in the group of the largest outliers, determining the shared exponent by selecting the [largest] exponent of the remaining group (Frantz, Para [0063], discloses “If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.”  Here, Frantz identifies a group of values that saturate, or overflow, the mantissa (i.e., a group of largest outliers), and partitions these values further.  As for the remaining group, which were not in the largest outliers, Frantz Para [0062] discloses determining the shared exponent for those, by using the mean (average):  “Each block is assigned an exponent value depending on the mean brightness of that block”.)
However, Frantz does not explicitly teach determining the shared exponent by selecting the largest exponent.
Drumond teaches determining the shared exponent by selecting the largest exponent  (Drumond Sec 4.1, Last sentence, discloses:  “We convert tensors to BFP before every dot product, using the exponent of the largest tensor value, and convert the result back to floating point afterwards”).
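For illustration only, the limitation as mapped above (set aside a group of largest outliers, then take the largest exponent of the remaining values as the shared exponent) can be sketched in Python. This is a minimal sketch under stated assumptions and is not code from Frantz or Drumond; the function name select_shared_exponent, the num_outliers parameter, and the use of math.frexp to read a binary exponent are choices made here.

```python
import math


def select_shared_exponent(values, num_outliers):
    """Return (shared_exponent, outlier_indices) for a block of floats.

    The num_outliers values of largest magnitude are set aside as outliers.
    The shared exponent is the largest binary exponent among the rest, where
    math.frexp(v) returns (m, e) with v == m * 2**e and 0.5 <= |m| < 1.
    """
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    outlier_idx = set(order[:num_outliers])
    remaining = [v for i, v in enumerate(values) if i not in outlier_idx]
    exps = [math.frexp(v)[1] for v in remaining if v != 0.0]
    shared = max(exps) if exps else 0  # all-zero block: fall back to exponent 0
    return shared, sorted(outlier_idx)


if __name__ == "__main__":
    block = [0.5, 1.5, 0.25, 96.0]
    print(select_shared_exponent(block, num_outliers=1))
    print(select_shared_exponent(block, num_outliers=0))
```

With one outlier excluded, the shared exponent is set by 1.5 rather than by 96.0, so the smaller values keep more mantissa resolution.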

As per Claim 20, the combination of Drumond, Ling, and Frantz teaches the computer-readable storage devices or media of claim 18, further comprising: instructions that cause the computer system to perform forward propagation for the artificial neural network to produce the first activation values (Drumond, Sec 5.1, discloses “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input”)
and instructions that cause the computer system to perform backward propagation for the artificial neural network by converting at least some of the stored, second activation values to the first floating point format by: for the outlier values, adding a first value described by the outlier value's respective first mantissa and the shared exponent to a second value described by (Drumond, Section 5.1, discloses backward propagation:  “In the forward pass, we convert the activations to BFP, giving the x tensor one exponent per training input. Then we execute the target operation in native floating-point arithmetic. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative.”  Frantz, Para [0063], discloses “If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.”  Here, Frantz splits an outlier in BFP into constituent blocks of BFP to represent a piece of data, the constituent blocks comprising at least a first mantissa with a shared exponent and a second mantissa with an outlier exponent (the shared exponent).  In the combination of Drumond and Frantz, in order to reconstitute this data for future use (i.e., the back propagation step), then the constituent values must be added back up to equal the representation of the original data.)
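For illustration only, the reconstitution reasoning above (an outlier is stored as two pieces that are added back together) can be sketched in Python. This is a minimal sketch and is not taken from Drumond, Frantz, or the application; the function names encode_with_outliers and decode_with_outliers, the 4-bit signed mantissa width, and the separately supplied outlier exponent are all assumptions introduced here.

```python
def encode_with_outliers(values, shared_exp, outlier_exp, mantissa_bits=4):
    """Store each value as (first_mantissa, second_mantissa_or_None).

    The first mantissa uses the shared exponent. If it saturates, the
    remainder is stored as a second mantissa scaled by outlier_exp.
    """
    hi = 2 ** (mantissa_bits - 1) - 1
    encoded = []
    for v in values:
        m = round(v / 2.0 ** shared_exp)
        first = max(-hi, min(hi, m))
        if first != m:  # saturated, so treat the value as an outlier
            residual = v - first * 2.0 ** shared_exp
            second = max(-hi, min(hi, round(residual / 2.0 ** outlier_exp)))
            encoded.append((first, second))
        else:
            encoded.append((first, None))
    return encoded


def decode_with_outliers(encoded, shared_exp, outlier_exp):
    """Rebuild floats by adding the two stored pieces for outlier values."""
    decoded = []
    for first, second in encoded:
        v = first * 2.0 ** shared_exp
        if second is not None:
            v += second * 2.0 ** outlier_exp
        decoded.append(v)
    return decoded


if __name__ == "__main__":
    enc = encode_with_outliers([1.0, 3.0, 40.0], shared_exp=0, outlier_exp=3)
    print(enc)
    print(decode_with_outliers(enc, shared_exp=0, outlier_exp=3))
```

In this sketch only the saturating value carries a second mantissa, and the decoded result is an approximation of the original value rather than an exact copy.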

Claims 6 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Drumond, Ling, and Frantz further in view of Hongxiang et al. (“Reconfigurable Acceleration of 3D-CNNs for Human Action Recognition with Block Floating-Point Representation”; hereinafter Hongxiang).
As per Claim 6, the combination of Drumond, Ling, and Frantz teaches the computing system of claim 5 as shown above.  However, the combination of Drumond, Ling, and Frantz does not teach wherein the outlier quantizer further comprises: a shift controller coupled to a shifter, the shifter being configured to, based on the selected shared exponent, shift the first activation values to produce mantissas for the second values and mantissas for the outlier values. 
Hongxiang teaches wherein the outlier quantizer further comprises: a shift controller coupled to a shifter, the shifter being configured to, based on the selected shared exponent, shift the first activation values to produce mantissas for the second values and mantissas for the outlier values. (Hongxiang, Sec II B I, discloses “Similar to floating-point (FP), BFP representation utilizes a mantissa and an exponent to represent a wide range of value. However, BFP separates the data into different blocks. The numbers in the same block have a joint scaling factor that corresponds to the largest exponent value within that block.”  Here, Hongxiang discloses that a shared exponent is selected as the largest exponent value.  Hongxiang, Sec III D “Accumulator” Lines 5-10, discloses “Essentially, reordering is first performed on the number to find the maximal exponent. Then the accumulator calculates the discrepancies between the maximum and the two other smaller exponents. The results are fed into shift module to complete the mantissa alignment.”  Here, Hongxiang discloses a shift module (i.e., a shift controller coupled to a shifter) that shifts values to produce mantissas based on a selected shared exponent.  Hongxiang, Figure 8, shows this shift module as a hardware component integrated into the circuit (i.e., outlier quantizer).)

It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, with the shifter of Hongxiang. The modification would have been obvious because it amounts to combining prior art elements according to known methods to yield predictable results.  The prior art includes each element claimed, but not combined in a single prior art reference.  One of ordinary skill in the art could have combined the elements as claimed by known methods, and in combination, each element merely performs the same function as it does separately.  Drumond’s accelerator and Hongxiang’s shifter perform the same functions apart as they do together.  As a shifter is a well-known basic computer component that performs a well-known function, which is simply to shift a numerical value by a number of bits, one of ordinary skill in the art would have recognized that the results of the combination were predictable (see MPEP 2143 KSR A).
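For illustration only, the mantissa alignment described in the quoted Hongxiang passage (find the maximal exponent, compute each exponent's difference from it, and shift by that difference) can be sketched in Python. This is a minimal sketch and is not Hongxiang's circuit; the function name align_mantissas and the integer (mantissa, exponent) pair representation are assumptions introduced here.

```python
def align_mantissas(pairs):
    """Align integer mantissas to the largest exponent in the block.

    pairs is a list of (mantissa, exponent) with value = mantissa * 2**exponent.
    Each mantissa is right-shifted by (max_exp - exponent); the low-order bits
    shifted out are discarded.
    """
    max_exp = max(e for _, e in pairs)
    aligned = [m >> (max_exp - e) for m, e in pairs]
    return aligned, max_exp


if __name__ == "__main__":
    print(align_mantissas([(12, 0), (5, 2), (3, 3)]))
```

After alignment every mantissa is expressed against the same exponent, so the values can be combined using integer arithmetic alone.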

 	As per Claim 9, the combination of Drumond, Ling, Frantz, and Hongxiang teaches the computing system of claim 1 as shown above.  Hongxiang teaches the outlier quantizer comprises a shifter configured to shift mantissas of the first activation values according to a shared exponent selected for second values; a portion of each of the shifted first activation values forms a mantissa for a respective one of the second values 
 (Hongxiang, Sec II B I, discloses “Similar to floating-point (FP), BFP representation utilizes a mantissa and an exponent to represent a wide range of value. However, BFP separates the data into different blocks. The numbers in the same block have a joint scaling factor that corresponds to the largest exponent value within that block.”  Here, Hongxiang discloses that a shared exponent is selected as the largest exponent value.  Hongxiang, Sec III D “Accumulator” Lines 5-10, discloses “Essentially, reordering is first performed on the number to find the maximal exponent. Then the accumulator calculates the discrepancies between the maximum and the two other smaller exponents. The results are fed into shift module to complete the mantissa alignment.”  Here, Hongxiang discloses a shift module (i.e., a shift controller coupled to a shifter) that shifts values to produce mantissas based on a selected shared exponent, a portion (which may be the entirety) of the result thereby producing a mantissa for second values.)
	However, Hongxiang does not explicitly teach and a different portion of the shifted first activation values forms an outlier mantissa for the second activation values having an outlier value.
	Frantz teaches and a different portion of the [shifted] first activation values forms an outlier mantissa for the second activation values having an outlier value (Frantz, Para [0063], discloses “If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks”.  Here, a portion of the shifted first values forms a mantissa for the second value.  It is a “portion” because it overflows, or saturates, the bits for the mantissa (i.e., is an outlier).  It is broken up into different portions, which also have mantissas (i.e., outlier mantissa).)  *Hongxiang, as shown above, discloses that the first activation values are shifted.
 
Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Drumond, Ling, and Frantz further in view of Koster et al. (“Flexpoint: An adaptive numerical format for efficient training of deep neural networks”; hereinafter Koster).
As per Claim 7, the combination of Drumond, Ling and Frantz teaches the computing system of claim 1 as shown above.  However, the combination of Drumond, Ling, and Frantz does not teach wherein the outlier quantizer comprises a comparator to identify whether a particular one of the first activation values is selected as one of the outlier values.
Koster teaches wherein the outlier quantizer comprises a comparator to identify whether a particular one of the first activation values is selected as one of the outlier values.  (Koster, Sec 3.2 Last Paragraph, discloses implementation of their system in hardware “Finally, to implement Flexpoint efficiently in hardware, the output exponent has to be determined before the operation is actually performed. Otherwise the intermediate result needs to be stored in high precision, before reading the new exponent and quantizing the result, which would negate much of the potential savings in hardware. Therefore, intelligent management of the exponents is required.”  Koster, Section 3.3 Para 3, defines Gamma as a maximum absolute value based on the activations: “The Autoflex algorithm tracks the maximum absolute value Gamma, of the mantissa of every tensor, by using a dequeue to store a bounded history of these values.”   Koster, Section 3.4, discloses identifying which values are outliers: “At the beginning of training, the statistics queue is empty, so we use a simple trial-and-error scheme described in Algorithm 1 to initialize the exponents. We perform each operation in a loop, inspecting the output value of Gamma for overflows or underutilization, and repeat until the target exponent is found.”  Algorithm 1 discloses “If Gamma >= 2^(N-1) – 1 then overflow”.  Here, Koster is using a hardware component to identify outliers, and the >= operation in the algorithm is indicative of the use of a comparator.)
Drumond and Koster are analogous art because they are in the field of endeavor of quantizing neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, with the comparator to identify overflows of Koster. The modification would have been obvious because one of ordinary skill in the art would be motivated to not lose precision due to overflow so as to not impede convergence of a neural network during training (Koster, Section 2 Para 3 Last sentence: “The main drawback is that this update mechanism only passively reacts to overflows rather than anticipating and preemptively avoiding overflows; this turns out to be catastrophic for maintaining convergence of the training.”)
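For illustration only, the comparison relied on above from Koster's Algorithm 1 can be sketched in Python. This is a minimal sketch and is not Koster's implementation; it reads the quoted threshold as 2^(N-1) - 1 for an N-bit signed mantissa, and the function name overflows and the list-of-integers input are assumptions introduced here.

```python
def overflows(mantissas, n_bits):
    """Return True when the largest mantissa magnitude reaches the signed limit.

    gamma is the maximum absolute mantissa value, compared with >= against
    2**(n_bits - 1) - 1.
    """
    gamma = max(abs(m) for m in mantissas)
    return gamma >= 2 ** (n_bits - 1) - 1


if __name__ == "__main__":
    print(overflows([3, -127, 5], n_bits=8))
    print(overflows([3, -100, 5], n_bits=8))
```

The single >= test is the operation the rejection characterizes as indicative of a comparator.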

Claims 8 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Drumond, Ling, and Frantz further in view of Nurvitadhi et al. (US 2019/0205746 A1; hereinafter Nurvitadhi).
As per Claim 8, the combination of Drumond, Ling, and Frantz teaches the computing system of claim 1 as shown above.  However, the combination of Drumond, Ling, and Frantz does not teach wherein the outlier quantizer comprises an address register, and wherein the outlier quantizer is configured to store an index in the memory indicating an address for at least one of the outlier values.
Nurvitadhi teaches wherein the outlier quantizer comprises an address register, and wherein the outlier quantizer is configured to store an index in the memory indicating an address for at least one of the outlier values (Nurvitadhi, Para [0128], discloses the use of an address register:  “When indirect register addressing mode is used, the register address of one or more operands may be computed based on an address register value and an address immediate field in the instruction.”  Here, Nurvitadhi’s indirect addressing mode means that when the processor is processing an instruction that accesses an operand, the address immediate field of the instruction is not directly the address of the operand.  Rather, the address immediate field of the instruction is actually the address of the address register, which in turn contains the actual address of the operand.  Therefore, the address register functions as a lookup index in the memory which indicates an address of at least one operand. As for the “operands”, Nurvitadhi discloses that “operands” are used in dot product calculations in [0129]:  “The vector math group performs arithmetic such as dot product calculations on vector operands”, and that these calculations are for a neural network in [0060]:  “In embodiments, mechanisms for performing sparse matrix processing for arbitrary neural networks are disclosed.”  Dot product calculations in neural networks are matrix operations in which weights and activations are the operands.  Therefore, in combination with the outlier activation values established by the combination of Drumond and Frantz, the operands comprise outlier values, and the address register therefore stores an index in memory indicating an address for at least one of the outlier values.)

It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the quantizing neural network accelerator of Drumond, with the address register for operands of Nurvitadhi. The modification would have been obvious because it amounts to combining prior art elements according to known methods to yield predictable results.  The prior art includes each element claimed, but not combined in a single prior art reference.  One of ordinary skill in the art could have combined the elements as claimed by known methods, and in combination, each element merely performs the same function as it does separately.  Drumond’s accelerator and Nurvitadhi’s address register perform the same functions apart as they do together.  As an address register is a well-known basic computer component that performs a well-known function, which is simply to store an address in memory, one of ordinary skill in the art would have recognized that the results of the combination were predictable (see MPEP 2143 KSR A).

As per Claim 10, the combination of Drumond, Ling, Frantz, and Nurvitadhi teaches the computing system of claim 1.  Nurvitadhi teaches the processors comprise at least one of the following: a tensor processing unit, a neural network accelerator, a graphics processing unit, or a processor implemented in a reconfigurable logic array (Nurvitadhi, Para [0159], discloses the processor comprises a graphics processing unit:  “FIG. 10 illustrates exemplary graphics software architecture for a data processing system 1000 according to some embodiments. In some embodiments, software architecture includes a 3D graphics application 1010, an operating system 1020, and at least one processor 1030. In some embodiments, processor 1030 includes a graphics processor 1032 and one or more general-purpose processor core(s) 1034. The graphics application 1010 and operating system 1020 each execute in the system memory 1050 of the data processing system.”)
and the memory is situated on a different integrated circuit than the processors, the memory includes dynamic random access memory (DRAM) or embedded DRAM (Nurvitadhi, Para [0069], discloses that the memory includes DRAM:  “The memory device 120 can be a dynamic random access memory (DRAM) device”.  Nurvitadhi, Figure 1, shows the memory device 120 on a different integrated circuit than the processors 102)
 the hardware accelerator memory including static RAM (SRAM) or a register file (Nurvitadhi, Para [0172], discloses “Memory interface may be provided via a memory controller 1265 for access to SDRAM or SRAM memory devices.”  Nurvitadhi, Figure 12, shows memory controller 1265 as being on the integrated circuit (on-chip memory)).
However, Nurvitadhi does not explicitly teach and the system further comprises a hardware accelerator including a memory temporarily storing the first activation values for at least a portion of a layer of the neural network.
Drumond teaches and the system further comprises a hardware accelerator including a memory temporarily storing the first activation values for at least a portion of a layer of the neural network. (Drumond, Sec 5.3, discloses a hardware accelerator:  “To further illustrate this point, we synthesized a proof-of-concept FPGA-based accelerator.”   Drumond, Sec 5.3 Last Line, discloses storing activation values in on-chip memory:  “The proof-of-concept accelerator operates with both weights and activations stored on-chip.”  This will be temporary, as the activation values will change during the forward and backward passes.  Drumond, Conclusion, specifically uses the term “on-chip memory”:  “Higher throughput leads to faster and more energy-efficient DNN training/inference, while model compression leads to lower bandwidth requirements for off-chip memory, lower capacity requirements for on-chip memory and lower communication bandwidth requirements for distributed training”.)

Claims 11-15 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Mellempudi et al. (US 2018/0322607 A1; hereinafter Mellempudi) in view of Frantz.
As per Claim 11, Mellempudi teaches a method of operating a computing system implementing a neural network, the method comprising (Mellempudi, Abstract, discloses “One embodiment provides for a graphics processing unit to perform computations associated with a neural network”)
with the computing system: producing first activation values in a first block floating-point format  (Mellempudi, Para [0238], discloses “The activations at each layer can be quantized to a low-precision format, such as a dynamic fixed-point or blocked flow-precision floating-point format.”)
converting at least one but not all of the first activation values to an outlier block floating-point format different than the first block floating-point format (Mellempudi, Para [0231], discloses “FIG. 21A-21D illustrate blocked dynamic multi-precision data operations, according to embodiments described herein. Blocked dynamic multi-precision data operations can be performed for dynamic fixed-point data and can also be generalized to enable block level scaling for any low precision data type. In a training scenario, some tensors can be blocked, while other tensors can be non-blocked. For example, back propagation may require a larger dynamic range, so the computational logic can be configured to block the tensor data using smaller block sizes. For forward propagation computations, blocking may not be required.”  With each pass of a neural network, activation values are updated (i.e., converted).  Here, Mellempudi discloses that the activation values may be converted to a different format on the backward pass.  In BFP, a shared exponent is selected for each block.  Mellempudi discloses that the block sizes may be smaller for the back propagation, and in fact, blocking may not even be required in the forward propagation (i.e., one shared exponent for the whole tensor).  This means that in the forward and backward passes, a given activation value may be in a different block, and therefore may have a different shared exponent.  This can be considered a different “format”.)
and storing the [outlier] activation values in a computer-readable memory or storage device. (Mellempudi, Para [0189], discloses storing the activation values: “Embodiments described herein provide for a dynamic fixed-point representation that can be used to store quantized floating-point data”.  Mellempudi Para [0249] discloses computer-readable memory:  “Memory device 2220 can be a dynamic random-access memory (DRAM) device, a static random-access memory (SRAM) device, flash memory device, phase-change memory device, or some other memory device having suitable performance to serve as process memory”)
However, Mellempudi does not teach thereby generating outlier activation values.
Frantz teaches thereby generating outlier activation values (Frantz, Para [0062-0063], discloses that conversion to BFP can lead to outliers: “Each block is assigned an exponent value depending on the mean brightness of that block. Then each pixel is assigned a relative mantissa value that represents its brightness compared to the mean brightness. For example, if one were to take a picture of a person standing in front of a window, the blocks would be such that the person's body would be represented by a set of blocks with the same exponent value (or similar), and the window would be represented by a set of blocks with another exponent value. In FIG. 5 and FIG. 6, it is evident how the blocks might be created for the picture shown. The very bright areas would have high exponent values, and the dark areas would have low exponent values. In the example, square blocks of pixels are used, but any shape of the blocks would be appropriate--even non-rectangular. The ultimate solution is that the mantissas for all of the pixels in the scene create a normalized picture, and the exponent values indicate the variation of brightness over the whole scene.  If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.”  Here, Frantz discloses a method of conversion to block floating point format, wherein some of the values “saturate the mantissa” (i.e., are outliers)).
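For illustration only, and not as part of either cited disclosure, the following minimal sketch shows how a block-mean-derived shared exponent can cause a large value to saturate the mantissa, and how splitting the block resolves it. The function names, the 8-bit signed mantissa, the exponent offset, and the one-dimensional split into halves (Frantz describes four equal two-dimensional blocks) are all assumptions made for this example; the split-on-zero-mantissa case mentioned in Frantz is omitted.

```python
import math

def quantize_block(values, mantissa_bits=8):
    """Quantize one block to a shared exponent plus signed integer mantissas.

    Hypothetical helper: the shared exponent is derived from the mean
    magnitude of the block, so values far above the mean can saturate.
    Returns (shared_exp, mantissas, indices_of_saturated_values).
    """
    max_m = 2 ** (mantissa_bits - 1) - 1
    mean_mag = sum(abs(v) for v in values) / len(values)
    if mean_mag == 0:
        return 0, [0] * len(values), []
    # Assumed offset: the mean lands roughly 3 bits below the mantissa ceiling.
    shared_exp = math.floor(math.log2(mean_mag)) - (mantissa_bits - 4)
    scale = 2.0 ** shared_exp
    mantissas, saturated = [], []
    for i, v in enumerate(values):
        m = round(v / scale)
        if abs(m) > max_m:
            saturated.append(i)          # this value is an "outlier"
            m = max_m if m > 0 else -max_m
        mantissas.append(m)
    return shared_exp, mantissas, saturated

def quantize_with_splitting(values, mantissa_bits=8):
    """Split a block in half (assumed 1-D analogue) until nothing saturates."""
    exp, mants, saturated = quantize_block(values, mantissa_bits)
    if not saturated or len(values) == 1:
        return [(exp, mants)]
    mid = len(values) // 2
    return (quantize_with_splitting(values[:mid], mantissa_bits)
            + quantize_with_splitting(values[mid:], mantissa_bits))

block = [1.0, 1.5, 0.75, 1.25, 0.5, 1.0, 1.0, 100.0]
print(quantize_block(block))           # the value 100.0 saturates
print(quantize_with_splitting(block))  # two sub-blocks, two shared exponents
```

In this sketch the same value receives a different shared exponent depending on which block it falls in, which is the sense in which a change of block size is read above as a change of "format".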
Mellempudi and Frantz are analogous art because they are in the field of endeavor of efficient floating point operations. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the neural network quantization of Mellempudi, with the overflow storage of Frantz. The modification would have been obvious because one of ordinary skill in the art 

As per Claim 12, the combination of Mellempudi and Frantz teaches the method of claim 11 as shown above, as well as wherein each of the outlier activation values comprises a first mantissa associated with a shared exponent shared by all of the outlier values and a second mantissa associated with a different exponent than the shared exponent. (Frantz, Para [0063], discloses “If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.” Here, the value comprises a first mantissa associated with a shared exponent shared by all of the outlier values.  This first mantissa, however, is incomplete, as it saturates the bits.  The value also comprises a second (and third, possibly fourth, etc) mantissa associated with a different exponent, as Frantz calculates a new exponent on the deconstructed data.  While Frantz is splitting up the data the value is representing, the overall value of this data still “comprises” the first incomplete value and the smaller constituent values, as an overall property of the data.)

As per Claim 13, the combination of Mellempudi and Frantz teaches the method of claim 12 as shown above, as well as wherein the first mantissa and the second mantissa each comprise the same number of bits. (Frantz, Abstract, discloses that all values will be in the same format:  “Embodiments of the invention provide a 16 bit floating point signal processor”, and Frantz Fig 4 shows an 11 bit mantissa.  Frantz does not change the number of bits in the mantissa when splitting up an outlier.)

As per Claim 14, the combination of Mellempudi and Frantz teaches the method of claim 12 as shown above, as well as further comprising identifying the shared exponent by determining at least one of a mean, a median, and/or a mode for at least a portion of the first activation values.  (Frantz, Para [0062], discloses “Each block is assigned an exponent value depending on the mean brightness of that block.”)

As per Claim 15, the combination of Mellempudi and Frantz teaches the method of claim 11.  Mellempudi teaches generating uncompressed activation values by reading the stored [outlier] activation values from the computer-readable memory or storage device (Mellempudi, Para [0238], discloses activation values:  “The activations at each layer can be quantized to a low-precision format, such as a dynamic fixed-point or blocked flow-precision floating-point format” and Mellempudi, Para [0189], discloses storing quantized values (i.e., activation values) in memory:  “Embodiments described herein provide for a dynamic fixed-point representation that can be used to store quantized floating-point data”.  For the activation values to be of any use in calculations such as the dot product in Mellempudi, Para [0292], “The vector math group performs arithmetic such as dot product calculations on vector operands”, the values must be retrieved from memory.  The claim does not state that the values have been “compressed”, and “uncompressed” has been given no meaning in this limitation; therefore, these activation values can be considered uncompressed.)
	However, Mellempudi does not teach outlier values;  wherein for each of the uncompressed activation values: when the uncompressed activation value is associated with an outlier activation value, generating a respective one of the uncompressed activation values by: combining a first value defined by a first mantissa for the uncompressed activation value and a shared exponent with a second value defined by a second mantissa associated with an outlier exponent associated with the uncompressed activation value, and when the uncompressed activation value is not associated with an outlier activation value, generating a respective one of the uncompressed activation values by: producing a first value defined by a first mantissa for the uncompressed activation value and a shared exponent.
	Frantz teaches outlier [activation] values; wherein for each of the uncompressed [activation] values: when the uncompressed [activation] value is associated with an outlier [activation] value, generating a respective one of the uncompressed [activation] values by: combining a first value defined by a first mantissa for the uncompressed [activation] value and a shared exponent with a second value defined by a second mantissa associated with an outlier exponent associated with the uncompressed [activation] value, and when the uncompressed [activation] value is not associated with an outlier [activation] value, generating a respective one of the uncompressed [activation] values by: producing a first value defined by a first mantissa for the uncompressed [activation] value and a shared exponent. (Frantz, Para [0063], discloses outlier values, and how they are partitioned and stored: “If any pixel saturates the mantissa, or if the mantissa becomes zero; the block may be split into a subset of blocks, for example, four equal blocks, and the mean and pixel brightnesses are established for each of the smaller blocks. If, in these smaller blocks, saturation or zero occurs again, the block(s) it occurs in may be split again into a subset of blocks, for example, once again into four equal blocks.”  Mellempudi, as shown above, discloses retrieving activation values from memory.  When combined with Frantz, this necessitates retrieving non-outlier BFP representations with a first mantissa and shared exponent, and retrieving outlier BFP representations by combining a first mantissa with a shared exponent and a second mantissa with an outlier exponent, which in Frantz’s case, is the same exponent as the shared exponent used with the first mantissa, as Frantz has split the block off into another BFP block with a shared exponent.  Combining the values in this block reconstructs the original data.)
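For illustration only, the two branches of the claimed reconstruction (outlier versus non-outlier) can be sketched as follows. The function name `decode_value`, the integer-mantissa convention, and the sample numbers are assumptions made for this example and are not taken from the application, Mellempudi, or Frantz.

```python
def decode_value(first_mantissa, shared_exp,
                 second_mantissa=None, outlier_exp=None):
    """Rebuild one value from its stored fields (hypothetical layout).

    Non-outlier: first_mantissa * 2**shared_exp.
    Outlier:     that first value plus second_mantissa * 2**outlier_exp.
    """
    value = first_mantissa * 2.0 ** shared_exp
    if second_mantissa is not None:
        value += second_mantissa * 2.0 ** outlier_exp
    return value

# Non-outlier value: only the first mantissa and the shared exponent are used.
print(decode_value(3, -1))            # 1.5

# Outlier whose second exponent equals the shared exponent, as in the
# reading of Frantz applied above.
print(decode_value(127, -1, 73, -1))  # 100.0

# Outlier with a distinct outlier exponent.
print(decode_value(3, -1, 3, 5))      # 97.5
```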

As per Claim 17, the combination of Mellempudi and Frantz teaches the method of claim 11 as shown above, as well as producing the first activation values by performing forward propagation for at least one layer of the neural network (Mellempudi, Para [0238], discloses activation values: “The activations at each layer can be quantized to a low-precision format, such as a dynamic fixed-point or blocked flow-precision floating-point format” and in [0231] performing forward propagation:  “For forward propagation computations, blocking may not be required.”)
performing backward propagation for the at least one layer of the neural network by converting the stored, second activation values to activation values in the first block floating- (Note that “stored, second activation values” lacks antecedent basis in the claim.  Mellempudi, Para [0231], discloses backward propagation:  “For example, back propagation may require a larger dynamic range, so the computational logic can be configured to block the tensor data using smaller block sizes.”  Here, Mellempudi discloses second activation values (those constructed in a different block format in the back propagation as opposed to the forward propagation).  The backward propagation will necessarily lead to the subsequent forward propagation, resulting in conversion back to the first floating point format.  Compression has been given no definition, so values in both block floating point formats are “uncompressed”)
performing a gradient operation with the uncompressed activation values (Mellempudi, Para [0158], discloses “The error values are then propagated backwards until each neuron has an associated error value which roughly represents its contribution to the original output. The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the neural network.”)
and updating weights for at least one node of the neural network based on the uncompressed activation values (Mellempudi, Para [0140] discloses “Training a neural network involves selecting a network topology, using a set of training data representing a problem being modeled by the network, and adjusting the weights until the network model performs with a minimal error for all instances of the training data set”)

Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Mellempudi and Frantz further in view of Chong et al. (US 2019/0373264 A1; hereinafter Chong).
As per Claim 16, the combination of Mellempudi and Frantz teaches the method of claim 11 as shown above.  However, the combination of Mellempudi and Frantz does not teach prior to the storing, compressing the outlier activation values stored in the computer- readable memory or storage device by one or more of the following techniques: entropy compression, zero compression, run length encoding, compressed sparse row compression, or compressed sparse column compression.
Chong teaches prior to the storing, compressing the outlier activation values stored in the computer- readable memory or storage device by one or more of the following techniques: entropy compression, zero compression, run length encoding, compressed sparse row compression, or compressed sparse column compression. (Chong, Para [0102], discloses compressing activation data of a neural network using entropy compression, and storing it:  “To reduce memory access bandwidth requirements for neural network data, the neural network device or neural network component can perform a method to compress data from intermediate nodes in the neural network in a lossless manner. For example, a neural network coding engine of the neural network device or neural network component can be used after each hidden layer of a neural network to compress the activation data output from each hidden layer. The activation data output from a hidden layer can include a 3D volume of data having a width, height, and depth, with the depth corresponding to multiple layers of filters for that hidden layer, and each depth layer having a width and height. For instance, a feature map (with activation or feature data) having a width and height is provided for each depth layer. The compressed data can be stored in a storage device or memory. The storage device or memory can be internal to the neural network device or neural network hardware component, or can be external to the device or hardware component. The neural network coding engine can retrieve the compressed activation data (e.g., read the compressed data and load the data in a local cache), and can decompress the compressed activation data before providing the decompressed activation data as input to a next layer of the neural network. In some examples, a prediction scheme can be applied to the activation data, and residual data can be determined based on the prediction scheme. 
In one illustrative example, given a block of neural network data (e.g., activation data from a hidden layer), the neural network coding engine can apply a prediction scheme to each sample in the block of neural network data, and residual data can be determined based on the prediction scheme. The residual data can then be coded using a coding technique. Any suitable coding technique can be used, such as variable-length coding (VLC), arithmetic coding, other type of entropy coding, or other suitable technique.”)
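For illustration only, the prediction-then-residual approach quoted above can be sketched as follows. The previous-sample predictor, the function names, and the use of empirical Shannon entropy as a stand-in for the size achievable by an entropy coder are assumptions made for this example; Chong does not specify them in the quoted passage.

```python
import math
from collections import Counter

def residuals_from_prediction(samples):
    """Predict each sample as the previous one (assumed predictor)."""
    prev, out = 0, []
    for s in samples:
        out.append(s - prev)
        prev = s
    return out

def reconstruct(residuals):
    """Invert the predictor; the round trip is lossless."""
    prev, out = 0, []
    for r in residuals:
        prev += r
        out.append(prev)
    return out

def empirical_entropy_bits(symbols):
    """Shannon entropy in bits per symbol of the observed distribution."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

activations = [10, 11, 12, 13, 14, 15, 16, 17]
res = residuals_from_prediction(activations)
print(res)                                   # [10, 1, 1, 1, 1, 1, 1, 1]
print(empirical_entropy_bits(activations))   # 3.0 bits per sample
print(empirical_entropy_bits(res))           # well under 1 bit per sample
print(reconstruct(res) == activations)       # True
```

The residuals are more concentrated than the raw samples, so an entropy coder such as a variable-length or arithmetic coder would need fewer bits to store them.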
Mellempudi and Chong are analogous art because they are in the field of endeavor of neural networks. 
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the neural network quantization of Mellempudi, with the entropy compression of Chong. The modification would have been obvious because one of ordinary skill in the art would be motivated to reduce bandwidth and power consumption. (Chong Para [0003]:  “In some cases, for the intermediate layers of a neural network, either 8 bit or 16 bit fixed or 

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Jo et al. (“Training Neural Networks with Low Precision Dynamic Fixed-Point”) discloses training CNNs using a dynamic fixed point format with shared exponents.
Han et al. (“Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding”) discloses quantizing a neural network before compressing it with Huffman coding.
Pareek et al. (US 10,747,502 B2) discloses a multiply and accumulate circuit for use with neural networks that performs the dot product using fixed point format with shared exponent.
Gibson et al. (US 2017/0323197 A1) discloses a hardware implementation of a CNN, using a fixed point format with a shared exponent per layer.
Nair et al. (US 2019/0042944 A1) discloses hardware to train neural networks utilizing a shared exponent format.


Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710.  The examiner can normally be reached on M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached on (571) 272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/L.A.S./Examiner, Art Unit 2126  
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126