DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
2.	The information disclosure statements (IDS) submitted on September 13, 2019, May 13, 2020, March 30, 2021, and February 15, 2022 were filed after the mailing date of the application on December 29, 2017.  The submission is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.
Claim Objections
3.	Claim 20 is objected to because of the following informalities:  Claim 20 recites “GEMV” without reciting what “GEMV” stands for.  Appropriate correction is required.
Claim Interpretation
4.	The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

5.	The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
6.	This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: scheduler in claims 1, 2, 7, and 13, multiplication unit in claim 1, sparsity management unit in claims 12-13, block floating point (FP) management unit in claim 12, and variable and mix precision compute unit in claims 12 and 16.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 103
7.	In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
8.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

9.	The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

10.	Claims 1-2 is/are rejected under 35 U.S.C. 103 as being unpatentable over Dally (US010528864B2) in view of Huang (US010567494B2).
11.	As per Claim 1, Dally teaches an apparatus to facilitate processing a sparse matrix (sparsity in a layer of a CNN is defined as the fraction of zeros in the layer’s weight and input activation matrices, col. 6, lines 39-40), comprising:  a graphics processing unit (training is often done on graphics processing units, col. 1, lines 33-35), including:  a data management unit (DMU) having a scheduler to schedule matrix operations, an active circuitry to track active input operands (only the non-zero elements of weights and input activations are provided as operands to the multipliers, ensuring that each multiplier within a processing element generates a product that affects an output activation value, col. 3, lines 49-53), and a skip circuitry to track unimportant input operands to be skipped by the scheduler (previous efforts to exploit sparsity in CNN accelerators have focused on reducing energy or saving time, which will invariably also save energy, eliminating the multiplication when an input operand is zero by gating an operand input to a multiplier is a natural way to save energy, gating an operand will save energy, but the number of processing cycles will not be reduced, the SNN accelerator 200 also saves energy by eliminating all the unnecessary multiplications, and when any input operand is zero the circuitry is not even prepared to perform a multiplication operation, thus saving time as well, col. 28, lines 54-64); and processing circuitry coupled to the DMU, the processing circuitry comprising a plurality of processing elements including circuitry to read operands, and a multiplication unit to multiply two or more operands (only the non-zero elements of weights and input activations are provided as operands to the multipliers, ensuring that each multiplier within a processing element generates a product that affects an output activation value, col. 3, lines 49-53; when the weights and input activations are in compact form, only non-zero weights and input activations are transferred from the memory interface to the PEs 210, col. 5, lines 23-26).
However, Dally does not teach processing the sparse matrix for arbitrary graph data, and multiplying the two or more operands for the arbitrary graph data.  However, Huang teaches an apparatus to facilitate processing a sparse matrix for arbitrary graph data (sparse matrix (with nonzero elements stored) generally has three storage formats: COO, CSR, and CSC, zero element in a matrix representation does not need to be stored during storage, therefore, a volume of stored data can be reduced by representing a graph using a matrix, in an adjacency matrix representation, most operations for a graph may be converted to a matrix-vector multiplication operation, or a matrix-matrix multiplication operation, col. 8, lines 57-67), comprising:  the processing circuitry comprising a plurality of processing elements including circuitry to read operands, and a multiplication unit to multiply two or more operands for the arbitrary graph data (multiple computation units used for computation are included in a cluster environment, matrix-vector multiplication operations and matrix-matrix multiplication operations are both performed based on matrix partitioning, after being partitioned, a matrix may be referred to as a distributed matrix, concurrent processing may be performed on the distributed matrix using multiple computation units, col. 9, lines 35-41; sparse matrix (with nonzero elements stored) generally has three storage formats: COO, CSR, and CSC, zero element in a matrix representation does not need to be stored during storage, therefore, a volume of stored data can be reduced by representing a graph using a matrix, in an adjacency matrix representation, most operations for a graph may be converted to a matrix-vector multiplication operation, or a matrix-matrix multiplication operation, col. 8, lines 57-67).
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Dally to include processing the sparse matrix for arbitrary graph data, and multiplying the two or more operands for the arbitrary graph data because Huang suggests that a volume of stored data can be reduced by representing a graph using a matrix (col. 8, lines 57-67).
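For illustration only, the compressed sparse row (CSR) storage and the graph-as-matrix-vector-product conversion discussed in the cited passage of Huang may be sketched as follows. This is a minimal sketch and is not taken from Dally or Huang; the function names `dense_to_csr` and `csr_matvec` and the example graph are hypothetical.

```python
# Illustrative sketch only; all names are hypothetical and not from the cited references.

def dense_to_csr(dense):
    """Store only the nonzero elements of a dense row-major matrix in CSR form."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i ends in the values array.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector, touching only stored nonzeros."""
    y = [0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


# Directed graph with edges 0->1, 0->2, 1->2, 3->0, written as an adjacency matrix.
adjacency = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
values, col_idx, row_ptr = dense_to_csr(adjacency)

# Multiplying by an all-ones vector gives the out-degree of each vertex,
# one example of a graph operation expressed as a matrix-vector product.
out_degree = csr_matvec(values, col_idx, row_ptr, [1, 1, 1, 1])
```

In this sketch the twelve zero entries of the adjacency matrix are never stored or multiplied; only the four edges are.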
12.	As per Claim 2, Dally teaches wherein the scheduler to schedule non-zero operands at the multiplication unit (col. 3, lines 49-53).
13.	Claim 3 is/are rejected under 35 U.S.C. 103 as being unpatentable over Dally (US010528864B2) and Huang (US010567494B2) in view of Tanaka (US 20210149983A1).
	Dally and Huang are relied upon for the teachings as discussed above relative to Claim 1.  Dally teaches base pointers for input and output vectors; and memory to store input and output vectors (weight buffer 305 is a FIFO buffer that includes a tail pointer, a channel pointer, and a head pointer, layer sequencer pushing weight vectors into the weight buffer 305, tail pointer is not allowed to advance over the channel pointer, full condition is signaled when the tail pointer will advance past the channel pointer when another write vector is stored, col. 13, lines 60-67; when the processing is not stalled, the destination calculation unit increments the head pointers each processing cycle, outputting another vector of weights each processing cycle, destination calculation unit continues to increment the head pointer, each processing cycle that the processing is not stalled, until the next increment would pass the end of the current channel, when the end of the current channel is reached, the destination calculation unit advances the lAPtr and the head pointer is rolled back to the start of the current channel, lAPtr is then used to read the next vector of 1 input activations and the rolled back head pointer is used to read the first vector of weights, destination calculation unit then sequences all of the weights for another vector of input activations to produce another vector of products, when the last vector of input activations for channel c is processed, the destination calculation unit advances to channel c+1 by setting the channel pointer to point to the first weight vector of the channel c+1, col. 14, line 54-col. 15, line 7).
	However, Dally and Huang do not expressly teach memory having pointer circuitry to store the base pointers.  However, Tanaka teaches further comprising:  memory having pointer circuitry to store base pointers for input and output vectors (row number storing array and column number storing array are respectively used as pointers to output vector and input vector, and the p-th calculator performs a unit multiply-accumulate operation, [0103]); and memory to store input and output vectors (reads a corresponding element of the input vector from dedicated memory 126, L1 cache memory 122, or L2 cache memory 125, and stores it into its corresponding second register, [0079]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Dally and Huang to include memory having pointer circuitry to store the base pointers because Tanaka suggests that the pointers need to be stored in order to be used [0103].
14.	Claim 4 is/are rejected under 35 U.S.C. 103 as being unpatentable over Dally (US010528864B2) and Huang (US010567494B2) in view of Tanaka (US 20210149983A1) and Ishikawa (US 20110262015A1).
Dally and Huang are relied upon for the teachings as discussed above relative to Claim 1.
However, Dally and Huang do not teach wherein each processing element includes the circuitry to read operands, pointer circuitry for providing a column pointer to a memory address of a coefficient of a matrix, data circuitry to generate and send a coefficient value that is identified by the column pointer to the multiplication unit.  However, Tanaka teaches wherein each processing element includes the circuitry to read operands, pointer circuitry for providing a column pointer to a memory address of a coefficient of a matrix (column number storing array are used as a pointer to coefficient matrix, [0072]), data circuitry to generate and send a coefficient value that is identified by the column pointer to the multiplication unit (based on column number storing array, the unit multiply-accumulate operations to be performed in a coefficient matrix, [0091]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Dally and Huang so that each processing element includes the circuitry to read operands, pointer circuitry for providing a column pointer to a memory address of a coefficient of a matrix, data circuitry to generate and send a coefficient value that is identified by the column pointer to the multiplication unit as suggested by Tanaka.  It is well-known in the art that a system of linear equations is frequently represented by its coefficient matrix.
However, Dally, Huang, and Tanaka do not teach that the coefficient is a weighted coefficient.  However, Ishikawa teaches sending a weighted coefficient value to the multiplication unit (multiplying a norm of a difference between the corresponding point and a product of the transformation matrix and the representative point by a weighted coefficient that is obtained for each representative point, [0052]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Dally, Huang, and Tanaka so that the coefficient is a weighted coefficient as suggested by Ishikawa.  It is well-known in the art to use a weighted coefficient for convolution in image processing.
15.	Claim 5 is/are rejected under 35 U.S.C. 103 as being unpatentable over Dally (US010528864B2), Huang (US010567494B2), Tanaka (US 20210149983A1), and Ishikawa (US 20110262015A1) in view of Mortensen (US 20140365548A1).
	Dally, Huang, Tanaka, and Ishikawa are relied upon for the teachings as discussed above relative to Claim 4.
	However, Dally, Huang, Tanaka, and Ishikawa do not teach wherein the data circuitry sends an identifier of a memory address or a position of the output vector to the output buffer.  However, Mortensen teaches wherein the data circuitry sends an identifier of a memory address or a position of the output vector to the output buffer (transmission of the result vector elements to the designated memory space, as indicated by the data pointer to the result vector, of the single-ported data memory, [0072]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Dally, Huang, Tanaka, and Ishikawa so that the data circuitry sends an identifier of a memory address or a position of the output vector to the output buffer because Mortensen suggests that this way, the output vector can be retrieved quickly [0072].
16.	Claim 6 is/are rejected under 35 U.S.C. 103 as being unpatentable over Dally (US010528864B2) and Huang (US010567494B2) in view of Chen (US 20210182077A1).
	Dally and Huang are relied upon for the teachings as discussed above relative to Claim 1.
	However, Dally and Huang do not teach wherein the graphics processing unit supports arbitrary connections across any layers of the arbitrary irregular neural network.  However, Chen teaches wherein the graphics processing unit supports arbitrary connections across any layers of the arbitrary irregular neural network (arbitrary neural network/neural network layer operations, [0566], arbitrary neural network operation or neural network layer operation, [2844], graphics processor (GPU) to set up and operate a neural network, [3028]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Dally and Huang so that the graphics processing unit supports arbitrary connections across any layers of the arbitrary irregular neural network because Chen suggests that this way, it is not limited to a particular neural network architecture [0566, 2844, 3028].
17.	Claim 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Merrill (US 20160179574A1) in view of Amir (US 20160275395A1) and Kalinin (US 20180322179A1).
Merrill teaches a hardware accelerator to facilitate processing a sparse matrix for a neural network (SpMV, training of convolution neural networks, [0079], sparse matrix vector products (SpMV), [0021]), comprising:  a data management unit (DMU) having a scheduler to schedule matrix operations (in the case of SpMV, the merge-based algorithm may be implemented such that each thread processes an equal number of a combination of non-zero values of the sparse matrix combined with rows of the sparse matrix, [0028]) and an auxiliary buffer to store active input operands (register file 420 provides temporary storage for operands connected to the data paths of the functional units, [0054]); and a plurality of processing elements coupled to the DMU [0025], each processing element includes customizable circuitry to support an input vertex program (SMs 340 may be configured to execute a vertex shader program, [0060]) for the neural network [0079].
	However, Merrill does not teach that the neural network is an arbitrary neural network.  However, Amir teaches a hardware accelerator to facilitate processing a sparse matrix (sparse areas of the matrix representation 100 into a reordered matrix, [0082]) for an arbitrary neural network [0034].
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Merrill so that the neural network is an arbitrary neural network because Amir suggests that this way, it is not limited to a particular neural network architecture [0034].
	However, Merrill and Amir do not teach an input buffer for edge data and message data.  However, Kalinin teaches an input buffer for edge data and message data (graph engine, store the transformed data, include edge data, in the shared memory buffer, Claim 7 of Kalinin, graph engine, storing the relational data as the edge data in the shared memory buffer, Claim 14 of Kalinin).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Merrill and Amir to include an input buffer for edge data and message data because Kalinin suggests that this reduces time and resource usage for the data movement [0014] (Claims 7 and 14 of Kalinin).
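For illustration only, the merge-based work division described in the cited passage of Merrill, in which each thread processes an equal number of non-zero values combined with rows, may be sketched as below. This is a sequential emulation under stated assumptions and is not Merrill's implementation: the names `merge_path_search` and `merge_spmv` and the example matrix are hypothetical, and a parallel version would need a carry-out fix-up for rows that span workers, which this sketch avoids by accumulating directly into the output.

```python
# Illustrative sequential sketch only; names and data are hypothetical.

def merge_path_search(diagonal, row_end_offsets, nnz):
    """Return (rows consumed, nonzeros consumed) where the merge path crosses a diagonal."""
    lo = max(diagonal - nnz, 0)
    hi = min(diagonal, len(row_end_offsets))
    while lo < hi:
        mid = (lo + hi) // 2
        # A row is consumed once all of its nonzeros have been consumed.
        if row_end_offsets[mid] <= diagonal - mid - 1:
            lo = mid + 1
        else:
            hi = mid
    return lo, diagonal - lo


def merge_spmv(values, col_idx, row_ptr, x, num_workers):
    """CSR matrix-vector product with work split evenly over rows plus nonzeros."""
    num_rows = len(row_ptr) - 1
    nnz = len(values)
    row_end_offsets = row_ptr[1:]
    total_items = num_rows + nnz
    items_per_worker = -(-total_items // num_workers)  # ceiling division
    y = [0] * num_rows
    shares = []
    for w in range(num_workers):
        start = min(w * items_per_worker, total_items)
        end = min(start + items_per_worker, total_items)
        row, k = merge_path_search(start, row_end_offsets, nnz)
        row_stop, k_stop = merge_path_search(end, row_end_offsets, nnz)
        shares.append((row_stop - row) + (k_stop - k))
        # Consume the rows that end inside this worker's share.
        while row < row_stop:
            while k < row_end_offsets[row]:
                y[row] += values[k] * x[col_idx[k]]
                k += 1
            row += 1
        # Consume the leading part of a row that a later worker finishes.
        while k < k_stop:
            y[row] += values[k] * x[col_idx[k]]
            k += 1
    return y, shares


# Skewed 4x4 matrix: row 0 has four nonzeros, row 1 is empty.
values = [1, 2, 3, 4, 5, 6, 7]
col_idx = [0, 1, 2, 3, 1, 0, 3]
row_ptr = [0, 4, 4, 5, 7]
y, shares = merge_spmv(values, col_idx, row_ptr, [1, 1, 2, 2], num_workers=3)
```

Although the rows hold four, zero, one, and two nonzeros, the three workers in this sketch receive four, four, and three merge items respectively, so no worker is stalled behind the long row.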
18.	Claim 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Merrill (US 20160179574A1), Amir (US 20160275395A1), and Kalinin (US 20180322179A1) in view of Chen (US 20210182077A1).
	Merrill, Amir, and Kalinin are relied upon for the teachings as discussed above relative to Claim 7.
	However, Merrill, Amir, and Kalinin do not teach wherein the hardware accelerator supports arbitrary connections across any layers of the arbitrary irregular neural network.  However, Chen teaches wherein the hardware accelerator supports arbitrary connections across any layers of the arbitrary irregular neural network (arbitrary neural network/neural network layer operations, [0566], arbitrary neural network operation or neural network layer operation, [2844], graphics processor (GPU) to set up and operate a neural network, [3028]).  This would be obvious for the reasons given in the rejection for Claim 6.
19.	Claim 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lee (see citation below).
	Lee teaches a graphics processing unit (Abstract, p. v), comprising:  a sparsity management unit to manage sparsity operations (iterative method is used for sparse matrix, p. 4, 2nd paragraph); a block floating point (FP) management unit to support block FP operations (in block floating arithmetic a block of data shares an exponent for arithmetic operations so that they can perform fixed-point arithmetic in the block, p. 12, 1st paragraph); and a variable and mix precision compute unit to support variable and mix precision operations (variable precision arithmetic on FPGAs, p. 18, p. 37).  Since all of these elements are taught within Lee, it would have been obvious to one of ordinary skill in the art for these elements to be comprised in the graphics processing unit.
20.	Claim 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lee (see citation below) in view of Judd (US 20190205740A1).
	Lee is relied upon for the teachings as discussed above relative to Claim 12.
	However, Lee does not teach wherein the sparsity management unit comprises:  a value check mechanism to detect unimportant values including zero operands and skip these unimportant values of input vectors, and a scheduler to determine scheduling of computations based on scheduling important values and skipping unimportant values of input vectors that are detected by the value check mechanism.  However, Judd teaches wherein the sparsity management unit comprises:  a value check mechanism to detect unimportant values including zero operands and skip these unimportant values of input vectors, and a scheduler to determine scheduling of computations based on scheduling important values and skipping unimportant values of input vectors that are detected by the value check mechanism (CSR, like most sparse matrix formats that target matrices with extreme levels of sparsity have two goals: store only the non-zero elements and reduce memory footprint, [0079], skip zero-operand multiplications, [0036], effectual activation could be skipped if all corresponding weights are ineffectual, calculates each bit of a Can Skip 16-bit vector, [0136]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lee so that the sparsity management unit comprises:  a value check mechanism to detect unimportant values including zero operands and skip these unimportant values of input vectors, and a scheduler to determine scheduling of computations based on scheduling important values and skipping unimportant values of input vectors that are detected by the value check mechanism because Judd suggests that this achieves performance and energy improvements by skipping over most ineffectual operations in which an input of a multiplication is zero [0004].
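For illustration only, a value check that detects zero operands and a scheduler that issues only the remaining multiplications may be sketched as follows. This is a minimal sketch under the assumption that "unimportant" means exactly zero; the names `schedule_nonzero` and `dot_with_skipping` are hypothetical and do not appear in Lee or Judd.

```python
# Illustrative sketch only; names are hypothetical and not from the cited references.

def schedule_nonzero(weights, activations):
    """Value check: keep only positions where both operands are nonzero."""
    return [
        i
        for i, (w, a) in enumerate(zip(weights, activations))
        if w != 0 and a != 0
    ]


def dot_with_skipping(weights, activations):
    """Dot product that issues a multiply only for scheduled positions."""
    scheduled = schedule_nonzero(weights, activations)
    total = 0
    for i in scheduled:
        total += weights[i] * activations[i]
    # The second value counts the multiplications actually performed.
    return total, len(scheduled)


weights = [0, 3, 0, 2, 5, 0]
activations = [4, 0, 0, 7, 1, 9]
result, multiplies = dot_with_skipping(weights, activations)
```

Here the result equals the full six-element dot product while only two multiplications are issued, which is the distinction between skipping an operation and merely gating its operand.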
21.	Claims 14-15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lee (see citation below) in view of Bittner (US 20180157465A1).
22.	As per Claim 14, Lee is relied upon for the teachings as discussed above relative to Claim 12.
	However, Lee does not teach wherein the block FP management unit includes select circuitry to select a shared exponent for input vectors if the input vectors have block FP and thus different exponents.  However, Bittner teaches wherein the block FP management unit includes select circuitry to select a shared exponent for input vectors if the input vectors have block FP and thus different exponents (producing a BFP representation of an updated matrix or vector, at least two elements of the updated matrix or vector sharing a common exponent, assign one of a plurality of common exponents to a respective mantissa for each element in a matrix or vector, [0137]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lee so that the block FP management unit includes select circuitry to select a shared exponent for input vectors if the input vectors have block FP and thus different exponents because Bittner suggests that this allows for reduced memory usage, simplified hardware implementation of multipliers and other floating-point matrix processing circuits, energy reduction, and improved computational performance with little or no loss of precision [0002].
23.	As per Claim 15, Lee does not teach wherein the block FP management unit includes align circuitry to cause alignment of a mantissa for the input vector that has a change in exponent.  However, Bittner teaches wherein the block FP management unit includes align circuitry to cause alignment of a mantissa for the input vector that has a change in exponent (align the bias mantissas to the intermediate result vector mantissas, [0043], output mantissa shifter 150 aligns the elements of a partial result vector to the output exponent and produces the final result vector, [0044]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lee so that the block FP management unit includes align circuitry to cause alignment of a mantissa for the input vector that has a change in exponent because Bittner suggests that this is a way of calculating the best exponent for a BFP representation [0054].
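For illustration only, selection of a shared exponent for a block of values and alignment of mantissas after a change in exponent may be sketched as follows. This is a minimal sketch and not the circuitry of Lee or Bittner; it assumes the shared exponent is the largest exponent in the block and an 8-bit signed mantissa, and the names `to_block_fp`, `from_block_fp`, and `align_block` are hypothetical.

```python
# Illustrative sketch only; names and the 8-bit mantissa width are assumptions.
import math


def to_block_fp(values, mantissa_bits=8):
    """Encode values as signed integer mantissas that share one exponent."""
    # Select the shared exponent: the largest binary exponent in the block.
    shared_exp = max((math.frexp(v)[1] for v in values if v != 0.0), default=0)
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    # Align every mantissa to the shared exponent; small values lose low bits.
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in values]
    return mantissas, shared_exp


def from_block_fp(mantissas, shared_exp, mantissa_bits=8):
    """Decode a block back to floating point values."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]


def align_block(mantissas, old_exp, new_exp):
    """Shift mantissas right so the block can use a larger exponent."""
    shift = new_exp - old_exp
    if shift < 0:
        raise ValueError("new exponent must not be smaller than the old one")
    return [m >> shift for m in mantissas]


block = [0.5, 0.25, -1.5, 3.0]
mantissas, shared_exp = to_block_fp(block)
decoded = from_block_fp(mantissas, shared_exp)
realigned = align_block(mantissas, shared_exp, shared_exp + 1)
```

Once two blocks are expressed against a common exponent, their mantissas can be multiplied and accumulated with integer arithmetic, and a single exponent adjustment is applied to the final sum.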
24.	Claim 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lee (see citation below) in view of Ou (see citation below).
	Lee is relied upon for the teachings as discussed above relative to Claim 12.
	However, Lee does not teach wherein the variable and mix precision compute unit include computations units and accumulators to perform computations for input vectors, wherein the computations include at least one of spatial and temporal computations including any spatial and temporal combinations.  However, Ou teaches wherein the variable and mix precision compute unit (graphics processing units, variable-precision support, p. 9, 1st paragraph; p. 15) include computations units and accumulators (multiplier-accumulator units, p. 9, 3rd paragraph) to perform computations for input vectors, wherein the computations include at least one of spatial and temporal computations including any spatial and temporal combinations (p. 11; vector machines incorporate temporal execution to process long vectors over multiple cycles, supplemented by chaining to overlap dependent operations, p. 13, 2nd paragraph).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lee so that the variable and mix precision compute unit include computations units and accumulators to perform computations for input vectors, wherein the computations include at least one of spatial and temporal computations including any spatial and temporal combinations because Ou suggests that temporal execution allows long vectors to be processed over multiple cycles, supplemented by chaining to overlap dependent operations (p. 13, 2nd paragraph).
25.	Claim 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ashari (US 20170032487A1).
	Ashari teaches a method for training of data (optimizing machine learning workloads on graphical processing units, [0056]).  For sparse matrices, there is the underlying storage layout (row-major, for compressed sparse row (CSR) representation).  In the case of dense matrices, blocks can be read and kept in shared memory.  The reads from global memory can be coalesced [0065].  Thus, it would have been obvious to one of ordinary skill in the art to obtain a first sparse matrix encoded with compressed sparse row (CSR) and a second dense matrix [0065].  Ashari teaches offloading the second dense matrix in a coalesced manner from memory to a shared local memory (SLM) (for sparse matrices, underlying storage layout (row-major, for compressed sparse row (CSR) representation), in the case of dense matrices, blocks can be read and kept in shared memory, although the reads from global memory can be coalesced, [0065]); and launching a minimum number of workgroups comprising of approximately a total number of hardware threads supported by a graphics processing unit (GPU) (the number of concurrent warps is calculated, to set the coarsening factor C, the goal is to reduce the number of atomic write accesses to global memory, C is set so that all warps have maximal balanced workload, [0093], optimizing machine learning workloads on graphical processing units, GPU kernel launch parameters are estimated following an analytical model that maximizes thread occupancy and minimizes atomic writes to a GPU global memory, [0056]).
26.	Claims 18-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Ashari (US 20170032487A1) in view of Brothers (US 20160267622A1).
27.	As per Claim 18, Ashari is relied upon for the teachings as discussed above relative to Claim 17.
	Ashari does not teach further comprising:  determining a minimum number of workgroups to launch to minimize global memory loads to the SLM and selecting a work group size.  However, Brothers teaches further comprising:  determining a minimum number of workgroups to launch to minimize global memory loads to the SLM and selecting a work group size (minimize a need to read and write kernel data to external memory utilize at least one of resizing workgroups, Abstract).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Ashari to include determining a minimum number of workgroups to launch to minimize global memory loads to the SLM and selecting a work group size because Brothers suggests that reading and writing all of the data for each kernel from the external memory consumes power and has other disadvantages associated with the data traffic to the external memory [0004], and thus it is advantageous to minimize a need to read and write kernel data to external memory (Abstract).
28.	As per Claim 19, Ashari teaches further comprising:  applying a load balancing technique for hardware threads such that each hardware thread completes a first block of data and processes a second block of data that is available (all warps have maximal balanced workload, [0093]).
29.	As per Claim 20, Ashari teaches further comprising:  generating outputs for a Sparse Dense GEMV GPU implementation for training of data (GPUs, with specialized processing to handle both sparse and dense matrices, [0058], Listing 2 on p. 10).
Allowable Subject Matter
30.	Claims 8-10 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
31.	The following is a statement of reasons for the indication of allowable subject matter:  The prior art, taken singly or in combination, does not teach or suggest the combination of all the limitations of Claim 8 and base Claim 7, and in particular, does not teach wherein the customizable circuitry to support an input vertex program supports multiply, accumulate, activate, and send message functions.  Claims 9-10 depend from Claim 8, and therefore also contain allowable subject matter.



Prior Art of Record
1.	Jun Kyu Lee, AIR: Adaptive Dynamic Precision Iterative Refinement, August 2012, University of Tennessee, p. v, 4, 12, 18, 37, https://trace.tennessee.edu/cgi/viewcontent.cgi?article=2671&context=utk_graddiss
2.	Albert Ou, Mixed Precision Vector Processors, December 2015, University of California at Berkeley, p. 9, 11, 13, 15, https://digitalassets.lib.berkeley.edu/techreports/ucb/text/EECS-2015-265.pdf
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONI HSU whose telephone number is (571)272-7785. The examiner can normally be reached M-F 10am-6:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kee Tung, can be reached at (571)272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





JH
/JONI HSU/Primary Examiner, Art Unit 2611