Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
This Office Action is taken in response to Applicant’s Amendment and Remarks filed on November 4, 2020 regarding Application No. 16/406304 originally filed on May 8, 2019.
All independent Claims were amended.
Claims 1, 2, 4-16 and 21-24 are pending for consideration.

	Response to Remarks and Amendments
Applicants’ amendments and remarks have been fully and carefully considered, with the Examiner’s response set forth below.
Regarding the rejection of claim 21, and similarly claim 4, over the combination Lukarski/Yu/Smelyanskiy/Strauss, the applicant argues that, even assuming arguendo that Strauss teaches storing elements of a vector in independently addressable banks, it does not teach storing the entire vector in a first one of the addressable banks, a valid data array in a second one of the addressable banks, and a base array in a third one of the addressable banks [Remarks p.10, 2nd to last paragraph]. The examiner respectfully disagrees.
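On the one hand, Strauss explicitly discloses that the resident element data buffer may be banked so that multiple resident elements may be addressed independently; wherein the resident element data buffer may enable each resident element or a subset of resident elements to be independently selectable by a different parallel processing unit performing a different computation in the same clock cycle [¶0011].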
On the other hand, Lukarski shows the sparse matrix vector multiplication on slide 14 and explicitly shows that all three VAL, X and Y arrays are accessed simultaneously for every multiplication operation. Thus, a person of ordinary skill in the art would know to place all three arrays on separate banks of the multi-banked memory of Strauss, enabling parallel access to all three arrays to speed up the matrix-vector multiplication.
Therefore, the combination Lukarski/Yu/Smelyanskiy/Strauss clearly teaches the particular limitation of "the memory cell array includes a plurality of memory banks, and wherein the valid data array, the base array and the target data array are stored in different memory banks among the plurality of memory banks".
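The parallel-access rationale above can be illustrated with a toy model. The sketch below assumes, purely for illustration, that each memory bank can service one access per cycle and that the three accesses of one multiply step are independent; neither assumption is taken from Strauss or Lukarski, and the names cycles_needed, separate_banks and shared_bank are hypothetical.

```python
from collections import Counter

def cycles_needed(accesses, bank_of):
    """Cycles to service one multiply step in a toy banked memory.

    Assumption (hypothetical model): a bank services one access per
    cycle, so the step takes as many cycles as the busiest bank has
    pending accesses.
    """
    per_bank = Counter(bank_of[name] for name in accesses)
    return max(per_bank.values())

# One multiply step touches the valid data, base and target data arrays.
step = ["valid", "base", "target"]

# Each array in its own bank versus all three arrays in a single bank.
separate_banks = {"valid": 0, "base": 1, "target": 2}
shared_bank = {"valid": 0, "base": 0, "target": 0}

print(cycles_needed(step, separate_banks))  # 1
print(cycles_needed(step, shared_bank))     # 3
```

Under these assumptions, placing the three arrays in different banks lets one step complete in a single cycle, whereas a shared bank serializes the three accesses.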

Claim Objections
Claims 1 and 24 are objected to because of the following informalities: claim 1 recites "the sparce matrix that are non-zero"; "sparce" is a typographical error for "sparse". Claim 24 has similar typographical errors. Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 2, 5, 6, 9 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Lukarski (lecture slides published on April 11, 2013) in view of Yu (US 2016/0188476); and further in view of Smelyanskiy (US 2017/0091103).
Regarding claim 1, Lukarski teaches a matrix multiplication operation of a sparse matrix with a vector [slide 22 shows the multiplication of the test sparse matrix (shown in slide 13) and the vector X; wherein the non-zero values of the sparse matrix multiplication are stored in the VAL array and the whole test sparse matrix is stored using a coordinate (COO) format shown in slide 14]; a valid data array [val array on slide 14; showing all non-zero elements of the test sparse matrix shown on slide 13], a base array [row and col arrays on slide 14 form the base array] and a target data array [slide 22 shows the matrix multiplication using the val array and the target vector x], the valid data array sequentially including valid elements among matrix elements of the sparce matrix that are non-zero [val array on slide 14], the base array sequentially including position elements indicating positions of the valid elements within the sparse matrix [row and col arrays on slide 14], the target data array sequentially including target elements of the vector corresponding to columns of the sparse matrix [x array on slide 22].
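For illustration only, the coordinate (COO) storage and the matrix-vector product discussed above can be sketched as follows. The function name coo_spmv and the 3x3 example matrix are hypothetical and are not taken from Lukarski's slides; zero-based row and column indices are assumed.

```python
def coo_spmv(val, row, col, x, n_rows):
    """Multiply a COO-format sparse matrix by a dense vector x.

    val holds the non-zero values (the "valid data array"), row and col
    hold their positions (the "base array"), and x is the vector whose
    elements are selected by the column indices (the "target data array").
    """
    y = [0.0] * n_rows
    for v, r, c in zip(val, row, col):
        # x[c] is an indirect access: the index c is itself read from col.
        y[r] += v * x[c]
    return y

# Hypothetical example matrix:
# [[5, 0, 0],
#  [0, 0, 2],
#  [1, 0, 3]]
val = [5.0, 2.0, 1.0, 3.0]
row = [0, 1, 2, 2]
col = [0, 2, 0, 2]
x = [1.0, 2.0, 3.0]

print(coo_spmv(val, row, col, x, 3))  # [5.0, 6.0, 10.0]
```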
Lukarski, however, does not explicitly teach a memory device for prefetching data to be used in the matrix multiplication operation of a sparse matrix with a vector, the memory device comprising: a memory cell array configured to store the valid ; an information register configured to store indirect memory access information including a start address of the target data array and a unit size of the target elements; and a prefetch circuit configured to prefetch, based on the indirect memory access information, the target elements corresponding to the position elements that are read from the memory cell array.
Yu, when addressing the issues of storing and accessing a sparse matrix in memory, teaches a memory device for prefetching data to be used in the matrix multiplication operation of a sparse matrix with a vector [matrix, where a matrix can be stored as a two-dimensional array; ¶0022], the memory device comprising: a memory cell array configured to store sparse matrix data [matrix, where a matrix can be stored as a two-dimensional array; ¶0022]; an information register configured to store indirect memory access information [indirect pattern detector 408 keeps track of at least two indices (which are stored in a memory/register(s)) as shown in FIG. 5 as index1 and index2; ¶0052]; and a prefetch circuit configured to prefetch, based on the indirect memory access information, the target elements corresponding to the position elements that are read from the memory cell array [prefetching can begin with the indirect prefetcher 136; when the next element in the stream is accessed (B[n]), the indirect prefetcher can jump a certain distance (k) ahead and read B[n+k]. When that read completes, the indirect prefetcher 136 prefetches address=coeff*B[n+k]+base_address; ¶0053].
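As a sketch of the arithmetic only, the look-ahead relation cited from Yu ¶0053, read here as address = coeff*B[n+k] + base_address, could be modeled as below. The function name indirect_prefetch_address, the index values and the base address are hypothetical; the sketch does not represent Yu's hardware.

```python
def indirect_prefetch_address(index_array, n, k, coeff, base_address):
    """Address to prefetch when B[n] is accessed, looking k elements ahead.

    Returns None when n + k runs past the end of the index array, in
    which case this sketch issues no prefetch.
    """
    ahead = n + k
    if ahead >= len(index_array):
        return None
    return coeff * index_array[ahead] + base_address

# Hypothetical values: 8-byte elements, data array starting at 0x1000.
B = [3, 0, 7, 2, 5]
addr = indirect_prefetch_address(B, n=0, k=2, coeff=8, base_address=0x1000)
print(hex(addr))  # 0x1038, since B[2] = 7 and 8 * 7 = 56 = 0x38
```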
On the one hand, Lukarski shows a standard way to store a sparse matrix in memory using the COO format shown in slide 14. This format explicitly includes a valid data array and a base array. What’s more when directly comparing the sparse matrix, 
On the other hand, Yu shows how to use registers to access the elements of a sparse matrix, further using indirect memory access, i.e. wherein the indices access an array that in turn contains the locations of the matrix elements.
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify Lukarski to use registers and indirect memory addressing as disclosed in Yu when accessing the sparse matrix elements; and to further use prefetching as in Yu to bring the matrix elements from memory before they are needed, masking the communication delay of the memory system and speeding up memory access during matrix operations. The combination would have been obvious because a person of ordinary skill in the art would know to use a known technique (i.e. using registers, indirect memory access and prefetching) to improve similar devices in the same way.
Smelyanskiy, when addressing the issues of prefetching data for matrix multiplication [¶0150], discloses the information register further configured to include a start address of the target data array and a unit size of the target elements [VGATHERPF may accept as parameters a base address ("base") and an index of a vector ("vind"); wherein VGATHERPF may issue prefetches for uops for every address in index register base+vind[0] ... base+vind[15], or another size, depending upon ; and wherein the prefetch circuit prefetching target elements corresponding to the position elements that are read from the memory cell array [there are indirect accesses to the x vector; wherein VGATHERPF may be used to augment code 1904 so that prefetches are made for values of x and colidx (column identifier in x); wherein in code 1906, calls to VGATHERPF include use of a command "prefetchdist" which prefetches some number of iterations ahead; however, the VGATHERPF may introduce extra instructions or uops as it issues prefetch requests for every element of x[colidx[j]], regardless of whether a given value is in the cache or not; ¶0156-186].
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to further implement indirect access and prefetching to the vectors for a sparse matrix vector multiplication as disclosed in Smelyanskiy. The combination would have been obvious because a person of ordinary skill in the art would know to apply a known technique to a known device ready for improvement to yield predictable results.
Regarding claim 2, Lukarski/Yu/Smelyanskiy teaches the memory device of claim 1, wherein the indirect memory access information is provided from an external memory controller to the memory device [processor 500 may include a memory hierarchy comprising one or more levels of caches within the cores, one or more shared cache units such as caches 506, or external memory (not shown) coupled to the set of integrated memory controller units 552; ¶0093 on Smelyanskiy].
Regarding claim 5, Lukarski/Yu/Smelyanskiy teaches the memory device of claim 1, wherein the prefetch circuit includes: an arithmetic circuit configured to calculate target addresses corresponding to the read position elements based on the read position elements, the start address of the target data array and the unit size of the target elements [the indirect address and index value follow: indirect_address=coeff*index+base_address; wherein coeff is the size of the data element, index is the index array such as B[i], and base_address is the address of A[0]; ¶0050-51 on Yu];
a target address register configured to store the target addresses calculated by the arithmetic circuit [the coeff value and base_address value are stored next to the stream information in the indirect table entry 412 as shown in FIG. 4; ¶0053 on Yu]; and
a target data register configured to store the target elements that are read from the target addresses of the memory cell array [destination and source registers/data may be generic terms to represent the source and destination of the corresponding data or operation; ¶0047 on Smelyanskiy].
Regarding claim 6, Lukarski/Yu/Smelyanskiy teaches the memory device of claim 5, wherein the arithmetic circuit calculates an ith one of the target addresses by multiplying the unit size by B(i)-1 and adding the start address to a result of the multiplying, where B(i) indicates an ith one of the position elements in the base array [indirect_address=coeff*index+base_address; ¶0050-51 on Yu].
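The two address expressions discussed for claim 6 can be compared with a short sketch. It assumes, as an illustration not stated in the Office action, that the "B(i)-1" term corresponds to one-based position elements, so that B(i)-1 plays the role of a zero-based index in Yu's relation. The function names and numeric values are hypothetical.

```python
def claimed_target_address(start_address, unit_size, position_element):
    """Claim 6 arithmetic: unit size times (B(i) - 1), plus start address."""
    return unit_size * (position_element - 1) + start_address

def yu_indirect_address(coeff, index, base_address):
    """Yu's relation: indirect_address = coeff * index + base_address."""
    return coeff * index + base_address

# Hypothetical values: 4-byte target elements starting at address 0x2000.
start, unit = 0x2000, 4
for b_i in (1, 2, 3):
    # With the assumed one-based numbering, B(i) - 1 is the zero-based index.
    a = claimed_target_address(start, unit, b_i)
    assert a == yu_indirect_address(unit, b_i - 1, start)
    print(b_i, hex(a))
```

Under that assumption the two expressions produce the same address for every position element; the sketch shows the arithmetic only.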
Regarding claim 9, Lukarski/Yu/Smelyanskiy teaches the memory device of claim 5, wherein the target data register is implemented with a static random access memory (SRAM) [SRAM; ¶0053 and ¶0060].
Regarding claim 10, Lukarski/Yu/Smelyanskiy teaches the memory device of claim 1, wherein the indirect memory access information further includes a start address of the base array [base_address; FIG. 5 and ¶0051-54 on Yu], a unit size of the position elements [coeff is the size of the data element; ¶0051-54 on Yu], a total number of the position elements and a read number of the position elements that are read simultaneously from the memory cell array [the core may support multithreading (executing two or more parallel sets of operations or threads) in a variety of manners; wherein multithreading support may be performed by, for example, including time sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that physical core is simultaneously multithreading), or a combination thereof (¶0089); and wherein instructions corresponding to the predicted branch of execution may be loaded into an instruction cache. In 1615, one or more such instructions in the instruction cache may be fetched for execution. In 1620, the instructions that have been fetched may be decoded into microcode or more specific machine language. In one embodiment, multiple instructions may be simultaneously decoded (¶0144)].
Claims 4, 21, 22, 23 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Lukarski in view of Yu further in view of Smelyanskiy; and still further in view of Strauss (US 2015/0067273).
Regarding claim 4, Lukarski/Yu/Smelyanskiy explicitly teach all the limitations of the memory device of claim 1, but do not explicitly teach wherein the memory cell array includes a plurality of memory banks, and wherein the valid data array, the base array and the target data array are stored in different memory banks among the plurality of memory banks.
Strauss, in analogous art, teaches wherein the memory cell array includes a plurality of memory banks, and wherein the valid data array, the base array and the target data array are stored in different memory banks among the plurality of memory banks [every bank of the resident element data buffer may be connected to the queue insertion controller and the parallel processing unit may choose which resident elements to copy from the selected resident elements driven from the resident element data buffer; ¶0041].
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to store the elements of the multiplication into separate and parallel banks as disclosed in Strauss. The combination would have been obvious because a person of ordinary skill in the art would know to apply a known technique to a known device ready for improvement to yield predictable results.
Regarding claim 21, the limitations of this claim are substantially similar to those of claim 1 and, thus, are rejected on the same grounds.
Claim 21, however, further recites a random access memory (RAM); and wherein the memory cell array includes a plurality of memory banks, and wherein the valid data array, the base array and the target data array are stored in different memory banks among the plurality of memory banks.
Strauss, in analogous art, teaches a random access memory (RAM) [the off-chip storage device 120 includes DRAM dedicated to the computation device 110; ¶0021]; and wherein the memory cell array includes a plurality of memory banks, and wherein the valid data array, the base array and the target data array are stored in different memory banks among the plurality of memory banks [the resident element multi-banked buffer that stores each resident element (e.g., value) in an individually addressable storage device location; wherein in the example where the computation device is configured to perform a sparse matrix-vector multiplication computation, each value of the vector may be stored at a different addressable location of the resident element data buffer; ¶0030 and ¶0041].
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to use a multi-banked buffer as disclosed in Strauss to store the different elements needed for the matrix multiplication of Lukarski. The combination would have been obvious because a person of ordinary skill in the art would know to use a known technique (using a multi-bank memory to enable parallel access to the stored elements) to improve similar devices in the same way.
More particularly, Strauss explicitly discloses that the resident element data buffer may be banked so that multiple resident elements may be addressed independently; wherein the resident element data buffer may enable each resident element or a subset of resident elements to be independently selectable by a different parallel processing unit performing a different computation in the same clock cycle [¶0011]. Strauss further discloses that, to speed up sparse matrix-vector multiplication, the resident element data buffer 206 may be a multi-banked buffer that stores each resident element (e.g., value) in an individually addressable storage device location [¶0030].
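Finally, Lukarski shows the sparse matrix vector multiplication on slide 14 and explicitly shows that all three VAL, X and Y arrays are accessed simultaneously for every multiplication operation. Thus, a person of ordinary skill in the art would know to place all three arrays on separate banks of the multi-banked memory of Strauss, enabling parallel access to all three arrays to speed up the matrix-vector multiplication.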
Regarding claims 22 and 23, the limitations of these claims are substantially similar to those of claims 5 and 6 and, thus, are rejected on the same grounds.
Regarding claim 24, Lukarski/Yu/Smelyanskiy teach the memory device of claim 21, wherein the first data is a sparce matrix, the second data is a vector, the valid elements are non-zero elements of the sparce matrix, the position elements indicate positions of the valid elements within the sparce matrix, and the target elements are elements of the vector corresponding to columns of the sparse matrix [Lukarski slides 13 and 14 show the sparse matrix, VAL, X and Y arrays].
Claims 11 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Lukarski in view of Yu further in view of Smelyanskiy; and still further in view of Morad (US 2016/0224465).
Regarding claim 11, Lukarski/Yu/Smelyanskiy explicitly teach all the claim limitations except for the memory device of claim 1, further comprising: a calculation circuit configured to perform a processing-in-memory operation based on the first data and the second data to provide calculation result data.
Morad, in analogous art, discloses that Sparse Matrix multiplication (SpMM) on GP-SIMD [0030] on a large set of matrices, demonstrated potentially improved power efficiency of 20× relative to a number of GPU designs [¶0370]; and further discloses that This area reduction benefit is much more pronounced in processing-in-memory architectures than in other many-cores [¶0391].

Regarding claim 12, Lukarski/Yu/Smelyanskiy teach the memory device of claim 11, wherein the first data is a sparse matrix [¶0021 on Yu] and the second data is a vector [¶0156-186 on Smelyanskiy]. 
Lukarski/Yu/Smelyanskiy, however, does not explicitly teach wherein the processing-in-memory operation performs a sparse matrix vector multiplication.
Morad, in analogous art, discloses that Sparse Matrix multiplication (SpMM) on GP-SIMD [0030] on a large set of matrices, demonstrated potentially improved power efficiency of 20× relative to a number of GPU designs [¶0370]; and further discloses that This area reduction benefit is much more pronounced in processing-in-memory architectures than in other many-cores [¶0391].
Therefore, it would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to implement the indirect addressing of Lukarski/Yu/Smelyanskiy on a processing-in-memory architecture as disclosed in Morad. The combination would have been obvious because a person of ordinary skill in the art would know to use a known technique (i.e. processing-in-memory) to improve similar devices in the same way (i.e. by reducing the area and power consumption).

Allowable Subject Matter
Claims 7 and 8 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claims 13-16 are found to be non-obvious over the prior art of record and hence are allowed.
Claims 7, 8 and 13-16 are directed to a particular parallel architecture for implementing the matrix processing of the claimed invention. The examiner finds the claimed particular architecture of claims 7, 8 and 13-16 to be non-obvious over the prior art of record.


Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to RAMON A MERCADO whose telephone number is (571)270-5744.  The examiner can normally be reached on Monday to Friday from 7:00AM to 3:00PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, David Yi, can be reached on 571-270-7519.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://portal.uspto.gov/external/portal. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
/Ramon A. Mercado/
Primary Examiner, Art Unit 2132