DETAILED ACTION
This action is responsive to the amendment filed 07/08/2022. Claims 1, 3, 5-6, 8-9, 11-13, 15-17 and 19 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 1 is objected to because of the following informalities:
The Examiner respectfully notes that "data compression module" has been amended to "cache controller" in Claim 1 and other claims. However, one instance of "data compression module" remains unamended in Claim 1. It appears that this instance was overlooked in the amendment.
Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 3, 6, 8-9, 11 and 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Pool (US 20190081637 A1), referred to herein as Pool, in view of Uhrenholt et al. (US 20210216464 A1), referred to herein as Uhrenholt, further in view of Venkatesh et al. (US 20210011846 A1), referred to herein as Venkatesh, further in view of Sakthivel et al. (US 20180300246 A1), referred to herein as Sakthivel.
Regarding Claim 1, Pool teaches 
A system for facilitating machine learning utilizing a multi-core processor, the system comprising: a processor comprising a plurality of processor cores; (Pool [0003] This technology relates to deep learning, machine learning and artificial intelligence. More particularly, the technology herein relates to graphics processing unit (GPU) memory architectures including hardware-based sparse data compression/decompression (CODEC) capabilities for compressing and decompressing sparse data sets of the type use in/generated by deep neural networks (DNNs). [0129] The processing system 600 may be a multicore processing system comprising a GPU(s) and/or CPU(s), but is not so limited)
and a cache controller configured to: retrieve a chunk of data from the data array; (Pool [0041] As one example, it is helpful to include a data codec (compressor/decompressor) in the (e.g., L2) cache of the processing system to compress (in hardware) before saving to memory, and decompressing (again in hardware) upon retrieval from memory for further processing.)
wherein the chunk of data is comprised within an activation map of a neural network, and wherein the activation map comprises intermediate results computed by a stage of the neural network;
(Pool [0003] This technology relates to deep learning, machine learning and artificial intelligence, and to sparse data compressors and decompressors in such systems. [0123] … the data may be transformed by removing certain values that are not needed for an expected workload or by modifying the values that are represented by the data. For example, in deep learning a common operation is to throw away values that are below zero. This is a typical activation function used in convolutional networks called ReLU (a rectified linear unit). The result is that all the values that are negative are changed to zero. [0043] In deep learning applications, the data may have a high rate of sparsity (e.g., a preponderance of zeros or ones in binary data). The high rate of sparsity may be in either the data representing the deep learning network itself or the data flowing through the network.)
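For illustration only, the sparsity described in Pool [0123] and [0043] can be sketched in a few lines of Python. The function name relu and the sample values below are hypothetical and are not taken from Pool or from the instant application; the sketch merely shows how changing negative values to zero produces data with a preponderance of zeros.

```python
def relu(values):
    """Replace every negative value with zero (hypothetical helper)."""
    return [v if v > 0 else 0 for v in values]


# Hypothetical intermediate results of one stage of a network.
stage_output = [-2, 5, -1, 0, 3, -7, -4, 1]
activations = relu(stage_output)

# Fraction of elements that are zero after the activation step.
sparsity = activations.count(0) / len(activations)
```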
and compress the chunk of data, (Pool [0066] The method includes receiving an uncompressed data set for compression 210. The received data set may be a neural network data set.) (i.e. Pool’s compressor performs the same function as the instant application’s data compression.)
wherein the chunk of data comprises elements, and wherein, in order to compress the chunk of data, the data compression module is further configured to: calculate a bit mask for the chunk of data; using the bit mask, shift out elements in the chunk of data corresponding to zero values, wherein non-zero value elements in the chunk of data are retained, and write the bit mask and the non-zero value elements of the chunk of data to a memory, (Pool [0043] One approach is to compress the data by removing essentially all the zeros or all the ones from the data, such that, when the data is fetched, less data needs to be transferred…A simple way to keep track of the removed zeros or ones is with a bitmask, where each bit in the bitmask represents an element in the data. [0041] … in the (e.g., L2) cache of the processing system to compress (in hardware) before saving to memory.) (i.e. using the bitmask, zeros are shifted out and non-zero values are retained and transferred.)
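For illustration only, the bitmask approach quoted from Pool [0043] can be sketched as follows. This is a minimal software sketch under stated assumptions, not Pool's hardware codec or the claimed cache controller: the function name compress_chunk is hypothetical, elements are assumed to be bytes, and a mask bit of 1 is assumed to mark a non-zero element (the convention described in Pool [0047]).

```python
from typing import Tuple


def compress_chunk(chunk: bytes) -> Tuple[int, bytes]:
    """Return (bit_mask, non_zero_bytes) for one chunk (hypothetical helper).

    Bit i of the mask is 1 when chunk[i] is non-zero and 0 when it is zero.
    Zero-valued bytes are dropped; non-zero bytes are kept in order.
    """
    mask = 0
    kept = bytearray()
    for i, value in enumerate(chunk):
        if value != 0:
            mask |= 1 << i      # record the position of the non-zero element
            kept.append(value)  # retain the non-zero element
    return mask, bytes(kept)


# Hypothetical 8-byte chunk with non-zero bytes at positions 1, 4 and 6.
mask, kept = compress_chunk(bytes([0, 7, 0, 0, 3, 0, 9, 0]))
```

In this sketch only the mask and the three retained bytes would need to be written to memory in place of the original eight bytes.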
Pool teaches multicore processing, compressing data in the cache before saving it to memory, and decompressing compressed data upon retrieval from memory, but Pool does not teach a plurality of cache slices. Pool teaches storing decompressed data in the cache after decompressing compressed data retrieved from memory. A cache is known to have a tag array and a data array, the data array being where data values are stored, but Pool does not explicitly teach a data array operable to store decompressed data, wherein to perform the write of the non-zero value elements, the memory is accessed each time a cache line is filled with the non-zero value elements.
However, Uhrenholt teaches a plurality of cache slices, wherein each processor core of the plurality of processor cores is configured to access each of the plurality of cache slices (Uhrenholt [0185] In both FIGS. 3 and 4, the L2 cache 24 is shown as being configured as respective separate physical cache portions (slices) 30.) and each cache slice comprises: a data array operable to store decompressed data, (Uhrenholt [0070] The data that is stored in the cache (and that the processor is using when performing a processing operation) can comprise any suitable and desired data that a data processor may operate on. The data in an embodiment comprises data of a data array that the processor is processing, the data array comprising an array of data elements each having an associated data value(s).[0088] In an embodiment the processing unit that makes the read request to the cache is a data encoder that is operable to compress data from the cache for writing back to the memory system. In this case, the data encoder (processing unit) is in an embodiment also operable to decompress compressed data from the memory system for writing into the cache in an uncompressed form) wherein to perform the write of the non-zero value elements, the memory is accessed each time a cache line is filled with the non-zero value elements, ([0065] Thus, in an embodiment, each cache line is associated with a “processing unit” flag (bit) to indicate whether that cache line (the data in that cache line) should be processed by the processing unit or not. This indication may be set, for example, on cache line fill or write from lower level caches. 
[0089] Thus, in an embodiment, the processing unit is a data encoder associated with the cache, that is configured to, when data is to be written from the cache to the memory system, encode uncompressed data from the cache for storing in the memory system in a compressed format and send the data in the compressed format to the memory system for storing, and when data in a compressed format is to be read from the memory system into the cache, decode the compressed data from the memory system and store the data in the cache in an uncompressed format.)
Pool and Uhrenholt are analogous art because they are from the same field of memory control. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Pool and Uhrenholt before him or her, to modify the cache memory of Pool to have the plurality of cache slices and data array of Uhrenholt. The motivation for doing so would be the enhanced performance and capacity of the compressor/decompressor data processing in cache.
Pool teaches the ReLU function, which is one of the activation functions, and the activation map is the visual representation of activation function member, but Pool in view of Uhrenholt does not explicitly teach an activation map.
However, Venkatesh teaches activation map (Venkatesh[0040] In some embodiments, the convolutional layer is the core building block of a neural network 114 (e.g., configured as a CNN)… Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.)
Pool, Uhrenholt, and Venkatesh are analogous art because they are from the same field of data processing with Venkatesh specifically related to sparse data in a neural network. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Pool, Uhrenholt, and Venkatesh before him or her to modify the compressor/decompressor of Pool with the teaching of Uhrenholt and Venkatesh. The motivation for doing so would be the enhanced performance and capacity of the compressor/decompressor data processing in cache, and especially be able to handle the sparse data from the neural network.
Pool-Uhrenholt-Venkatesh does not teach the memory is accessed each time a cache line is filled with the non-zero value elements following a convolution operation of the neural network. 
However, Sakthivel teaches following a convolution operation of the neural network (Sakthivel abst: In an example, an apparatus comprises a plurality of processing unit cores, a plurality of cache memory modules associated with the plurality of processing unit cores, and a machine learning model communicatively coupled to the plurality of processing unit cores, wherein the plurality of cache memory modules share cache coherency data with the machine learning model. [0131] a specialized cache eviction strategy may keep data of frames {N, N+1} in cache after computing a 3D convolution across frames [0133] When convolution is computed with a sliding window approach, tiles in frame N that have no more dependencies in future convolution operations can be evicted from the cache.) (i.e. memory access following convolution operation)
Pool, Uhrenholt, Venkatesh and Sakthivel are analogous art because they are from the same field of data processing with Venkatesh specifically related to sparse data in a neural network. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Pool, Uhrenholt, Venkatesh and Sakthivel before him or her, to modify the Pool-Uhrenholt-Venkatesh system with the teaching of Sakthivel. The motivation for doing so would be to have (Sakthivel abst) cache memory modules that share cache coherency data with the machine learning model.
Regarding Claim 3, Pool, Uhrenholt, Venkatesh and Sakthivel teach
The system of Claim 1, wherein the elements in the chunk of data comprise bytes of data, wherein a length of the chunk of data is fixed, and further wherein the length of the chunk of data is programmable. (Pool [0003] This technology also relates to data inspection for determining data type and/or compression/decompression unit size; [0067] The granularity may be selected from a plurality of possible granularities. The granularity may be selected based on the distribution of element values (e.g., zero valued bytes) in all or a subset of the received data set.) (i.e. because different granularities can be selected, the length of the chunk of data is programmable; once a granularity is selected, the length of the chunk of data is fixed.)
Regarding Claim 6, Pool, Uhrenholt, Venkatesh and Sakthivel teach
The system of Claim 1, wherein to calculate the bit mask the cache controller is further configured to compare each element in the chunk of data with zero, and wherein each bit in the bit mask is operable to indicate whether a corresponding element in the chunk of data comprises zero-values.  (Pool [0043]  One approach is to compress the data by removing essentially all the zeros or all the ones from the data, …A simple way to keep track of the removed zeros or ones is with a bitmask, where each bit in the bitmask represents an element in the data.)
Regarding Claim 8, Pool teaches
A system for performing data decompression in a cache controller of a multi-core processor, the system comprising: a processor comprising a plurality of processor cores; and a cache associated with the processor, (Pool [0003] This technology relates to deep learning, machine learning and artificial intelligence. More particularly, the technology herein relates to graphics processing unit (GPU) memory architectures including hardware-based sparse data compression/decompression (CODEC) capabilities for compressing and decompressing sparse data sets of the type use in/generated by deep neural networks (DNNs). [0041] As one example, it is helpful to include a data codec (compressor/decompressor) in the (e.g., L2) cache of the processing system. [0129] The processing system 600 may be a multicore processing system comprising a GPU(s) and/or CPU(s), but is not so limited)
and a cache controller configured to: retrieve a bit mask corresponding to a compressed chunk of data from a memory associated with the processor; retrieve non-zero value elements corresponding to the compressed chunk of data from the memory, (Pool Fig 3 decompressor [0043] One approach is to compress the data by removing essentially all the zeros or all the ones from the data, such that, when the data is fetched, less data needs to be transferred…A simple way to keep track of the removed zeros or ones is with a bitmask, where each bit in the bitmask represents an element in the data.)
wherein the compressed chunk of data is associated with an activation map of a neural network model, and wherein the activation map comprises intermediate results computed by a stage of the neural network (Pool [0003] This technology relates to deep learning, machine learning and artificial intelligence, and to sparse data compressors and decompressors in such systems. [0123] … the data may be transformed by removing certain values that are not needed for an expected workload or by modifying the values that are represented by the data. For example, in deep learning a common operation is to throw away values that are below zero. This is a typical activation function used in convolutional networks called ReLU (a rectified linear unit). The result is that all the values that are negative are changed to zero. [0043] In deep learning applications, the data may have a high rate of sparsity (e.g., a preponderance of zeros or ones in binary data). The high rate of sparsity may be in either the data representing the deep learning network itself or the data flowing through the network.)
wherein the non-zero value elements are retrieved from the memory (Pool [0070] The compressed data set may have been stored in main memory, may have remained resident in the L2 cache, or may have been transmitted between processors. In the example shown, the decompression method 300 reads the mask generated by compression method 200 and determines the inferred granularity that the compression method 200 used to compress the data.) (i.e. retrieve compressed non-zero value data from memory.)
using the bit mask, decompress the compressed chunk of data by shifting the non-zero value elements to insert zero value elements at locations indicated by the bit mask; (Pool [0071] Once the decompression method 300 knows the granularity used by the compression method 200 to compress the data, it uses the mask to insert redundant values (e.g., zeros or ones in the case of binary representations) into positions in the output data as specified by the mask.)
and write the decompressed chunk of data to the data array. (Pool [0071] the decompression method 300 outputs the decompressed data set (e.g., by storing it in the L2 cache and thereby making it available to the processor).)
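For illustration only, the mask-driven decompression quoted from Pool [0071] can be sketched as follows. This is a minimal software sketch under stated assumptions, not Pool's decompression method 300 itself: the function name decompress_chunk is hypothetical, elements are assumed to be bytes, the chunk length is assumed to be known, and a mask bit of 1 is assumed to mark a non-zero element (the convention described in Pool [0047]).

```python
def decompress_chunk(mask: int, non_zero: bytes, length: int) -> bytes:
    """Rebuild a chunk of `length` bytes from its mask and non-zero bytes.

    Positions whose mask bit is 0 are filled with zero; positions whose
    mask bit is 1 take the next stored non-zero byte, in order.
    """
    out = bytearray(length)  # bytearray(length) starts as all zeros
    src = 0                  # index of the next stored non-zero byte
    for i in range(length):
        if (mask >> i) & 1:
            out[i] = non_zero[src]
            src += 1
    return bytes(out)


# Hypothetical compressed chunk: non-zero bytes at positions 1, 4 and 6.
restored = decompress_chunk(0b01010010, bytes([7, 3, 9]), 8)
```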
Pool teaches multicore processing, compressing data in the cache before saving it to memory, and decompressing compressed data upon retrieval from memory, but Pool does not teach that the cache is distributed between a plurality of cache slices, wherein each processor core of the plurality of processor cores is operable to access each of the plurality of cache slices. Pool teaches storing decompressed data in the cache after decompressing compressed data retrieved from memory. A cache is known to have a tag array and a data array, the data array being where data values are stored, but Pool does not explicitly teach that each cache slice comprises: a data array operable to store decompressed data. Pool teaches a decompressor that retrieves non-zero value elements from memory, but does not explicitly teach wherein the non-zero value elements are retrieved from the memory a single cache line at a time.
However, Uhrenholt teaches the cache is distributed between a plurality of cache slices, wherein each processor core of the plurality of processor cores is operable to access each of the plurality of cache slices, (Uhrenholt [0185] In both FIGS. 3 and 4, the L2 cache 24 is shown as being configured as respective separate physical cache portions (slices) 30.)  and wherein each cache slice comprises: a data array operable to store decompressed data; (Uhrenholt [0070] The data that is stored in the cache (and that the processor is using when performing a processing operation) can comprise any suitable and desired data that a data processor may operate on. The data in an embodiment comprises data of a data array that the processor is processing, the data array comprising an array of data elements each having an associated data value(s).[0088] In an embodiment the processing unit that makes the read request to the cache is a data encoder that is operable to compress data from the cache for writing back to the memory system. In this case, the data encoder (processing unit) is in an embodiment also operable to decompress compressed data from the memory system for writing into the cache in an uncompressed form.) wherein the non-zero value elements are retrieved from the memory a single cache line at a time (Uhrenholt [0088] In an embodiment the processing unit that makes the read request to the cache is a data encoder that is operable to compress data from the cache for writing back to the memory system. In this case, the data encoder (processing unit) is in an embodiment also operable to decompress compressed data from the memory system for writing into the cache in an uncompressed form. [0067] Thus, in an embodiment, some entries in the cache (cache lines) will be handled by the processing unit, whereas other entries in the cache (cache lines) may be handled in the normal manner for the cache and cache system in question. 
[0183] such as an indication of whether the data in the cache line is stored in the memory in a compressed or uncompressed form, and if it is compressed, the number of memory transactions needed to fetch the compressed data. [0065] Thus, in an embodiment, each cache line is associated with a “processing unit” flag (bit) to indicate whether that cache line (the data in that cache line) should be processed by the processing unit or not. This indication may be set, for example, on cache line fill or write from lower level caches. [0089] Thus, in an embodiment, the processing unit is a data encoder associated with the cache, that is configured to, when data is to be written from the cache to the memory system, encode uncompressed data from the cache for storing in the memory system in a compressed format and send the data in the compressed format to the memory system for storing, and when data in a compressed format is to be read from the memory system into the cache, decode the compressed data from the memory system and store the data in the cache in an uncompressed format.)
Pool and Uhrenholt are analogous art because they are from the same field of memory control. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Pool and Uhrenholt before him or her, to modify the cache memory of Pool to have the plurality of cache slices and data array of Uhrenholt, and to modify the decompressor of Pool to read a single cache line at a time per the teaching of Uhrenholt. The motivation for doing so would be the enhanced performance of the decompressor for data processing in cache.
Pool teaches the ReLU function, which is one of the activation functions, and the activation map is the visual representation of activation function member, but Pool in view of Uhrenholt does not explicitly teach an activation map.
However, Venkatesh teaches activation map (Venkatesh[0040] In some embodiments, the convolutional layer is the core building block of a neural network 114 (e.g., configured as a CNN)… Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.)
Pool, Uhrenholt, and Venkatesh are analogous art because they are from the same field of data processing with Venkatesh specifically related to sparse data in a neural network. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Pool, Uhrenholt, and Venkatesh before him or her, to modify the compressor/decompressor of Pool with the teaching of Uhrenholt and Venkatesh. The motivation for doing so would be the enhanced performance and capacity of the compressor/decompressor data processing in cache, and especially the ability to handle the sparse data from the neural network.
Pool-Uhrenholt-Venkatesh teaches that the bit mask is retrieved prior to processing the data (Pool [0041] To further increase storage capacity, decrease memory latency and reduce amount of data that is transmitted at various phases and stages of processing the data, it is desirable to compress and decompress the data. As one example, it is helpful to include a data codec (compressor/decompressor) in the (e.g., L2) cache of the processing system to compress (in hardware) before saving to memory, and decompressing (again in hardware) upon retrieval from memory for further processing. [0047] Only the non-zero values are explicitly stored or transmitted to the decompressor. (The decompressor is able to populate the zero values based on the bit mask alone.)) (i.e. the bit mask is retrieved, together with the compressed data, from memory for performing decompression, prior to processing the data.)
Pool-Uhrenholt-Venkatesh does not teach and wherein the bit mask is retrieved prior to a convolution operation of the neural network;
However, Sakthivel teaches prior to a convolution operation of the neural network (Sakthivel abst: In an example, an apparatus comprises a plurality of processing unit cores, a plurality of cache memory modules associated with the plurality of processing unit cores, and a machine learning model communicatively coupled to the plurality of processing unit cores, wherein the plurality of cache memory modules share cache coherency data with the machine learning model. [0131] a specialized cache eviction strategy may keep data of frames {N, N+1} in cache after computing a 3D convolution across frames [0133] When convolution is computed with a sliding window approach, tiles in frame N that have no more dependencies in future convolution operations can be evicted from the cache.) (i.e. retrieve data from memory prior to convolution operation)
Pool, Uhrenholt, Venkatesh and Sakthivel are analogous art because they are from the same field of data processing with Venkatesh specifically related to sparse data in a neural network. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Pool, Uhrenholt, Venkatesh and Sakthivel before him or her, to modify the Pool-Uhrenholt-Venkatesh system with the teaching of Sakthivel. The motivation for doing so would be to have (Sakthivel abst) cache memory modules that share cache coherency data with the machine learning model.
Regarding Claim 9, Pool, Uhrenholt, Venkatesh and Sakthivel teach
The system of Claim 8, wherein the non-zero value elements comprise bytes of non-zero value data and the zero value elements comprise bytes of zero value data. (Pool [0047] Here, the mask data thus specifies 8 bytes (each byte corresponding to a bit in the mask). The “1” values in the data mask indicate non-zero values in the input data block, and “0” values indicate zero values. Only the non-zero values are explicitly stored or transmitted to the decompressor. (The decompressor is able to populate the zero values based on the bit mask alone.))
Regarding Claim 11, Pool, Uhrenholt, Venkatesh and Sakthivel teach
The system of Claim 8, wherein prior to retrieving the non-zero value elements, the cache controller is further configured to use the bit mask to non-zero value elements to be retrieved from the memory.  (Pool [0071] Once the decompression method 300 has parsed the entirety of the mask and has constructed all successive bytes in the output data based on the mask, the decompression method 300 outputs the decompressed data set (e.g., by storing it in the L2 cache and thereby making it available to the processor).)  
Regarding Claim 13, Pool, Uhrenholt, Venkatesh and Sakthivel teach
The system of Claim 8, wherein each bit in the bit mask is operable to indicate whether a corresponding element in the decompressed chunk of data comprises zero-values. (Pool [0047] Here, the mask data thus specifies 8 bytes (each byte corresponding to a bit in the mask). The “1” values in the data mask indicate non-zero values in the input data block, and “0” values indicate zero values. Only the non-zero values are explicitly stored or transmitted to the decompressor. (The decompressor is able to populate the zero values based on the bit mask alone.))
Pool teaches retrieving compressed non-zero value elements from memory, but does not teach calculating a number of cache lines.
However, Uhrenholt teaches calculating a number of cache lines (Uhrenholt [0183] such as an indication of whether the data in the cache line is stored in the memory in a compressed or uncompressed form, and if it is compressed, the number of memory transactions needed to fetch the compressed data. [0196] each data block is configured to occupy an integer number cache lines in its uncompressed form, when compressed, particularly if using a variable rate compression scheme, the data may compress to a different (and smaller) number of cache lines (and thus corresponding memory transactions))
Pool and Uhrenholt are analogous art because they are from the same field of memory control. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Pool and Uhrenholt before him or her, to modify the decompressor of Pool to use the calculation of cache lines from the teaching of Uhrenholt. The motivation for doing so would be the enhanced performance of the decompressor for data processing in cache.
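For illustration only, one way a number of cache lines could be derived from a bit mask is sketched below. Neither Pool nor Uhrenholt is cited as disclosing this particular formula; the function name cache_lines_needed, the 64-byte cache line size, and the byte-sized elements are all assumptions made for the example.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size; not taken from the references


def cache_lines_needed(mask: int, line_bytes: int = CACHE_LINE_BYTES) -> int:
    """Return how many cache lines hold the non-zero bytes flagged by `mask`.

    Each set bit in the mask is assumed to stand for one stored non-zero byte.
    """
    non_zero_bytes = bin(mask).count("1")   # population count of the mask
    return -(-non_zero_bytes // line_bytes)  # ceiling division


# Hypothetical mask with 65 set bits: 65 non-zero bytes span two 64-byte lines.
lines = cache_lines_needed((1 << 65) - 1)
```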
Claim(s) 5, 12, 15-17 and 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Pool (US 20190081637 A1), referred to herein as Pool, in view of Uhrenholt et al. (US 20210216464 A1), referred to herein as Uhrenholt, further in view of Venkatesh et al. (US 20210011846 A1), referred to herein as Venkatesh, further in view of Sakthivel et al. (US 20180300246 A1), referred to herein as Sakthivel, further in view of Gaither et al. (US 20070016724 A1), referred to herein as Gaither.
Regarding Claim 5, Pool, Uhrenholt, Venkatesh and Sakthivel teach
The system of Claim 1, write the non-zero value elements (Pool [0043] One approach is to compress the data by removing essentially all the zeros or all the ones from the data, such that, when the data is fetched, less data needs to be transferred…A simple way to keep track of the removed zeros or ones is with a bitmask, where each bit in the bitmask represents an element in the data.)
Pool-Uhrenholt-Venkatesh-Sakthivel does not teach wherein the cache controller is further configured to access the memory to write the non-zero value elements a single cache line at a time until the chunk of data is fully compressed. 
However, Gaither teaches wherein the cache controller is further configured to access the memory to write the non-zero value elements a single cache line at a time until the chunk of data is fully compressed.  (Gaither [0024] The RAM controller may then write the compressed data to memory, taking care to align the compressed data line on the same default boundaries as an un-compressed version of the cache line would have used. (i.e. default boundaries is the cache line) [0055] Thus, method 500 may include, at 545, selectively controlling a burst-mode protocol employed to write data to memory. Then, method 500 may include, at 550, writing the compressed block of data to memory as a sub-block(s) of data aligned on default alignment boundaries using the burst-mode protocol. Method 500 may also include, at 560, storing the size of the compressed block of data so that it may be acquired upon a read access targeted at the block of compressed data.)
Pool, Uhrenholt, Venkatesh, Sakthivel and Gaither are analogous art because they are from the same field of memory control. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Pool, Uhrenholt, Venkatesh, Sakthivel and Gaither before him or her to modify the Pool-Uhrenholt-Venkatesh-Sakthivel’s system with Gaither’s teaching. The motivation for doing so would be to have (Gaither [0023]) an efficient mechanism for reading and/or writing the cache line by utilizing the burst-mode boundaries.
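For illustration only, writing the non-zero value elements a single cache line at a time can be sketched as follows. This is a software analogy under stated assumptions, not Gaither's burst-mode protocol or the claimed cache controller: the function name write_by_cache_line, the 64-byte cache line size, and the list standing in for memory are hypothetical, and flushing a final partially filled line is an assumption of the sketch.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size; not taken from the references


def write_by_cache_line(non_zero: bytes, memory: list,
                        line_bytes: int = CACHE_LINE_BYTES) -> int:
    """Append `non_zero` to `memory` one cache line per access.

    `memory` is a plain list standing in for the memory system. Returns the
    number of accesses made; the last access may carry a partial line.
    """
    accesses = 0
    for start in range(0, len(non_zero), line_bytes):
        memory.append(non_zero[start:start + line_bytes])  # one memory access
        accesses += 1
    return accesses


# Hypothetical example: 150 non-zero bytes need three accesses (64 + 64 + 22).
memory = []
count = write_by_cache_line(bytes(range(1, 151)), memory)
```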
Regarding Claim 12, Pool, Uhrenholt, Venkatesh and Sakthivel teach
The system of Claim 11, retrieve the non-zero value elements (Pool Fig. 3 Decompressor [0041] As one example, it is helpful to include a data codec (compressor/decompressor) in the (e.g., L2) cache of the processing system to compress (in hardware) before saving to memory, and decompressing (again in hardware) upon retrieval from memory for further processing.)
Pool-Uhrenholt-Venkatesh-Sakthivel does not teach wherein the cache controller is further configured to access the memory to retrieve the non-zero value elements until the compressed chunk of data is fully decompressed.
However, Gaither teaches wherein the data decompression module is further configured to access the memory to retrieve the non-zero value elements until the compressed chunk of data is fully decompressed. (Gaither [0024] The RAM controller may then write the compressed data to memory, taking care to align the compressed data line on the same default boundaries as an un-compressed version of the cache line would have used. (i.e. default boundaries is the cache line) [0057] Once the size of the compressed data is known, method 500 may, at 585, selectively manipulate the burst-mode protocol between a memory controller and a memory in which the data is stored to facilitate efficiently retrieving the block of compressed data. Method 500 may also include, at 590, retrieving the block of compressed data from the memory using the burst-mode protocol as controlled, at least in part, by the compressed size.)
Pool, Uhrenholt, Venkatesh, Sakthivel and Gaither are analogous art because they are from the same field of memory control. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Pool, Uhrenholt, Venkatesh, Sakthivel and Gaither before him or her to modify the Pool-Uhrenholt-Venkatesh-Sakthivel’s system with Gaither’s teaching. The motivation for doing so would be to have (Gaither [0023]) an efficient mechanism for reading and/or writing the cache line by utilizing the burst-mode boundaries.
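The retrieval recited in Claim 12 can be pictured with a minimal sketch. It is illustrative only and is not taken from Pool, Gaither, or the instant application. The 64-byte line size, the function name `decompress_chunk`, and the `read_line` callback standing in for one memory access are all assumptions.

```python
CACHE_LINE_BYTES = 64  # assumed line size; not specified in the cited art


def decompress_chunk(bit_mask, length, read_line):
    """Rebuild a chunk of `length` byte elements from its bit mask.

    `read_line(k)` is a hypothetical callback returning the k-th line of
    packed non-zero elements; each call models one memory access.
    Returns (decompressed_chunk, number_of_memory_accesses).
    """
    needed = bin(bit_mask).count("1")  # how many non-zero elements to fetch
    packed = bytearray()
    accesses = 0
    # Keep accessing memory until every non-zero element has been retrieved.
    while len(packed) < needed:
        line = read_line(accesses)
        if not line:
            raise ValueError("memory ran out before the chunk was complete")
        packed += line
        accesses += 1
    out = bytearray(length)  # zero-valued elements are restored implicitly
    src = 0
    for i in range(length):
        if (bit_mask >> i) & 1:  # bit i set: element i was non-zero
            out[i] = packed[src]
            src += 1
    return bytes(out), accesses
```

In this sketch the number of memory accesses depends on the count of set bits in the mask, not on the uncompressed length of the chunk.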
Regarding Claim 15, Pool teaches
A method for performing data compression in a multi-core processor, the method comprising: retrieving a chunk of data from a data array of a cache slice, wherein the cache slice is comprised within a cache associated with the multi-core processor, (Pool [0003] This technology relates to deep learning, machine learning and artificial intelligence. More particularly, the technology herein relates to graphics processing unit (GPU) memory architectures including hardware-based sparse data compression/decompression (CODEC) capabilities for compressing and decompressing sparse data sets of the type use in/generated by deep neural networks (DNNs). [0129] The processing system 600 may be a multicore processing system comprising a GPU(s) and/or CPU(s), but is not so limited. [0041] As one example, it is helpful to include a data codec (compressor/decompressor) in the (e.g., L2) cache of the processing system to compress (in hardware) before saving to memory, and decompressing (again in hardware) upon retrieval from memory for further processing. [0066] The method includes receiving an uncompressed data set for compression 210. The received data set may be a neural network data set.) (i.e. Pool’s compressor performs the same function of instant application’s compression module.) 
and wherein the chunk of data is comprised within an activation map of a neural network, and wherein the activation map comprises intermediate results computed by a stage of the neural network; (Pool [0003] This technology relates to deep learning, machine learning and artificial intelligence, and to sparse data compressors and decompressors in such systems. [0123] … the data may be transformed by removing certain values that are not needed for an expected workload or by modifying the values that are represented by the data. For example, in deep learning a common operation is to throw away values that are below zero. This is a typical activation function used in convolutional networks called ReLU (a rectified linear unit). The result is that all the values that are negative are changed to zero. [0043] In deep learning applications, the data may have a high rate of sparsity (e.g., a preponderance of zeros or ones in binary data). The high rate of sparsity may be in either the data representing the deep learning network itself or the data flowing through the network.)
calculating a bit mask for the chunk of data, wherein the chunk of data comprises elements; using the bit mask, shifting out elements in the chunk of data corresponding to zero values, wherein non-zero value elements in the chunk of data are retained; and writing the bit mask and the non-zero value elements to a memory  (Pool [0043] One approach is to compress the data by removing essentially all the zeros or all the ones from the data, such that, when the data is fetched, less data needs to be transferred…A simple way to keep track of the removed zeros or ones is with a bitmask, where each bit in the bitmask represents an element in the data. [0041] … in the (e.g., L2) cache of the processing system to compress (in hardware) before saving to memory.) wherein the writing further comprises: accessing the memory to write of the non-zero value elements for each access (Pool [0041]  As one example, it is helpful to include a data codec (compressor/decompressor) in the (e.g., L2) cache of the processing system to compress (in hardware) before saving to memory.  [0043] One approach is to compress the data by removing essentially all the zeros or all the ones from the data…)
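The bit-mask limitation mapped above can be illustrated with a short sketch. It is a minimal example under stated assumptions, not code from Pool or from the instant application: elements are assumed to be single bytes, a set bit is assumed to mark a non-zero element, and the name `compress_chunk` is hypothetical.

```python
from typing import Tuple


def compress_chunk(chunk: bytes) -> Tuple[int, bytes]:
    """Return (bit_mask, non_zero_elements) for a chunk of byte elements."""
    bit_mask = 0
    non_zero = bytearray()
    for i, element in enumerate(chunk):
        if element != 0:
            bit_mask |= 1 << i        # bit i set: element i is non-zero
            non_zero.append(element)  # retained; zero elements are dropped
    return bit_mask, bytes(non_zero)


# Example: an 8-element chunk with three non-zero elements.
mask, packed = compress_chunk(bytes([0, 7, 0, 0, 3, 0, 0, 9]))
print(bin(mask), list(packed))  # prints: 0b10010010 [7, 3, 9]
```

Only the mask and the packed non-zero elements would need to be written to memory; the positions of the zero elements are recoverable from the mask.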
Pool teaches multicore processing, compressing the data in the cache before saving it to memory, and decompressing the compressed data upon retrieval from memory, but Pool does not teach that the cache is distributed between a plurality of cache slices, and wherein each core of the multi-core processor can access each of the plurality of cache slices. Pool teaches storing decompressed data in the cache after decompressing the compressed data retrieved from memory. A cache is known to have a tag array and a data array, the data array being where data values are stored, but Pool does not explicitly teach a data array operable to store decompressed data. Pool teaches that the compression module writes compressed non-zero value elements to memory but does not explicitly teach writing a single cache line.
However, Uhrenholt teaches the cache is distributed between a plurality of cache slices, and wherein each core of the multi-core processor can access each of the plurality of cache slices (Uhrenholt [0185] In both FIGS. 3 and 4, the L2 cache 24 is shown as being configured as respective separate physical cache portions (slices) 30.) and a data array operable to store decompressed data (Uhrenholt [0070] The data that is stored in the cache (and that the processor is using when performing a processing operation) can comprise any suitable and desired data that a data processor may operate on. The data in an embodiment comprises data of a data array that the processor is processing, the data array comprising an array of data elements each having an associated data value(s).[0088] In an embodiment the processing unit that makes the read request to the cache is a data encoder that is operable to compress data from the cache for writing back to the memory system. In this case, the data encoder (processing unit) is in an embodiment also operable to decompress compressed data from the memory system for writing into the cache in an uncompressed form.) write a single cache line (Uhrenholt [0065] In an embodiment, the data entries in the cache (the cache lines) can be, and are also, associated with an indication of whether the data entry (the cache line) should be processed by the processing unit or not…This indication may be set, for example, on cache line fill or write from lower level caches. [0065] Thus, in an embodiment, each cache line is associated with a “processing unit” flag (bit) to indicate whether that cache line (the data in that cache line) should be processed by the processing unit or not. This indication may be set, for example, on cache line fill or write from lower level caches. 
[0089] Thus, in an embodiment, the processing unit is a data encoder associated with the cache, that is configured to, when data is to be written from the cache to the memory system, encode uncompressed data from the cache for storing in the memory system in a compressed format and send the data in the compressed format to the memory system for storing, and when data in a compressed format is to be read from the memory system into the cache, decode the compressed data from the memory system and store the data in the cache in an uncompressed format.)
Pool and Uhrenholt are analogous art because they are from the same field of memory control. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Pool and Uhrenholt before him or her to modify the cache memory of Pool to have the plurality of cache slices and data array of Uhrenholt and the decompressor of Pool to write a single cache line at a time from the teaching of Uhrenholt. The motivation for doing so would be the enhanced performance of the decompressor for data processing in cache.
Pool teaches the ReLU function, which is one of the activation functions, and an activation map is the visual representation of the activation function, but Pool in view of Uhrenholt does not explicitly teach an activation map.
However, Venkatesh teaches activation map (Venkatesh [0040] In some embodiments, the convolutional layer is the core building block of a neural network 114 (e.g., configured as a CNN)… Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.)
Pool, Uhrenholt, and Venkatesh are analogous art because they are from the same field of data processing with Venkatesh specifically related to sparse data in a neural network. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Pool, Uhrenholt, and Venkatesh before him or her to modify the compressor/decompressor of Pool with the teaching of Uhrenholt and Venkatesh. The motivation for doing so would be the enhanced performance and capacity of the compressor/decompressor data processing in cache, and especially be able to handle the sparse data from the neural network.
Pool-Uhrenholt-Venkatesh does not teach that the writing comprises accessing the memory following a convolution operation of the neural network.
However, Sakthivel teaches following a convolution operation of the neural network (Sakthivel abst: In an example, an apparatus comprises a plurality of processing unit cores, a plurality of cache memory modules associated with the plurality of processing unit cores, and a machine learning model communicatively coupled to the plurality of processing unit cores, wherein the plurality of cache memory modules share cache coherency data with the machine learning model. [0131] a specialized cache eviction strategy may keep data of frames {N, N+1} in cache after computing a 3D convolution across frames [0133] When convolution is computed with a sliding window approach, tiles in frame N that have no more dependencies in future convolution operations can be evicted from the cache.) (i.e. memory access following convolution operation)
Pool, Uhrenholt, Venkatesh and Sakthivel are analogous art because they are from the same field of data processing with Venkatesh specifically related to sparse data in a neural network. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Pool, Uhrenholt, Venkatesh and Sakthivel before him or her to modify the Pool-Uhrenholt-Venkatesh system with the teaching of Sakthivel. The motivation for doing so would be (Sakthivel abst) cache memory modules share cache coherency data with the machine learning model.
Pool-Uhrenholt-Venkatesh-Sakthivel does not teach that the writing comprises accessing the memory following a convolution operation of the neural network to write a single cache line of the non-zero value elements for each access until the chunk of data is fully compressed.
However, Gaither teaches until the chunk of data is fully compressed. (Gaither [0024] The RAM controller may then write the compressed data to memory, taking care to align the compressed data line on the same default boundaries as an un-compressed version of the cache line would have used. (i.e. default boundaries is the cache line) [0055] Thus, method 500 may include, at 545, selectively controlling a burst-mode protocol employed to write data to memory. Then, method 500 may include, at 550, writing the compressed block of data to memory as a sub-block(s) of data aligned on default alignment boundaries using the burst-mode protocol. Method 500 may also include, at 560, storing the size of the compressed block of data so that it may be acquired upon a read access targeted at the block of compressed data.)
Pool, Uhrenholt, Venkatesh, Sakthivel and Gaither are analogous art because they are from the same field of memory control. Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teaching of Pool, Uhrenholt, Venkatesh, Sakthivel and Gaither before him or her to modify the Pool-Uhrenholt-Venkatesh-Sakthivel’s system with Gaither’s teaching. The motivation for doing so would be to have (Gaither [0023]) an efficient mechanism for reading and/or writing the cache line by utilizing the burst-mode boundaries.
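The write behavior at issue in Claim 15 (one cache line of non-zero value elements per memory access, repeated until the chunk is fully compressed) can be sketched as follows. This is an illustrative reading of the claim language, not an implementation drawn from any cited reference. The 64-byte line size, the name `write_compressed`, and the `write_line` callback modeling a single memory access are assumptions.

```python
CACHE_LINE_BYTES = 64  # assumed line size; not specified in the cited art


def write_compressed(chunk: bytes, write_line) -> int:
    """Compress `chunk` and emit non-zero elements one cache line at a time.

    `write_line(line)` is a hypothetical callback modeling one memory access.
    Returns the bit mask, which would be written to memory as well.
    """
    bit_mask = 0
    line = bytearray()
    for i, element in enumerate(chunk):
        if element == 0:
            continue                 # zero elements are shifted out
        bit_mask |= 1 << i           # record the position of the non-zero element
        line.append(element)
        if len(line) == CACHE_LINE_BYTES:
            write_line(bytes(line))  # the line is full: one memory access
            line.clear()
    if line:
        write_line(bytes(line))      # final, partially filled line
    return bit_mask
```

For a 200-element chunk holding 130 non-zero bytes, this sketch issues three writes of 64, 64, and 2 bytes.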
Regarding Claim 16, Pool, Uhrenholt, Venkatesh, Sakthivel and Gaither teach
The method of Claim 15, wherein the elements in the chunk of data comprise bytes of data.  
(Pool [0032] As one example, the subset to be analyzed can be a relatively small part of the input data set such as the first 32 bytes of a 256-byte data set)
Regarding Claim 17, Pool, Uhrenholt, Venkatesh, Sakthivel and Gaither teach
The method of Claim 15, wherein a length of the chunk of data is fixed, and further wherein the length is programmable. (Pool [0003] This technology also relates to data inspection for determining data type and/or compression/decompression unit size; [0067] The granularity may be selected from a plurality of possible granularities. The granularity may be selected based on the distribution of element values (e.g., zero valued bytes) in all or a subset of the received data set.) (i.e. the selection of granularity means that different granularities can be selected, and therefore the length of the chunk of data is programmable; once a granularity is selected, the length of the chunk of data is fixed)
Regarding Claim 19, Pool, Uhrenholt, Venkatesh, Sakthivel and Gaither teach
(Pool [0068] The compression method 200 compresses the received data set using the inferred granularity 230. In one example, the data set may be compressed by removing successive or singular (depending on the granularity) elements, with zero (or one) values, from the received data set and instead representing them with a compression mask representing each element in the received data set and indicating which elements in the received data set have zero (or one) values. The compressed data can be stored in memory and/or transmitted to a device requesting the compressed data 240.)
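The granularity-dependent compression mask described in the cited passage can be pictured with a short sketch. It is a minimal illustration under assumptions and is not Pool's implementation: an element is taken to be `granularity` consecutive bytes, an element counts as zero only when all of its bytes are zero, a set mask bit marks a retained element, and the name `compress_with_granularity` is hypothetical.

```python
from typing import Tuple


def compress_with_granularity(data: bytes, granularity: int) -> Tuple[int, bytes]:
    """Return (mask, kept_elements) with one mask bit per element."""
    if granularity <= 0 or len(data) % granularity != 0:
        raise ValueError("data length must be a positive multiple of granularity")
    mask = 0
    kept = bytearray()
    for idx in range(len(data) // granularity):
        element = data[idx * granularity:(idx + 1) * granularity]
        if any(element):      # at least one non-zero byte in the element
            mask |= 1 << idx  # bit idx set: element idx is retained
            kept += element
    return mask, bytes(kept)


data = bytes([0, 0, 5, 0, 0, 0, 0, 1])
print(compress_with_granularity(data, 1))  # prints: (132, b'\x05\x01')
print(compress_with_granularity(data, 2))  # prints: (10, b'\x05\x00\x00\x01')
```

The same input yields a different mask and payload for each granularity, which is why the granularity must be fixed before a chunk is compressed.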
Response to Arguments
The 112 claim rejections of claims 1, 3, 5-7, 8-9, 11-14 have been withdrawn in light of the cancellation of claims 7 and 14 and the instant amendment to claims 1, 5, 8, 11-12. The Examiner respectfully notes that one “data compression module” was left unamended in Claim 1, while all other instances of “data compression module” and “data decompression module” were amended to “cache controller”. It appears that the Applicant missed one instance when amending Claim 1; Claim 1 is therefore objected to.
On pages 10-12, the applicant argues:
“Applicants respectfully submit that Pool does not teach or suggest the claimed elements "retrieve a chunk of data from the data array, wherein the chunk of data is comprised within an activation map of a neural network, and wherein the activation map comprises intermediate results computed by a stage of the neural network" and "write the bit mask and the non-zero value elements of the chunk of data to a memory, wherein to perform the write of the non-zero value elements, the memory is accessed each time a cache line is filled with the non-zero value elements following a convolution operation of the neural network." 
Applicants acknowledge that Pool discloses that the technology disclosed therein relates to deep learning and machine learning and that a typical activation function used in convolutional networks is called ReLU (a rectified linear unit). However, the disclosure regarding the ReLU activation function does not automatically teach or suggest the claimed elements "retrieve a chunk of data from the data array, wherein the chunk of data is comprised within an activation map of a neural network, and wherein the activation map comprises intermediate results computed by a stage of the neural network." For example the disclosure regarding the activation function does not teach that the "chunk of data" is comprised within the activation map and that the activation map comprises "intermediate results" computed by a stage of the neural network, as claimed. 
Further, Applicants respectfully submit that Pool does not teach or suggest the claimed elements "write the bit mask and the non-zero value elements of the chunk of data to a memory, wherein to perform the write of the non-zero value elements, the memory is accessed each time a cache line is filled with the non-zero value elements following a convolution operation of the neural network." There is no disclosure in Pool regarding accessing the memory following a convolutional operation of the neural network, as claimed.
Applicants respectfully submit that Uhrenholt does not overcome the shortcomings of Pool. Uhrenholt, for example, also does not teach or suggest the claimed elements "retrieve a chunk of data from the data array, wherein the chunk of data is comprised within an activation map of a neural network, and wherein the activation map comprises intermediate results computed by a stage of the neural network" and "write the bit mask and the non-zero value elements of the chunk of data to a memory, wherein to perform the write of the non-zero value elements, the memory is accessed each time a cache line is filled with the non-zero value elements following a convolution operation of the neural network."
For these reasons, Applicants respectfully submit that independent Claim 1 is not rendered unpatenable by Pool in view of Uhrenholt. Since independent Claims 8 and 15 recite elements similar to those discussed above with respect to independent Claim 1, Applicants respectfully submit that independent Claims 8 and 15 are also not rendered unpatenable by Pool in view of Uhrenholt. 

Applicant’s arguments, see above, filed on 07/08/2022, regarding independent claims 1, 8 and 15 have been fully considered but are moot. The Examiner respectfully notes the new grounds of rejection that were necessitated by Applicant’s amendments to the claims.
On pages 12-14, the applicant argues:
 “Dependent claims recite further elements of the invention claimed in their respective independent Claims, Applicants respectfully submit that the dependent claims are also not rendered unpatenable by Pool in view of Uhrenholt. Thus, Applicants respectfully submit that Claims 1-6, 8-9, 11 and 13 overcome the 35 U.S.C. §10 rejection of record, and therefore, are allowable.
Applicants respectfully submit that Claims 5, 12, 15-17 and 19 are not rendered obvious by Pool in view of Uhrenholt and further in view of Gaither. Thus, Applicants respectfully submit that Claims 5, 12, 15-17 and 19 overcome the 35 U.S.C. §103 rejection of record, and therefore, are allowable.
Claims 7 and 14 recite further elements of the invention claimed in independent Claim 1, Applicants respectfully submit that Claims 7, 14 and 20 are not rendered obvious by Pool in view of Uhrenholt and further in view of Venkatesh. Thus, Applicants respectfully submit that Claims 7 and 14 overcome the 35 U.S.C. §103 rejection of record, and therefore, are allowable.”
The Examiner respectfully notes the new grounds of rejection that were necessitated by Applicant’s amendments to the claims.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
GEORGIADIS; Georgios – US-20200143226-A1 - Lossy compression of neural network activation maps

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WEI MA whose telephone number is (571)272-2468. The examiner can normally be reached Monday through Friday from 8am to 5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Sanjiv Shah can be reached on 571-272-4098. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/WEI MA/Examiner, Art Unit 2135
/YAIMA RIGOL/Primary Examiner, Art Unit 2135