DETAILED ACTION
This action is in response to the claims filed 11/30/2021. Claims 26-50 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 26-28, 32-34, 35-37, 41-43, and 44-49 are rejected under 35 U.S.C. 103 as being unpatentable over Woo et al. (US 10019668 B1, hereinafter Woo) in view of Chen et al., “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks” (hereinafter Chen).

Regarding Claim 26
	Woo teaches, An apparatus comprising: memory; and memory management logic circuitry coupled with the memory, (Abstract “The subject matter described in this specification includes systems and methods… determining a partitioning of the neural network layers into a sequence of superlayers. Each superlayer is a partition of the directed graph that includes one or more layers… The method includes processing the batch of inputs using the hardware circuit [logic circuit for determining]” Examiner notes that partitioning a neural network into a sequence of superlayers corresponds to stages of a cascaded neural network.) accelerator circuitry arranged to execute a cascade neural network (CNN) using the memory to store data to compute an inference with the CNN (Col 10 line 30-35 “a neural network of a machine learning system can include an accelerator architecture that does not impose unnecessary constraints on a minimum or maximum batch size that can be supported by storage units 204 of the hardware circuit's on-chip memory”)
	Woo does not appear to explicitly teach, the memory logic circuitry to:… for two or more stages of a cascaded neural network; determine… a count of concurrent, shared memory block allocations; and determine a size for each of the shared memory block allocations of the count of concurrent, shared memory block allocations, allocate portions of the memory based on the count of concurrent shared memory block allocation and the size for each of the shared memory block allocations to accommodate the data in the memory during execution of the CNN
	Chen, however, when addressing issues related to allocating memory resources based on a count and size of block allocations teaches, the memory logic circuitry to:… for two or more stages of a cascaded neural network (pg 2 ¶01 “A spatial architecture based on a new CNN [deep convolutional neural network] dataflow, called row stationary, which is optimized for throughput and energy efficiency. It works on both convolutional and fully-connected layers, and optimizes all types of data movement in the storage hierarchy” the CNN corresponds to a multilayers network consisting of more than 1 stage or layer) , determine… a count of concurrent, shared memory block allocations; ( pg 3 Data Handling ¶02 “Due to the weight sharing property in CONV layers, a small amount of unique input data can be shared across many operations. Each filter weight is reused E^2 times” pg 4 Section IV ¶02 “Once a weight is fetched from DRAM to the RF of a PE, the PE runs through all NE^2 operations that use the same filter weight” a weight from a block of DRAM memory is shared concurrently for each of the NE^2 operations, the number of operations corresponding to the count.) 
and determine a size for each of the shared memory block allocations of the count of concurrent, shared memory block allocations, (pg 7 Section VI Weight Stationary: “Each PE holds a single weight in the RF at a time” pg 7 Setup for Dataflow Comparison ¶01-¶02 “all dataflows are given the same number of PEs with the same storage area, which includes the global buffer and RF… In our simulations, a baseline storage area for a given number of PEs is calculated as… the baseline storage area for all dataflows is calculated from the setup with 512B RF/PE and an 128kB global buffer” the size available for the weight in each PE, which corresponds to the shared memory block allocations, is dependent on the storage for each RF and the global buffer size) allocate portions of the memory based on the count of concurrent shared memory block allocation and the size for each of the shared memory block allocations to accommodate the data in the memory during execution of the CNN (pg 6 ¶01 “The exact amount of logical PE sets to fold and to map spatially at each of the three dimensions, i.e., N, M, and C, are determined by the RF [register files] size and physical PE array size [size for each of the block allocations], respectively. It then becomes an optimization problem to determine the best folding by using the framework in Section VI-C to evaluate the results” the folding scheme, allocates the PEs with their associated “concurrent shared memory weight” to RF, or register files, based on the total number of PEs and their memory size, and then the RFs are used for the execution of the CNN, as shown in the results Section VI-C ) 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method/system for profiling memory block allocation using a memory management architecture, as taught by Chen, into the disclosed invention of Woo.
One of ordinary skill in the art would have been motivated to make this modification in order to implement “[a framework] that minimizes energy consumption by maximizing input data reuse (filters and feature maps) and minimizing partial sum accumulation cost simultaneously, and by accounting for the energy cost of different storage levels” (Chen, Conclusion).
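For illustration only, the count and size quantities relied upon from Chen can be sketched as follows. This is a minimal sketch under stated assumptions, not Chen's implementation: the function names, the 256-PE array, and the example values N=16 and E=55 are hypothetical, and N and E are assumed here to denote the batch size and the output feature map width/height. The 512 B RF per PE and the 128 kB global buffer are taken from the quoted setup; the byte total is a simplification of Chen's storage-area calculation with the global buffer held fixed.

```python
def weight_reuse_count(n: int, e: int) -> int:
    """Number of operations that share one filter weight: N * E^2,
    following the quoted passage (N and E are assumed meanings)."""
    return n * e * e


def baseline_storage_bytes(num_pes: int,
                           rf_bytes_per_pe: int = 512,
                           global_buffer_bytes: int = 128 * 1024) -> int:
    """Simplified byte capacity: one RF per PE plus a fixed global buffer.
    This is an assumption for illustration, not Chen's area formula."""
    return num_pes * rf_bytes_per_pe + global_buffer_bytes


# Hypothetical example values (not from the quoted passages).
count = weight_reuse_count(n=16, e=55)
storage = baseline_storage_bytes(num_pes=256)
print(count, storage)
```

In this reading, `count` stands in for the claimed count of concurrent shared uses of a weight, and `storage` for the capacity that bounds the size available to each allocation.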

Regarding Claim 27
	Woo/Chen teaches Claim 26
	Further Woo teaches, wherein the logic circuitry is configured to generate a list of the block allocations and size information for each of the block allocations. (Col 15 line 5-9 “referencing FIG. 5, circuit 100 can determine that: i) a set of parameters for layer A requires 25 MB of memory; ii) a set of parameters for layer B requires 125 MB of memory; and iii) a set of parameters for layer C requires 50 MB of memory.” Examiner notes that the determination for each of the 3 layers corresponds to a list of block allocations with their corresponding size information as denoted by the required memory)
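For illustration only, the per-layer determination quoted from Woo can be written as a list of allocations with size information. This is a minimal sketch: the `BlockAllocation` type and variable names are hypothetical, and only the three sizes (25 MB, 125 MB, 50 MB) come from the quoted passage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockAllocation:
    layer: str      # layer label from the quoted example
    size_mb: int    # memory required for the layer's parameters, in MB


# Sizes as quoted from Woo (Col 15 line 5-9); the structure is an assumption.
allocations = [
    BlockAllocation("A", 25),
    BlockAllocation("B", 125),
    BlockAllocation("C", 50),
]

total_mb = sum(a.size_mb for a in allocations)
largest = max(allocations, key=lambda a: a.size_mb)
print(len(allocations), total_mb, largest.layer)
```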
	

Regarding Claim 28
	Woo/Chen teaches Claim 26
	Further Woo teaches, wherein the logic circuitry is configured to determine a batch size per stage of the two or more stages of the cascaded neural network. ( “maximum batch size that can be supported by on-chip memory resources can be determined based on a size of a working set. In particular, the maximum batch size supported by storage units 204 can be determined based, in part, on the largest working set of inputs and parameters that are processed by a given neural network layer.” Examiner notes that the maximum batch size is determined for the set of layers that represent a stage or superlayer. This determination is made on each superlayer or stage in order to efficiently utilize available memory)

Regarding Claim 32
	Woo/Chen teaches Claim 26
	Further Woo teaches, wherein the logic circuitry is configured to determine a count of shared memory block allocations for tensor data to store in the shared memory and size information for the tensor data. (Col 6 line 7-10 “Circuit 100 can be an example compute unit or compute tile and can include additional hardware structures to perform computations associated with multi-dimensional data structures such as tensors, matrices and/or data arrays” Examiner notes that the compute units associated with the storage units and blocks addressed in the prior rejections are associated with tensor data.)

Regarding Claim 33
	Woo/Chen teaches Claim 26
	Further Woo teaches, wherein the logic circuitry is configured to compare counts of …memory block allocations of the cascaded neural network to determine a maximum count of the counts of the shared memory block allocations. (Col 9 line 60-62 “a total on-chip storage capacity associated with memory 102 and 104 may be limited to 20 storage units [maximum count]” Col 9-10 line 63—3 “because a working set of two batch elements processed by layer B requires 16 storage units 204, processing of a third batch element would require 24 units of storage unit 204 and, thus, exceed the 20 storage unit capacity [fixed amount of shared memory]. So, in this example, a neural network may only support a particular maximum working set size that includes two batch elements, when processing each batch element requires at least 8 units of storage.” Examiner notes that in the particular example when the counts of various batch sizes are compared the maximum count or maximum working set size is 8 units of storage per batch, wherein the batch number is restricted to 2 so that the total maximum counts is 16.)
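For illustration only, the storage-unit arithmetic in the quoted Woo example can be sketched as follows. The function name is hypothetical; the capacity of 20 storage units and the 8 units per batch element are taken from the quoted passage.

```python
def max_batch_elements(capacity_units: int, units_per_element: int) -> int:
    """Largest number of batch elements whose working set fits in capacity."""
    if units_per_element <= 0:
        raise ValueError("units_per_element must be positive")
    return capacity_units // units_per_element


capacity = 20          # total on-chip storage units (quoted example)
per_element = 8        # units required per batch element (quoted example)

batch = max_batch_elements(capacity, per_element)
used = batch * per_element             # units occupied at the maximum batch
next_needed = (batch + 1) * per_element  # units a further element would need
print(batch, used, next_needed)
```

With the quoted values, two batch elements occupy 16 units, and a third would require 24 units, which exceeds the 20-unit capacity.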

Regarding Claim 34
	Woo/Chen teaches Claim 26
	Further Woo teaches, wherein the logic circuitry is configured to calculate a batch size based on a ratio of an input data size of a first stage of the two or more stages to an input data size of another stage of the two or more stages. (“Circuit 100 can determine total on-chip memory usage based on an equation 1 [Total usage=(working set*N)+parameters] where a variable N of equation 1 is a batch size” Examiner notes that the “total usage” is representative of the data size, including input data, of a plurality of layers. The “working set” corresponds to the input data size of a first stage. The equation can be reformulated as $\frac{\text{TotalUsage}}{\text{workingSet}} = \text{Batch size}$. Thus the batch size is based on the ratio.)
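For illustration only, equation 1 as quoted from Woo can be rearranged for the batch size N. The symbols T (total usage), W (working set), and P (parameters) are shorthand introduced here and are not Woo's notation; the last line is an assumption that the parameter term is small relative to total usage.

```latex
\[
\begin{aligned}
T &= W \cdot N + P \\
N &= \frac{T - P}{W} \\
N &\approx \frac{T}{W} \quad \text{when } P \ll T
\end{aligned}
\]
```

Under this reading, the batch size follows from the ratio of total usage to working set once the parameter term is accounted for.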

Regarding Claim 35
	Woo teaches, A computing implemented method comprising:…a memory management logic circuitry, for two or more stages of a cascaded neural network, (Abstract “The subject matter described in this specification includes systems and methods… determining a partitioning of the neural network layers into a sequence of superlayers. Each superlayer is a partition of the directed graph that includes one or more layers… The method includes processing the batch of inputs using the hardware circuit [logic circuit for determining]” Examiner notes that partitioning a neural network into a sequence of superlayers corresponds to stages of a cascaded neural network.)
	Woo does not appear to explicitly teach, determining… a count of concurrent, shared memory block allocations; determining, by the memory management logic circuitry, a size for each of the shared memory block allocations of the count of concurrent, shared memory block allocations, allocating, by the memory management logic circuitry, portions of the memory, based on the count of concurrent shared memory block allocation and the size for each of the shared memory block allocations, to accommodate the data in the memory during execution of the CNN by accelerator circuitry to compute an inference.
	Chen, however, when addressing issues related to allocating memory resources based on a count and size of block allocations teaches, determining… a count of concurrent, shared memory block allocations ( pg 3 Data Handling ¶02 “Due to the weight sharing property in CONV layers, a small amount of unique input data can be shared across many operations. Each filter weight is reused E^2 times” pg 4 Section IV ¶02 “Once a weight is fetched from DRAM to the RF of a PE, the PE runs through all NE^2 operations that use the same filter weight” a weight from a block of DRAM memory is shared concurrently for each of the NE^2 operations, the number of operations corresponding to the count.) determining, by the memory management logic circuitry, a size for each of the shared memory block allocations of the count of concurrent, shared memory block allocations (pg 7 Section VI Weight Stationary: “Each PE holds a single weight in the RF at a time” pg 7 Setup for Dataflow Comparison ¶01-¶02 “all dataflows are given the same number of PEs with the same storage area, which includes the global buffer and RF… In our simulations, a baseline storage area for a given number of PEs is calculated as… the baseline storage area for all dataflows is calculated from the setup with 512B RF/PE and an 128kB global buffer” the size available for the weight in each PE, which corresponds to the shared memory block allocations, is dependent on the storage for each RF and the global buffer size) allocating, by the memory management logic circuitry, portions of the memory, based on the count of concurrent shared memory block allocation and the size for each of the shared memory block allocations, to accommodate the data in the memory during execution of the CNN by accelerator circuitry to compute an inference. 
(pg 6 ¶01 “The exact amount of logical PE sets to fold and to map spatially at each of the three dimensions, i.e., N, M, and C, are determined by the RF [register files] size and physical PE array size [size for each of the block allocations], respectively. It then becomes an optimization problem to determine the best folding by using the framework in Section VI-C to evaluate the results” the folding scheme, allocates the PEs with their associated “concurrent shared memory weight” to RF, or register files, based on the total number of PEs and their memory size, and then the RFs are used for the execution of the CNN, as shown in the results Section VI-C ) 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method/system for profiling memory block allocation using a memory management architecture, as taught by Chen, into the disclosed invention of Woo.
One of ordinary skill in the art would have been motivated to make this modification in order to implement “[a framework] that minimizes energy consumption by maximizing input data reuse (filters and feature maps) and minimizing partial sum accumulation cost simultaneously, and by accounting for the energy cost of different storage levels” (Chen, Conclusion).

Regarding Claim 36
	Woo/Chen teaches Claim 35
	Further Woo teaches, generating, by the memory management logic circuitry, a list of the block allocations and size information for each of the block allocations. (Col 15 line 5-9 “referencing FIG. 5, circuit 100 can determine that: i) a set of parameters for layer A requires 25 MB of memory; ii) a set of parameters for layer B requires 125 MB of memory; and iii) a set of parameters for layer C requires 50 MB of memory.” Examiner notes that the determination for each of the 3 layers corresponds to a list of block allocations with their corresponding size information as denoted by the required memory)

Regarding Claim 37
	Woo/Chen teaches Claim 35
	Further Woo teaches, determining a batch size per stage of the two or more stages of the cascaded neural network. (“maximum batch size that can be supported by on-chip memory resources can be determined based on a size of a working set. In particular, the maximum batch size supported by storage units 204 can be determined based, in part, on the largest working set of inputs and parameters that are processed by a given neural network layer.” Examiner notes that the maximum batch size is determined for the set of layers that represent a stage or superlayer. This determination is made on each superlayer or stage in order to efficiently utilize available memory)


Regarding Claim 41
	Woo/Chen teaches Claim 35
	Further Woo teaches, wherein determining a count of the concurrent, shared memory block allocations comprises determining the count of shared memory block allocations for tensor data to store in the shared memory and size information for the tensor data.  (Col 6 line 7-10 “Circuit 100 can be an example compute unit or compute tile and can include additional hardware structures to perform computations associated with multi-dimensional data structures such as tensors, matrices and/or data arrays” Examiner notes that the compute units associated with the storage units and memory blocks addressed in the prior rejections are associated with tensor data.)

Regarding Claim 42
	Woo/Chen teaches Claim 41
	Further Woo teaches, wherein determining a size for each of the shared memory block allocations comprises comparing counts of the concurrent, shared memory block allocations of the cascaded neural network to determine a maximum count of the counts of the concurrent, shared memory block allocations. (Col 9 line 60-62 “a total on-chip storage capacity associated with memory 102 and 104 may be limited to 20 storage units [maximum count]” Col 9-10 line 63—3 “because a working set of two batch elements processed by layer B requires 16 storage units 204, processing of a third batch element would require 24 units of storage unit 204 and, thus, exceed the 20 storage unit capacity [fixed amount of shared memory]. So, in this example, a neural network may only support a particular maximum working set size that includes two batch elements, when processing each batch element requires at least 8 units of storage.” Examiner notes that in the particular example when the counts of various batch sizes are compared the maximum count or maximum working set size is 8 units of storage per batch, wherein the batch number is determined to be at most 2 and the maximum count is determined to be 16)

Regarding Claim 43
	Woo/Chen teaches Claim 42
	Further Woo teaches, wherein the logic circuitry is configured to calculate a batch size based on a ratio of an input data size of a first stage of the two or more stages to an input data size of another stage of the two or more stages. (“Circuit 100 can determine total on-chip memory usage based on an equation 1 [Total usage=(working set*N)+parameters] where a variable N of equation 1 is a batch size” Examiner notes that the “total usage” is representative of the data size, including input data, of a plurality of layers. The “working set” corresponds to the input data size of a first stage. The equation can be reformulated as $\frac{\text{TotalUsage}}{\text{workingSet}} = \text{Batch size}$. Thus the batch size is based on the ratio.)

Regarding Claim 44
Woo teaches, A system to manage memory resources, the system comprising: a communications interface coupled to memory; and a processing component (Abstract “The subject matter described in this specification includes systems and methods… determining a partitioning of the neural network layers into a sequence of superlayers. Each superlayer is a partition of the directed graph that includes one or more layers… The method includes processing the batch of inputs using the hardware circuit [communications interface and a processing component]” Examiner notes that partitioning a neural network into a sequence of superlayers corresponds to stages of a cascaded neural network. Examiner notes that the hardware circuit that processes the batches and communicates the results to the rest of the system corresponds to the claimed hardware for managing memory resources) wherein the memory comprises dynamic random-access memory. (Col 8-9 line 65-1 “This threshold storage capacity may be less than, or substantially less than, a storage capacity of a dynamic random access memory (DRAM) resource that is associated with off-chip memory of circuit 100.” Examiner notes that the DRAM is associated with off-chip memory of circuit 100, wherein the circuit is part of the processing component.)
	Woo does not appear to explicitly teach, a processing component to determine a count and a size of concurrent, shared memory block allocations in multiple stages of a cascaded neural network, and to allocate at least a portion of the memory, based on the count of concurrent shared memory block allocation and the size for each of the shared memory block allocations, to accommodate data-in the memory for execution of a cascaded neural network for performance of inference computations;
	Chen, however, when addressing issues related to allocating memory resources based on a count and size of block allocations teaches, a processing component… in multiple stages of a cascaded neural network (pg 2 ¶01 “A spatial architecture based on a new CNN [deep convolutional neural network] dataflow, called row stationary, which is optimized for throughput and energy efficiency. It works on both convolutional and fully-connected layers, and optimizes all types of data movement in the storage hierarchy” the CNN corresponds to a multilayers network consisting of more than 1 stage or layer) determine a count and a size of concurrent, shared memory block allocations ( pg 3 Data Handling ¶02 “Due to the weight sharing property in CONV layers, a small amount of unique input data can be shared across many operations. Each filter weight is reused E^2 times” pg 4 Section IV ¶02 “Once a weight is fetched from DRAM to the RF of a PE, the PE runs through all NE^2 operations that use the same filter weight” a weight from a block of DRAM memory is shared concurrently for each of the NE^2 operations, the number of operations corresponding to the count. 
pg 7 Section VI Weight Stationary: “Each PE holds a single weight in the RF at a time” pg 7 Setup for Dataflow Comparison ¶01-¶02 “all dataflows are given the same number of PEs with the same storage area, which includes the global buffer and RF… In our simulations, a baseline storage area for a given number of PEs is calculated as… the baseline storage area for all dataflows is calculated from the setup with 512B RF/PE and an 128kB global buffer” the size available for the weight in each PE, which corresponds to the shared memory block allocations, is dependent on the storage for each RF and the global buffer size) and to allocate at least a portion of the memory, based on the count of concurrent shared memory block allocation and the size for each of the shared memory block allocations, to accommodate data-in the memory for execution of a cascaded neural network for performance of inference computations; (pg 6 ¶01 “The exact amount of logical PE sets to fold and to map spatially at each of the three dimensions, i.e., N, M, and C, are determined by the RF [register files] size and physical PE array size [size for each of the block allocations], respectively. It then becomes an optimization problem to determine the best folding by using the framework in Section VI-C to evaluate the results” pg 2 Section II ¶05 “Overall, the system provides four levels of storage hierarchy for data accesses, including DRAM, global buffer, array (inter-PE communication) and RF” the folding scheme, allocates the PEs with their associated “concurrent shared memory weight” to RF, or register files, based on the total number of PEs and their memory size. The allocation of the memory accommodates the data in the off-chip DRAM to be used by the RFs and PEs for the execution of the CNN, as shown in the results Section VI-C and Figure 1) 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method/system for profiling memory block allocation using a memory management architecture, as taught by Chen, into the disclosed invention of Woo.
One of ordinary skill in the art would have been motivated to make this modification in order to implement “[a framework] that minimizes energy consumption by maximizing input data reuse (filters and feature maps) and minimizing partial sum accumulation cost simultaneously, and by accounting for the energy cost of different storage levels” (Chen, Conclusion).

Regarding Claim 45
	Woo/Chen teaches Claim 44
	Further Woo teaches, wherein the processing component is configured to generate a list of the block allocations and size information for each of the block allocations. (Col 15 line 5-9 “referencing FIG. 5, circuit 100 can determine that: i) a set of parameters for layer A requires 25 MB of memory; ii) a set of parameters for layer B requires 125 MB of memory; and iii) a set of parameters for layer C requires 50 MB of memory.” Examiner notes that the determination for each of the 3 layers corresponds to a list of block allocations with their corresponding size information as denoted by the required memory)
	

Regarding Claim 46
	Woo/Chen teaches Claim 44
	Further Woo teaches, wherein the processing component is configured to determine a batch size per stage of the multiple stages of the cascaded neural network. (“maximum batch size that can be supported by on-chip memory resources can be determined based on a size of a working set. In particular, the maximum batch size supported by storage units 204 can be determined based, in part, on the largest working set of inputs and parameters that are processed by a given neural network layer.” Examiner notes that the maximum batch size is determined for the set of layers that represent a stage or superlayer. This determination is made on each superlayer or stage in order to efficiently utilize available memory)


Regarding Claim 47
Woo teaches, A non-transitory machine-readable medium containing instructions, (Col 15-16 line 67-4 “one or more modules of computer program instructions encoded on a tangible non transitory program carrier for execution by, or to control the operation of, data processing apparatus”) which when executed by a processor coupled to memory, cause the processor to perform operations, the operations comprising: determining, by a memory management logic circuitry, for two or more stages of a cascaded neural network, (Abstract “The subject matter described in this specification includes systems and methods… determining a partitioning of the neural network layers into a sequence of superlayers. Each superlayer is a partition of the directed graph that includes one or more layers… The method includes processing the batch of inputs using the hardware circuit [processor]” Examiner notes that partitioning a neural network into a sequence of superlayers corresponds to stages of a cascaded neural network.)
	Woo does not appear to explicitly teach, determining, by a memory management logic circuitry, for two or more stages of a cascaded neural network, a count of concurrent, shared memory block allocations; and determining, by the memory management logic circuitry, a size for each of the shared memory block allocations of the count of concurrent, shared memory block allocations; and allocating a portion of the memory, based on the count of concurrent shared memory block allocation and the size for each of the shared memory block allocations, to accommodate data in the memory for execution of a cascaded neural network for inference computations.
	Chen, however, when addressing issues related to allocating memory resources based on a count and size of block allocations teaches, by a memory management logic circuitry, for two or more stages of a cascaded neural network (pg 2 ¶01 “A spatial architecture based on a new CNN [deep convolutional neural network] dataflow, called row stationary, which is optimized for throughput and energy efficiency. It works on both convolutional and fully-connected layers, and optimizes all types of data movement in the storage hierarchy” the CNN corresponds to a multilayers network consisting of more than 1 stage or layer) determining, by the memory management logic circuitry… a count of concurrent, shared memory block allocations; ( pg 3 Data Handling ¶02 “Due to the weight sharing property in CONV layers, a small amount of unique input data can be shared across many operations. Each filter weight is reused E^2 times” pg 4 Section IV ¶02 “Once a weight is fetched from DRAM to the RF of a PE, the PE runs through all NE^2 operations that use the same filter weight” a weight from a block of DRAM memory is shared concurrently for each of the NE^2 operations, the number of operations corresponding to the count.) 
and determining, by the memory management logic circuitry, a size for each of the shared memory block allocations of the count of concurrent, shared memory block allocations (pg 7 Section VI Weight Stationary: “Each PE holds a single weight in the RF at a time” pg 7 Setup for Dataflow Comparison ¶01-¶02 “all dataflows are given the same number of PEs with the same storage area, which includes the global buffer and RF… In our simulations, a baseline storage area for a given number of PEs is calculated as… the baseline storage area for all dataflows is calculated from the setup with 512B RF/PE and an 128kB global buffer” the size available for the weight in each PE, which corresponds to the shared memory block allocations, is dependent on the storage for each RF and the global buffer size) and allocating a portion of the memory, based on the count of concurrent shared memory block allocation and the size for each of the shared memory block allocations, to accommodate data in the memory for execution of a cascaded neural network for inference computations. (pg 6 ¶01 “The exact amount of logical PE sets to fold and to map spatially at each of the three dimensions, i.e., N, M, and C, are determined by the RF [register files] size and physical PE array size [size for each of the block allocations], respectively. It then becomes an optimization problem to determine the best folding by using the framework in Section VI-C to evaluate the results” pg 2 Section II ¶05 “Overall, the system provides four levels of storage hierarchy for data accesses, including DRAM, global buffer, array (inter-PE communication) and RF” the folding scheme, allocates the PEs with their associated “concurrent shared memory weight” to RF, or register files, based on the total number of PEs and their memory size. The allocation of the memory accommodates the data in the off-chip DRAM to be used by the RFs and PEs for the execution of the CNN, as shown in the results Section VI-C and Figure 1) 
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate a method/system for profiling memory block allocation using a memory management architecture as taught by Chen into the disclosed invention of Woo.
One of ordinary skill in the arts would have been motivated to make this modification in order to implement “[a framework] that minimizes energy consumption by maximizing input data reuse (filters and feature maps) and minimizing partial sum accumulation cost simultaneously, and by accounting for the energy cost of different storage levels” (Conclusion Chen).

Regarding Claim 48
	Woo/Chen teaches Claim 47
	Further Woo teaches, wherein the operations further comprise generating, by the memory management logic circuitry, a list of the block allocations and size information for each of the block allocations. (Col 15 line 5-9 “referencing FIG. 5, circuit 100 can determine that: i) a set of parameters for layer A requires 25 MB of memory; ii) a set of parameters for layer B requires 125 MB of memory; and iii) a set of parameters for layer C requires 50 MB of memory.” Examiner notes that the determination for each of the 3 layers corresponds to a list of block allocations with their corresponding size information as denoted by the required memory)
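For illustration only, the per-layer determination quoted from Woo can be written out as a list of allocations paired with their sizes. This is a minimal sketch: the names `BlockAllocation` and `list_allocations` are hypothetical and appear in neither Woo nor the claims; only the three sizes (25 MB, 125 MB, 50 MB) come from the quoted passage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockAllocation:
    """One entry of the list: which layer, and how much memory it needs."""
    layer: str
    size_mb: int


def list_allocations(param_sizes_mb):
    """Build the list of allocations with size information (hypothetical helper)."""
    return [BlockAllocation(layer, size) for layer, size in param_sizes_mb.items()]


# Sizes taken from the Woo passage quoted above (layers A, B, C of FIG. 5).
woo_fig5_params_mb = {"A": 25, "B": 125, "C": 50}

allocations = list_allocations(woo_fig5_params_mb)
total_mb = sum(a.size_mb for a in allocations)  # 25 + 125 + 50
```

Each element of `allocations` carries both the allocation identity and its size, which is the sense in which the three determinations are read as a list with size information.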
	

Regarding Claim 49
	Woo/Chen teaches Claim 47
	Further Woo teaches, wherein the operations further comprise determining a batch size per stage of the two or more stages of the cascaded neural network. (“maximum batch size that can be supported by on-chip memory resources can be determined based on a size of a working set. In particular, the maximum batch size supported by storage units 204 can be determined based, in part, on the largest working set of inputs and parameters that are processed by a given neural network layer.” Examiner notes that the maximum batch size is determined for the set of layers that represent a stage or superlayer. This determination is made on each superlayer or stage in order to efficiently utilize available memory)


Claims 29-31, 38-40, and 50 are rejected under 35 U.S.C. 103 as being unpatentable over Woo/Chen, further in view of Shen et al. “Maximizing CNN Accelerator Efficiency Through Resource Partitioning” hereinafter Shen. 

Regarding Claim 29
	Woo/Chen teaches Claim 26
	Woo/Chen does not explicitly teach, wherein the logic circuitry is configured to determine a count of inputs for a stage of the two or more stages that lack interdependencies.
	Shen however, when addressing issues related to distributing computations to dedicated processing elements teaches, wherein the logic circuitry is configured to determine a count of inputs for a stage of the two or more stages that lack interdependencies. (pg 4 Section 4.1 ¶04 and Figure 5 “In each epoch, each CLP [logic circuitry] only consumes data generated during the previous epoch, avoiding data dependencies within a epoch. For example, the output produced by L1 in epoch i will be used as input for L2 in epoch i+1. This means that processing an image requires five epochs, therefore data from five different images will be in flight at a time” Section 4.1 ¶01 “An accelerator for an L-stage CNN would have L CLPs and would operate on L independent input images.” As shown in Figure 5, in each epoch independent inputs are distributed among the available processors: CLP0 processes the input for L1, while CLP2 processes an output produced in a previous epoch, i-1, which is independent of the operations on CLP0. Within an epoch, each CLP processes only input that avoids data dependencies, which corresponds to inputs that lack interdependencies.) 
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate a method/system for determining a number of inputs for stages of a network that lack data interdependencies in a given epoch as taught by Shen into the disclosed invention of Woo/Chen.
One of ordinary skill in the arts would have been motivated to make this modification in order to implement “a new design paradigm that partitions hardware resources among multiple cooperating CLPs [convolutional layer processor]…resulting in better dynamic resource utilization and higher throughput” (Conclusion Shen)
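The epoch scheme described in the Shen passages can be sketched as a simple schedule, offered only as an illustration of why the inputs handled within one epoch are independent. The sketch assumes one CLP per layer and that CLP k works on image (epoch - k); the function names `epoch_schedule` and `inputs_in_flight` are hypothetical and are not taken from Shen.

```python
def epoch_schedule(num_stages, num_images):
    """Map each epoch to the (clp, image) pairs active in it.

    Assumption: CLP k handles layer k+1, and an image enters CLP0 one
    epoch after the previous image, so CLP k sees image (epoch - k).
    """
    schedule = {}
    for epoch in range(num_images + num_stages - 1):
        active = []
        for clp in range(num_stages):
            image = epoch - clp
            if 0 <= image < num_images:
                active.append((clp, image))
        schedule[epoch] = active
    return schedule


def inputs_in_flight(schedule, epoch):
    """Count the distinct images being processed during one epoch."""
    return len({image for _, image in schedule[epoch]})


# Five stages, as in the quoted example; eight images is an arbitrary choice.
schedule = epoch_schedule(num_stages=5, num_images=8)
steady_state_count = inputs_in_flight(schedule, epoch=4)
```

Under these assumptions no two CLPs touch the same image in the same epoch, and once the pipeline is full the count of independent inputs equals the number of stages, consistent with the quoted statement that data from five different images is in flight at a time.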

Regarding Claim 30
	Woo/Chen/Shen teaches Claim 29
	Further Woo teaches, wherein the logic circuitry is configured to determine the batch size based on a fixed amount of shared memory available for execution of the two or more stages of the cascaded neural network. (Col 14-15 line 65-2 “a storage capacity, or threshold capacity, of on-chip memory may be 500 megabyte (MB). Circuit 100 can determine total on-chip memory usage based on an equation 1 [Total usage=(working set*N)+parameters] where a variable N of equation 1 is a batch size” Examiner notes the fixed amount of memory available for execution corresponds to the 500 MB available on chip memory, that is shared by each layer in the superlayer that will utilize the memory for computation. The batch size N is a function of the Total Usage which is restricted by the available memory on chip. Thus batch size is based on the fixed amount of memory.) 
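The dependence of batch size on the fixed memory capacity can be made concrete by solving Woo's equation 1 for N. The following is a minimal sketch, not Woo's implementation: the function names are hypothetical, the 500 MB capacity comes from the quoted passage, the 200 MB parameter figure is the sum of the per-layer sizes Woo gives for layers A, B, and C, and the 40 MB working set is an assumed value chosen only for the example.

```python
def total_usage_mb(working_set_mb, batch_size, parameters_mb):
    """Equation 1 as quoted: total usage = (working set * N) + parameters."""
    return working_set_mb * batch_size + parameters_mb


def max_batch_size(capacity_mb, working_set_mb, parameters_mb):
    """Largest N for which equation 1 stays within the fixed capacity."""
    if working_set_mb <= 0:
        raise ValueError("working set size must be positive")
    free_mb = capacity_mb - parameters_mb
    if free_mb < 0:
        return 0  # parameters alone already exceed the capacity
    return free_mb // working_set_mb


CAPACITY_MB = 500      # from the quoted Woo passage
PARAMETERS_MB = 200    # 25 + 125 + 50 MB for layers A, B, C
WORKING_SET_MB = 40    # assumed for illustration; not stated in the quote

n = max_batch_size(CAPACITY_MB, WORKING_SET_MB, PARAMETERS_MB)
usage_at_n = total_usage_mb(WORKING_SET_MB, n, PARAMETERS_MB)
usage_at_n_plus_1 = total_usage_mb(WORKING_SET_MB, n + 1, PARAMETERS_MB)
```

With these inputs the largest admissible N is 7 (480 MB used), while N = 8 would need 520 MB and exceed the 500 MB threshold, which is the sense in which the batch size is bounded by the fixed amount of memory.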

Regarding Claim 31
	Woo/Chen/Shen teaches Claim 30
	Further Woo teaches, wherein the logic circuitry is configured to determine the batch size based on the fixed amount of shared memory available for execution of the two or more stages of the cascaded neural network (Examiner notes this was addressed in the rejection of claim 30) based on the maximum count (Col 9 line 60-62 “a total on-chip storage capacity associated with memory 102 and 104 may be limited to 20 storage units [maximum count]”) based on the size determined for each shared memory block allocations, (Col 9-10 line 63-3 “because a working set of two batch elements processed by layer B requires 16 storage units 204, processing of a third batch element would require 24 units of storage unit 204 and, thus, exceed the 20 storage unit capacity [fixed amount of shared memory]. So, in this example, a neural network may only support a particular maximum working set size that includes two batch elements, when processing each batch element requires at least 8 units of storage.” Examiner notes that because the capacity is restricted to 20 storage units, a batch size of three cannot be supported; instead a maximum batch size of two is determined.) and based on the count of inputs. (Examiner notes that because the batch size is related to the count of inputs, the determining is based on the count of inputs as well)

Regarding Claim 38
	Woo/Chen teaches Claim 26
	Woo teaches, wherein determining a batch size per stage. (“maximum batch size that can be supported by on-chip memory resources can be determined based on a size of a working set. In particular, the maximum batch size supported by storage units 204 can be determined based, in part, on the largest working set of inputs and parameters that are processed by a given neural network layer.” Examiner notes that a maximum batch size is determined for the set of layers that represent a stage or superlayer.)
	Woo/Chen does not explicitly teach, comprises determining a count of inputs for a stage of the two or more stages that lack interdependencies.
	Shen however, when addressing issues related to distributing computations to dedicated processing elements teaches, comprises determining a count of inputs for a stage of the two or more stages that lack interdependencies. (pg 4 Section 4.1 ¶04 and Figure 5 “In each epoch, each CLP only consumes data generated during the previous epoch, avoiding data dependencies within a epoch. For example, the output produced by L1 in epoch i will be used as input for L2 in epoch i+1. This means that processing an image requires five epochs, therefore data from five different images will be in flight at a time” Section 4.1 ¶01 “An accelerator for an L-stage CNN would have L CLPs and would operate on L independent input images.” As shown in Figure 5, in each epoch independent inputs are distributed among the available processors: CLP0 processes the input for L1, while CLP2 processes an output produced in a previous epoch, i-1, which is independent of the operations on CLP0. Within an epoch, each CLP processes only input that avoids data dependencies, which corresponds to inputs that lack interdependencies.) 
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate a method/system for determining a number of inputs for stages of a network that lack data interdependencies in a given epoch as taught by Shen into the disclosed invention of Woo/Chen.
One of ordinary skill in the arts would have been motivated to make this modification in order to implement “a new design paradigm that partitions hardware resources among multiple cooperating CLPs [convolutional layer processor]…resulting in better dynamic resource utilization and higher throughput” (Conclusion Shen)

Regarding Claim 39
	Woo/Chen/Shen teaches Claim 29
	Further Woo teaches, wherein determining a batch size per stage comprises determining the batch size based on a fixed amount of shared memory available for execution of the two or more stages of the cascaded neural network. (Col 14-15 line 65-2 “a storage capacity, or threshold capacity, of on-chip memory may be 500 megabyte (MB). Circuit 100 can determine total on-chip memory usage based on an equation 1 [Total usage=(working set*N)+parameters] where a variable N of equation 1 is a batch size” Examiner notes the fixed amount of memory available for execution corresponds to the 500 MB available on chip memory, that is shared by each layer in the superlayer that will utilize the memory for computation. The batch size N is a function of the Total Usage which is restricted by the available memory on chip. Thus batch size is based on the fixed amount of memory.) 

Regarding Claim 40
	Woo/Chen/Shen teaches Claim 30
	Further Woo teaches, wherein determining the batch size based on a fixed amount of shared memory available for execution of the two or more stages of the cascaded neural network comprises determining the batch size (Examiner notes this was addressed in the rejection of claim 39) based on the maximum count (Col 9 line 60-62 “a total on-chip storage capacity associated with memory 102 and 104 may be limited to 20 storage units [maximum count]”) the size determined for each shared memory block allocations (Col 9-10 line 63-3 “because a working set of two batch elements processed by layer B requires 16 storage units 204, processing of a third batch element would require 24 units of storage unit 204 and, thus, exceed the 20 storage unit capacity [fixed amount of shared memory]. So, in this example, a neural network may only support a particular maximum working set size that includes two batch elements, when processing each batch element requires at least 8 units of storage.” Examiner notes that because the capacity is restricted to 20 storage units, a batch size of three cannot be supported; instead a maximum batch size of two is determined.) and the count of inputs (Examiner notes that because the batch size is related to the count of inputs, the determining is based on the count of inputs as well)

Regarding Claim 50
	Woo/Chen teaches Claim 47
	Woo teaches, wherein determining a batch size per stage. (“maximum batch size that can be supported by on-chip memory resources can be determined based on a size of a working set. In particular, the maximum batch size supported by storage units 204 can be determined based, in part, on the largest working set of inputs and parameters that are processed by a given neural network layer.” Examiner notes that a maximum batch size is determined for the set of layers that represent a stage or superlayer.)
	Woo/Chen does not explicitly teach, comprises determining a count of inputs for a stage of the two or more stages that lack interdependencies.
	Shen however, when addressing issues related to distributing computations to dedicated processing elements teaches, comprises determining a count of inputs for a stage of the two or more stages that lack interdependencies. (pg 4 Section 4.1 ¶04 and Figure 5 “In each epoch, each CLP only consumes data generated during the previous epoch, avoiding data dependencies within a epoch. For example, the output produced by L1 in epoch i will be used as input for L2 in epoch i+1. This means that processing an image requires five epochs, therefore data from five different images will be in flight at a time” Section 4.1 ¶01 “An accelerator for an L-stage CNN would have L CLPs and would operate on L independent input images.” As shown in Figure 5, in each epoch independent inputs are distributed among the available processors: CLP0 processes the input for L1, while CLP2 processes an output produced in a previous epoch, i-1, which is independent of the operations on CLP0. Within an epoch, each CLP processes only input that avoids data dependencies, which corresponds to inputs that lack interdependencies.) 
It would have been obvious for one of ordinary skill in the arts before the effective filing date of the claimed invention to incorporate a method/system for determining a number of inputs for stages of a network that lack data interdependencies in a given epoch as taught by Shen into the disclosed invention of Woo/Chen.
One of ordinary skill in the arts would have been motivated to make this modification in order to implement “a new design paradigm that partitions hardware resources among multiple cooperating CLPs [convolutional layer processor]…resulting in better dynamic resource utilization and higher throughput” (Conclusion Shen)

Response to Arguments
Applicant's arguments filed 11/30/2021 have been fully considered but they are not persuasive.
Applicant states that “at best Chen teaches reuse but does not count shared data”. While Chen teaches filter reuse, Examiner notes that the filter is reused E^2 times. The number E^2 is based on the output feature map dimension. The Examiner mapped E^2 to the number or count of shared data, which is determined by the system based on the output feature map dimension, which is determined for each convolutional layer. Determining this number corresponds to a count as claimed. 
Further Applicant states that “Chen does not count concurrent shared memory block allocations for two or more stages of a CNN”. Examiner notes, the count, described previously, is determined for each convolutional layer, therefore across multiple stages.
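To illustrate how a per-layer value of E^2 follows from layer shape, the sketch below applies the standard no-padding convolution output-size relation, E = (H - R) / U + 1, for input size H, filter size R, and stride U. The function names and the two layer shapes are hypothetical and chosen only for the example; they are not taken from Chen or from the claims.

```python
def output_size(h, r, u):
    """Output feature map side length E for input h, filter r, stride u (no padding)."""
    if (h - r) % u != 0:
        raise ValueError("filter and stride do not tile the input evenly")
    return (h - r) // u + 1


def weight_reuse_count(h, r, u):
    """Number of output positions, E^2, at which one filter weight is reused."""
    e = output_size(h, r, u)
    return e * e


# Hypothetical layer shapes: (name, input size H, filter size R, stride U).
layers = [("conv_a", 227, 11, 4), ("conv_b", 31, 5, 1)]

reuse_per_layer = {name: weight_reuse_count(h, r, u) for name, h, r, u in layers}
```

Because E depends on each layer's own H, R, and U, the resulting count differs from layer to layer, which is the basis for reading the count as determined across multiple stages.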
Applicant states that Chen does not teach “determine a size for each of the shared memory block allocations”, stating the baseline storage area for all processing engines is not shared memory. Further stating 1) “Chen teaches to allocate baseline storage for each PE” 2) “there is no logical connection between shared memory and determining their size”. Examiner agrees that Chen teaches allocating storage for each PE. The data allocated to each PE is the determined shared filter weights. Calculating the amount of filter data to allocate to a PE corresponds to determining a size of an allocation based on the PE memory size. The logical connection between the shared memory and the size determination is that the data allocated to the PEs comes from the previously determined shared filters or shared memory.
Further Applicant states that “Chen does not teach allocate memory based on count and the size.” Further stating that Chen does not “determine a size of memory utilized by the NE2 ops, and that no new memory is allocated based on the number of ops”. Examiner notes that the determination of the size of allocations is not mapped to the NE^2 operations, but rather to the size of the filter memory mapped to each processing element. Chen calculates the mapping of E^2 shared filters to processing elements based on the number of E^2 filters and the size available to each processing element. Thus memory is allocated by folding or mapping shared operations into the register files of the processing elements.
Finally Applicant states that “Folding reused input in processing data flows is not the same as allocating memory”. Examiner notes that the art refers to the processing data flows as the operations performed by the Processing elements. Reused input is folded or mapped to the memory of the processing elements. Applicant provided no explanation as to why this action is not equivalent to allocating memory to the processing elements.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNATHAN R GERMICK whose telephone number is (571)272-8363. The examiner can normally be reached on Monday-Friday 7:30 am – 4:00 pm (EST).
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at telephone number (571)272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://portal.uspto.gov/external/portal. Should you have questions about access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
	
/J.R.G./Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122