DETAILED ACTION
This action is in response to the claims filed April 17th 2018. Claims 26-50 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 26-50 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding Claim 26
Step 1 Analysis: Claim 26 is directed to a computer system method, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The Claim recites a computer system. Each of the following limitations:
to determine a count of concurrent, shared memory block allocations;  
and determine a size of the shared memory block allocations of the count of concurrent, shared memory block allocations;
determining the information of the at least one dedicated hardware device

As drafted, each of these limitations is a process that, under its broadest reasonable interpretation, covers an abstract idea but for the recitation of a generic computing system method and processor. The above limitations, in the context of this claim, encompass determining (mental processes). As such, the claim recites an abstract idea.
Step 2A Prong Two Analysis: The judicial exception is not integrated into a practical application. In particular, the claim only recites additional elements that are mere instructions to implement an abstract idea, or that merely use a computer as a tool to perform an abstract idea. The additional element of “to accommodate data to store in a shared memory” amounts to adding extra-solution activity, as it describes a process that is a tangential addition to the claim (MPEP 2106.05(g)). The additional element of “for the two or more stages of the cascaded neural network for inference computations” generally links the use of the judicial exception to a particular environment (MPEP 2106.05(h)). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using a generic computer to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer. Mere instructions to apply an exception using a generic computer cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claims 27-34
Step 1 Analysis: The rejection of Claim 26 is incorporated; therefore, Claims 27-34 are directed to a computer system, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The claims recite a computer system method. Each of the following limitations:
to generate a list of the concurrent, shared memory block allocations
(to generate a) size information for each of the shared memory block allocations.
to determine a batch size 
determine a count of inputs
compare counts of the concurrent, shared memory block allocations
determine a maximum count of the counts of the shared memory block allocations
calculate a batch size

As drafted, each of these limitations is a process that, under its broadest reasonable interpretation, covers an abstract idea but for the recitation of a generic computing system method. The above limitations, in the context of these claims, encompass determining, comparing, and calculating (mental processes). As such, the claims recite an abstract idea.
Furthermore, the claims depend on Claim 26. As such, the incorporated rejection is directed to an abstract idea. Therefore, Claims 27-34 recite an abstract idea.
Step 2A Prong Two Analysis: The judicial exception is not integrated into a practical application. In particular, the claims only recite additional elements that are mere instructions to implement an abstract idea, or that merely use a computer as a tool to perform an abstract idea. The additional elements of “per stage of the two or more stages of the cascaded neural network,” “for a stage of the two or more stages that lack interdependencies,” “based on a fixed amount of shared memory available for execution of the two or more stages of the cascaded neural network,” “for tensor data to store in the shared memory and size information for the tensor data,” and “based on a ratio of an input data size of a first stage of the two or more stages to an input data size of another stage of the two or more stages” generally link the use of the judicial exception to a particular environment (MPEP 2106.05(h)). Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using a generic computer to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer. Mere instructions to apply an exception using a generic computer cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claims 35-50
Examiner notes claims 35, 44, and 47 are rejected under 35 U.S.C. 101 for the same reason that claim 26 is rejected.
Examiner notes the claims not yet addressed are rejected under 35 U.S.C. 101 for the reasons provided in the rejection of claim 

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 30 and 39 each recite the limitation "the batch size". There is insufficient antecedent basis for this limitation in the claims. Claims 31 and 40 are rejected by virtue of dependency.
Claims 31 and 40 each recite the limitation "the maximum count". There is insufficient antecedent basis for this limitation in the claims.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 26-28, 32-34, 35-37, 41-43, and 44-49 are rejected under 35 U.S.C. 103 as being unpatentable over Woo et al., US 10019668 B1 (hereinafter Woo), in view of Sekiyama et al., “Profile-guided memory optimization for deep neural networks” (hereinafter Sekiyama).

Regarding Claim 26
	Woo teaches, An apparatus to manage memory resources, the apparatus comprising: memory; and logic circuitry coupled with the memory to determine, for two or more stages of a cascaded neural network (Abstract “The subject matter described in this specification includes systems and methods… determining a partitioning of the neural network layers into a sequence of superlayers. Each superlayer is a partition of the directed graph that includes one or more layers… The method includes processing the batch of inputs using the hardware circuit [logic circuit for determining]” Examiner notes that partitioning a neural network into a sequence of superlayers corresponds to stages of a cascaded neural network.)
	Woo does not appear to explicitly teach, a count of concurrent, shared memory block allocations; and determine a size for each of the shared memory block allocations of the count of concurrent, shared memory block allocations, to accommodate data to store in a shared memory for the two or more stages of the cascaded neural network for inference computations. 
However Sekiyama, when addressing issues related to optimizing memory block allocation, teaches, a count of memory block allocations; and determine a size for each of the shared memory block allocations of the count of concurrent, (Section 3 ¶01 “We formulate the memory allocation problem as a special case of the two dimensional rectangle packing problems that is known as the Dynamic Storage Allocation (DSA)” Section 3.1 ¶01 “We suppose that the number of requested memory blocks, times when memory blocks are requested and released, and sizes of memory blocks are given by the profile” Examiner notes that the count and size of each block are assessed by the profile. Some of the blocks are memory blocks that overlap, as addressed in the following response to the “concurrent shared memory” limitation; thus the profiling includes the sizes of block allocations that are concurrent in the shared memory block.) concurrent, shared memory block allocations; (Section 3.1 ¶02 “We do not need to check for all pairs of memory blocks: it suffices to check those with overlapping lifetimes. To this end, we introduce a notion of possible colliding pairs… which is a set of memory block pairs that have overlapping lifetimes. Note that any two memory blocks not in E do not share the same address space at the same time because their lifetimes do not overlap” Examiner notes that the profile determines which blocks are concurrent or overlapping by including them in the set E based on their lifetimes. The count and block sizes are described by the set E.) to accommodate data to store in a shared memory for the two or more stages of the cascaded neural network for inference computations. 
(Introduction ¶003 “We study memory optimization for DNNs… we can profile the memory usage in a sample run and then utilize the profile to find the allocation of memory [shared/utilized by the DNN] to minimize the peak memory usage in the succeeding runs” Examiner notes that the profile allocates/accommodates data in memory for a DNN. When combined with Woo, the DNN is a cascaded neural network.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method/system for profiling memory block allocation and concurrency, as taught by Sekiyama, into the disclosed invention of Woo.
One of ordinary skill in the art would have been motivated to make this modification in order to implement a method capable of running DNNs that require huge amounts of memory on devices with limited, hard-to-extend memory (Sekiyama, Abstract).
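For illustration only, and not as a reproduction of either reference, the overlapping-lifetime check quoted from Sekiyama Section 3.1 can be sketched as follows. The `Block` record, the sample profile values, and the function names are assumptions introduced here; the sketch merely shows how a count of concurrent blocks and their sizes could be read from a profile of request times, release times, and sizes.

```python
from itertools import combinations
from typing import NamedTuple


class Block(NamedTuple):
    name: str   # hypothetical label for a profiled memory block
    start: int  # time step at which the block is requested
    end: int    # time step at which the block is released (exclusive)
    size: int   # block size in bytes


def colliding_pairs(blocks):
    """Return the pairs of blocks whose lifetimes overlap (the set E in the quoted passage)."""
    return [
        (a.name, b.name)
        for a, b in combinations(blocks, 2)
        if a.start < b.end and b.start < a.end
    ]


def peak_concurrency(blocks):
    """Return (largest count of simultaneously live blocks, sizes of those blocks)."""
    best = []
    # The live count can only rise at a request time, so checking those suffices.
    for t in sorted({b.start for b in blocks}):
        live = [b for b in blocks if b.start <= t < b.end]
        if len(live) > len(best):
            best = live
    return len(best), [b.size for b in best]


# Hypothetical profile: (name, request time, release time, size)
profile = [
    Block("a", 0, 3, 64),
    Block("b", 1, 4, 128),
    Block("c", 3, 5, 32),
    Block("d", 5, 6, 16),
]
pairs = colliding_pairs(profile)
count, sizes = peak_concurrency(profile)
```

In this assumed profile, blocks that never overlap in time (for example "a" and "c") are absent from the pair list and could therefore share an address range.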

Regarding Claim 27
	Woo/Sekiyama teaches Claim 26
	Further Woo teaches, wherein the logic circuitry is configured to generate a list of the block allocations and size information for each of the block allocations. (Col 15 line 5-9 “referencing FIG. 5, circuit 100 can determine that: i) a set of parameters for layer A requires 25 MB of memory; ii) a set of parameters for layer B requires 125 MB of memory; and iii) a set of parameters for layer C requires 50 MB of memory.” Examiner notes that the determination for each of the 3 layers corresponds to a list of block allocations with their corresponding size information as denoted by the required memory)
	

Regarding Claim 28
	Woo/Sekiyama teaches Claim 26
	Further Woo teaches, wherein the logic circuitry is configured to determine a batch size per stage of the two or more stages of the cascaded neural network. ( “maximum batch size that can be supported by on-chip memory resources can be determined based on a size of a working set. In particular, the maximum batch size supported by storage units 204 can be determined based, in part, on the largest working set of inputs and parameters that are processed by a given neural network layer.” Examiner notes that the maximum batch size is determined for the set of layers that represent a stage or superlayer. This determination is made on each superlayer or stage in order to efficiently utilize available memory)

Regarding Claim 32
	Woo/Sekiyama teaches Claim 26
	Further Woo teaches, wherein the logic circuitry is configured to determine a count of shared memory block allocations for tensor data to store in the shared memory and size information for the tensor data. (Col 6 line 7-10 “Circuit 100 can be an example compute unit or compute tile and can include additional hardware structures to perform computations associated with multi-dimensional data structures such as tensors, matrices and/or data arrays” Examiner notes that the compute units associated with the storage units and blocks addressed in the prior rejections are associated with tensor data.)

Regarding Claim 33
	Woo/Sekiyama teaches Claim 26
	Further Woo teaches, wherein the logic circuitry is configured to compare counts of …memory block allocations of the cascaded neural network to determine a maximum count of the counts of the shared memory block allocations. (Col 9 line 60-62 “a total on-chip storage capacity associated with memory 102 and 104 may be limited to 20 storage units [maximum count]” Col 9-10 line 63—3 “because a working set of two batch elements processed by layer B requires 16 storage units 204, processing of a third batch element would require 24 units of storage unit 204 and, thus, exceed the 20 storage unit capacity [fixed amount of shared memory]. So, in this example, a neural network may only support a particular maximum working set size that includes two batch elements, when processing each batch element requires at least 8 units of storage.” Examiner notes that in the particular example when the counts of various batch sizes are compared the maximum count or maximum working set size is 8 units of storage per batch, wherein the batch number is restricted to 2 so that the total maximum counts is 16.)
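For illustration only, the arithmetic in the passage quoted from Woo (a 20 storage unit capacity and 8 storage units per batch element) can be checked with the short sketch below. The function name and argument names are assumptions introduced here and do not appear in Woo.

```python
def max_batch_size(capacity_units: int, units_per_element: int) -> int:
    """Largest batch whose working set fits within the given storage capacity."""
    if units_per_element <= 0:
        raise ValueError("units_per_element must be positive")
    return capacity_units // units_per_element


capacity = 20     # on-chip storage units, per the quoted passage
per_element = 8   # storage units needed per batch element for layer B
batch = max_batch_size(capacity, per_element)
working_set = batch * per_element           # units used at the maximum batch size
next_working_set = (batch + 1) * per_element  # units a third element would need
```

With these values, two batch elements occupy 16 storage units, and a third would require 24 units, which exceeds the 20 unit capacity.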

Regarding Claim 34
	Woo/Sekiyama teaches Claim 26
	Further Woo teaches, wherein the logic circuitry is configured to calculate a batch size based on a ratio of an input data size of a first stage of the two or more stages to an input data size of another stage of the two or more stages. (“Circuit 100 can determine total on-chip memory usage based on an equation 1 [Total usage=(working set*N)+parameters] where a variable N of equation 1 is a batch size” Examiner notes that the “total usage” is representative of the data size, including input data, of a plurality of layers. The “working set” corresponds to the input data size of a first stage. The equation can be reformulated as $\frac{\text{Total usage}}{\text{working set}} = \text{Batch size}$. Thus the batch size is based on the ratio.)
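For illustration only, equation 1 as quoted from Woo and its rearrangement for the batch size N can be evaluated numerically as sketched below. The function names and the sample values are assumptions introduced here. Solving equation 1 for N gives N = (Total usage - parameters) / working set, which reduces to the ratio of total usage to working set when the parameters term is zero.

```python
def total_usage(working_set: float, batch_size: float, parameters: float) -> float:
    """Equation 1 as quoted: Total usage = (working set * N) + parameters."""
    return working_set * batch_size + parameters


def batch_size_from_usage(total: float, working_set: float, parameters: float = 0.0) -> float:
    """Equation 1 solved for N; with parameters == 0 this is the plain ratio."""
    if working_set <= 0:
        raise ValueError("working_set must be positive")
    return (total - parameters) / working_set


# Hypothetical values chosen only to exercise the formula
usage = total_usage(working_set=8, batch_size=2, parameters=4)
n_general = batch_size_from_usage(usage, working_set=8, parameters=4)
n_ratio = batch_size_from_usage(16, working_set=8)
```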

Regarding Claim 35
	Woo teaches, A method to manage memory resources, the method comprising: determining, by a memory management logic circuitry, for two or more stages of a cascaded neural network, (Abstract “The subject matter described in this specification includes systems and methods… determining a partitioning of the neural network layers into a sequence of superlayers. Each superlayer is a partition of the directed graph that includes one or more layers… The method includes processing the batch of inputs using the hardware circuit [logic circuit for determining]” Examiner notes that partitioning a neural network into a sequence of superlayers corresponds to stages of a cascaded neural network.)
	Woo does not appear to explicitly teach, a count of concurrent, shared memory block allocations; and determining, by the memory management logic circuitry, a size for each of the shared memory block allocations of the count of concurrent, shared memory block allocations, to accommodate data to store in a shared memory for the two or more stages of the cascaded neural network for inference computations.
However Sekiyama, when addressing issues related to optimizing memory block allocation, teaches, a count of memory block allocations; and determining, by the memory management logic circuitry, a size for each of the shared memory block allocations of the count of concurrent, shared memory block allocations, (Section 3 ¶01 “We formulate the memory allocation problem as a special case of the two dimensional rectangle packing problems that is known as the Dynamic Storage Allocation (DSA)” Section 3.1 ¶01 “We suppose that the number of requested memory blocks, times when memory blocks are requested and released, and sizes of memory blocks are given by the profile” Examiner notes that the count and size of each block are assessed by the profile. Some of the blocks are memory blocks that overlap, as addressed in the following response to the “concurrent shared memory” limitation; thus the profiling includes the sizes of block allocations that are concurrent in the shared memory block.) concurrent, shared memory block allocations; (Section 3.1 ¶02 “We do not need to check for all pairs of memory blocks: it suffices to check those with overlapping lifetimes. To this end, we introduce a notion of possible colliding pairs… which is a set of memory block pairs that have overlapping lifetimes. Note that any two memory blocks not in E do not share the same address space at the same time because their lifetimes do not overlap” Examiner notes that the profile determines which blocks are concurrent or overlapping by including them in the set E based on their lifetimes. The count and block sizes are described by the set E.) to accommodate data to store in a shared memory for the two or more stages of the cascaded neural network for inference computations. 
(Introduction ¶003 “We study memory optimization for DNNs… we can profile the memory usage in a sample run and then utilize the profile to find the allocation of memory [shared/utilized by the DNN] to minimize the peak memory usage in the succeeding runs” Examiner notes that the profile allocates/accommodates data in memory for a DNN. When combined with Woo, the DNN is a cascaded neural network.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method/system for profiling memory block allocation and concurrency, as taught by Sekiyama, into the disclosed invention of Woo.
One of ordinary skill in the art would have been motivated to make this modification in order to implement a method capable of running DNNs that require huge amounts of memory on devices with limited, hard-to-extend memory (Sekiyama, Abstract).

Regarding Claim 36
	Woo/Sekiyama teaches Claim 35
	Further Woo teaches, generating, by the memory management logic circuitry, a list of the block allocations and size information for each of the block allocations. (Col 15 line 5-9 “referencing FIG. 5, circuit 100 can determine that: i) a set of parameters for layer A requires 25 MB of memory; ii) a set of parameters for layer B requires 125 MB of memory; and iii) a set of parameters for layer C requires 50 MB of memory.” Examiner notes that the determination for each of the 3 layers corresponds to a list of block allocations with their corresponding size information as denoted by the required memory)

Regarding Claim 37
	Woo/Sekiyama teaches Claim 35
	Further Woo teaches, determining a batch size per stage of the two or more stages of the cascaded neural network. (“maximum batch size that can be supported by on-chip memory resources can be determined based on a size of a working set. In particular, the maximum batch size supported by storage units 204 can be determined based, in part, on the largest working set of inputs and parameters that are processed by a given neural network layer.” Examiner notes that the maximum batch size is determined for the set of layers that represent a stage or superlayer. This determination is made on each superlayer or stage in order to efficiently utilize available memory)


Regarding Claim 41
	Woo/Sekiyama teaches Claim 35
	Further Woo teaches, wherein determining a count of the concurrent, shared memory block allocations comprises determining the count of shared memory block allocations for tensor data to store in the shared memory and size information for the tensor data.  (Col 6 line 7-10 “Circuit 100 can be an example compute unit or compute tile and can include additional hardware structures to perform computations associated with multi-dimensional data structures such as tensors, matrices and/or data arrays” Examiner notes that the compute units associated with the storage units and memory blocks addressed in the prior rejections are associated with tensor data.)

Regarding Claim 42
	Woo/Sekiyama teaches Claim 41
	Further Woo teaches, wherein determining a size for each of the shared memory block allocations comprises comparing counts of the concurrent, shared memory block allocations of the cascaded neural network to determine a maximum count of the counts of the concurrent, shared memory block allocations. (Col 9 line 60-62 “a total on-chip storage capacity associated with memory 102 and 104 may be limited to 20 storage units [maximum count]” Col 9-10 line 63—3 “because a working set of two batch elements processed by layer B requires 16 storage units 204, processing of a third batch element would require 24 units of storage unit 204 and, thus, exceed the 20 storage unit capacity [fixed amount of shared memory]. So, in this example, a neural network may only support a particular maximum working set size that includes two batch elements, when processing each batch element requires at least 8 units of storage.” Examiner notes that in the particular example when the counts of various batch sizes are compared the maximum count or maximum working set size is 8 units of storage per batch, wherein the batch number is determined to be at most 2 and the maximum count is determined to be 16)

Regarding Claim 43
	Woo/Sekiyama teaches Claim 42
	Further Woo teaches, wherein the logic circuitry is configured to calculate a batch size based on a ratio of an input data size of a first stage of the two or more stages to an input data size of another stage of the two or more stages. (“Circuit 100 can determine total on-chip memory usage based on an equation 1 [Total usage=(working set*N)+parameters] where a variable N of equation 1 is a batch size” Examiner notes that the “total usage” is representative of the data size, including input data, of a plurality of layers. The “working set” corresponds to the input data size of a first stage. The equation can be reformulated as $\frac{\text{Total usage}}{\text{working set}} = \text{Batch size}$. Thus the batch size is based on the ratio.)


Regarding Claim 44
Woo teaches, A system to manage memory resources, the system comprising: a communications interface; and a processing component (Abstract “The subject matter described in this specification includes systems and methods… determining a partitioning of the neural network layers into a sequence of superlayers. Each superlayer is a partition of the directed graph that includes one or more layers… The method includes processing the batch of inputs using the hardware circuit [communications interface and a processing component]” Examiner notes that partitioning a neural network into a sequence of superlayers corresponds to stages of a cascaded neural network. Examiner notes that the hardware circuit that processes the batches and communicates the results to the rest of the system corresponds to the claimed hardware for managing memory resources.) wherein the processing component comprises dynamic random-access memory. (Col 8-9 line 65-1 “This threshold storage capacity may be less than, or substantially less than, a storage capacity of a dynamic random access memory (DRAM) resource that is associated with off-chip memory of circuit 100.” Examiner notes that the DRAM is associated with off-chip memory of circuit 100, wherein the circuit is part of the processing component.)
	Woo does not appear to explicitly teach, to determine a count and a size of concurrent, shared memory block allocations in multiple stages of a cascaded neural network, to accommodate data to store in a shared memory for the multiple stages of the cascaded neural network for performance of inference computations;
However Sekiyama, when addressing issues related to optimizing memory block allocation, teaches, to determine a count and a size of memory block allocations in multiple stages of a cascaded neural network, (Section 3 ¶01 “We formulate the memory allocation problem as a special case of the two dimensional rectangle packing problems that is known as the Dynamic Storage Allocation (DSA)” Section 3.1 ¶01 “We suppose that the number of requested memory blocks, times when memory blocks are requested and released, and sizes of memory blocks are given by the profile” Examiner notes that the count and size of each block are assessed by the profile. Some of the blocks are memory blocks that overlap, as addressed in the following response to the “concurrent shared memory” limitation; thus the profiling includes the sizes of block allocations that are concurrent in the shared memory block.) concurrent, shared memory block allocations; (Section 3.1 ¶02 “We do not need to check for all pairs of memory blocks: it suffices to check those with overlapping lifetimes. To this end, we introduce a notion of possible colliding pairs… which is a set of memory block pairs that have overlapping lifetimes. Note that any two memory blocks not in E do not share the same address space at the same time because their lifetimes do not overlap” Examiner notes that the profile determines which blocks are concurrent or overlapping by including them in the set E based on their lifetimes. The count and block sizes are described by the set E.) 
to accommodate data to store in a shared memory for the multiple stages of the cascaded neural network for performance of inference computations; (Introduction ¶003 “We study memory optimization for DNNs… we can profile the memory usage in a sample run and then utilize the profile to find the allocation of memory [shared/utilized by the DNN] to minimize the peak memory usage in the succeeding runs” Examiner notes that the profile allocates/accommodates data in memory for a DNN. When combined with Woo, the DNN is a cascaded neural network.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method/system for profiling memory block allocation and concurrency, as taught by Sekiyama, into the disclosed invention of Woo.
One of ordinary skill in the art would have been motivated to make this modification in order to implement a method capable of running DNNs that require huge amounts of memory on devices with limited, hard-to-extend memory (Sekiyama, Abstract).

Regarding Claim 45
	Woo/Sekiyama teaches Claim 44
	Further Woo teaches, wherein the processing component is configured to generate a list of the block allocations and size information for each of the block allocations. (Col 15 line 5-9 “referencing FIG. 5, circuit 100 can determine that: i) a set of parameters for layer A requires 25 MB of memory; ii) a set of parameters for layer B requires 125 MB of memory; and iii) a set of parameters for layer C requires 50 MB of memory.” Examiner notes that the determination for each of the 3 layers corresponds to a list of block allocations with their corresponding size information as denoted by the required memory)
	

Regarding Claim 46
	Woo/Sekiyama teaches Claim 44
	Further Woo teaches, wherein the processing component is configured to determine a batch size per stage of the multiple stages of the cascaded neural network. (“maximum batch size that can be supported by on-chip memory resources can be determined based on a size of a working set. In particular, the maximum batch size supported by storage units 204 can be determined based, in part, on the largest working set of inputs and parameters that are processed by a given neural network layer.” Examiner notes that the maximum batch size is determined for the set of layers that represent a stage or superlayer. This determination is made on each superlayer or stage in order to efficiently utilize available memory)


Regarding Claim 47
Woo teaches, A non-transitory machine-readable medium containing instructions, (Col 15-16 line 67-4 “one or more modules of computer program instructions encoded on a tangible non transitory program carrier for execution by, or to control the operation of, data processing apparatus”) which when executed by a processor, cause the processor to perform operations, the operations comprising: determining, by a memory management logic circuitry, for two or more stages of a cascaded neural network, (Abstract “The subject matter described in this specification includes systems and methods… determining a partitioning of the neural network layers into a sequence of superlayers. Each superlayer is a partition of the directed graph that includes one or more layers… The method includes processing the batch of inputs using the hardware circuit [processor]” Examiner notes that a partitioning a neural network into a sequence of superlayers corresponds to stages of a cascaded neural network.)
	Woo does not appear to explicitly teach, a count of concurrent, shared memory block allocations; and determining, by the memory management logic circuitry, a size for each of the shared memory block allocations of the count of concurrent, shared memory block allocations, to accommodate data to store in a shared memory for the two or more stages of the cascaded neural network for inference computations.
However, Sekiyama, when addressing issues related to optimizing memory block allocation, teaches a count of memory block allocations; and determining, by the memory management logic circuitry, a size for each of the shared memory block allocations of the count of memory block allocations, (Section 3 ¶01 “We formulate the memory allocation problem as a special case of the two dimensional rectangle packing problems that is known as the Dynamic Storage Allocation (DSA)” Section 3.1 ¶01 “We suppose that the number of requested memory blocks, times when memory blocks are requested and released, and sizes of memory blocks are given by the profile” Examiner notes that the count and size of each block are assessed by the profile. Some of the blocks have overlapping lifetimes, as addressed in the following response to the “concurrent, shared memory” limitation; thus the profiling includes the sizes of block allocations that are concurrent in the shared memory.) concurrent, shared memory block allocations; (Section 3.1 ¶02 “We do not need to check for all pairs of memory blocks: it suffices to check those with overlapping lifetimes. To this end, we introduce a notion of possible colliding pairs… which is a set of memory block pairs that have overlapping lifetimes. Note that any two memory blocks not in E do not share the same address space at the same time because their lifetimes do not overlap” Examiner notes that the profile determines which blocks are concurrent or overlapping by including them in the set E based on their lifetimes, wherein the count and block sizes are described by the set E.) to accommodate data to store in a shared memory for the two or more stages of the cascaded neural network for inference computations.
(Introduction ¶003 “We study memory optimization for DNNs… we can profile the memory usage in a sample run and then utilize the profile to find the allocation of memory [shared/utilized by the DNN] to minimize the peak memory usage in the succeeding runs” Examiner notes that the profile allocates/accommodates data in memory for a DNN. When combined with Woo, the DNN is a cascaded neural network.)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method/system for profiling memory block allocation and concurrency, as taught by Sekiyama, into the disclosed invention of Woo.
One of ordinary skill in the art would have been motivated to make this modification in order to implement a method capable of running DNNs that require huge amounts of memory on devices with limited, hard-to-extend memory (Abstract Sekiyama).
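For illustration only, the overlapping-lifetime check quoted from Sekiyama above can be sketched as follows. This is a minimal sketch and not Sekiyama's implementation: the Block record, the helper names colliding_pairs and peak_concurrency, and the sample profile values are hypothetical, and lifetimes are assumed to be half-open intervals [start, end).

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Block:
    """One profiled memory request (hypothetical record layout)."""
    name: str
    size: int   # bytes requested
    start: int  # time step at which the block is requested
    end: int    # time step at which the block is released (exclusive)


def overlaps(a: Block, b: Block) -> bool:
    # Half-open lifetimes overlap when each starts before the other ends.
    return a.start < b.end and b.start < a.end


def colliding_pairs(blocks):
    # Pairs with overlapping lifetimes; only these can conflict in address space.
    return [(a.name, b.name) for a, b in combinations(blocks, 2) if overlaps(a, b)]


def peak_concurrency(blocks):
    # Sweep over request/release events; return (max live count, max live bytes).
    events = []
    for b in blocks:
        events.append((b.start, 1, b.size))
        events.append((b.end, -1, -b.size))
    # At equal time steps, releases (-1) sort before requests (+1).
    events.sort()
    live = live_bytes = max_live = max_bytes = 0
    for _, delta, size in events:
        live += delta
        live_bytes += size
        max_live = max(max_live, live)
        max_bytes = max(max_bytes, live_bytes)
    return max_live, max_bytes


# Hypothetical profile of four memory requests.
profile = [
    Block("a", 64, 0, 4),
    Block("b", 32, 2, 6),
    Block("c", 128, 4, 8),
    Block("d", 16, 5, 7),
]
pairs = colliding_pairs(profile)
count, peak = peak_concurrency(profile)
```

In this hypothetical profile, blocks a and c never collide because a is released at the time step where c is requested, so only the remaining overlapping pairs need to be checked when assigning addresses.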

Regarding Claim 48
	Woo/Sekiyama teaches Claim 47
	Further Woo teaches, wherein the operations further comprise generating, by the memory management logic circuitry, a list of the block allocations and size information for each of the block allocations. (Col 15 line 5-9 “referencing FIG. 5, circuit 100 can determine that: i) a set of parameters for layer A requires 25 MB of memory; ii) a set of parameters for layer B requires 125 MB of memory; and iii) a set of parameters for layer C requires 50 MB of memory.” Examiner notes that the determination for each of the 3 layers corresponds to a list of block allocations with their corresponding size information as denoted by the required memory)
	

Regarding Claim 49
	Woo/Sekiyama teaches Claim 47
	Further Woo teaches, wherein the operations further comprise determining a batch size per stage of the two or more stages of the cascaded neural network. (“maximum batch size that can be supported by on-chip memory resources can be determined based on a size of a working set. In particular, the maximum batch size supported by storage units 204 can be determined based, in part, on the largest working set of inputs and parameters that are processed by a given neural network layer.” Examiner notes that the maximum batch size is determined for the set of layers that represent a stage or superlayer. This determination is made on each superlayer or stage in order to efficiently utilize available memory)


Claims 29-31, 38-40, and 50 are rejected under 35 U.S.C. 103 as being unpatentable over Woo/Sekiyama, further in view of Weber et al. “BrainSlug: Transparent Acceleration of Deep Learning Through Depth-First Parallelism” hereinafter Weber.

Regarding Claim 29
	Woo/Sekiyama teaches Claim 26
	Woo/Sekiyama does not explicitly teach, wherein the logic circuitry is configured to determine a count of inputs for a stage of the two or more stages that lack interdependencies.
	Weber, however, when addressing issues related to searching a neural network for parallelism opportunities based on compute dependencies, teaches, wherein the logic circuitry is configured to determine a count of inputs for a stage of the two or more stages that lack interdependencies. (Figure 5 and Section 3.1 ¶03 “The computation graph is the same but the operations are grouped according to the independent computation paths involving the normalization and non-linear operations, which are merged in the pooling layer. A stack, therefore, partitions independent computation paths in the DAG into parallelizable code blocks such that each such block’s intermediate data fits into the device caches.” Examiner notes that the computations that lack interdependencies are pushed to a stack; the size of the stack is proportional to the count of the inputs for a stage, i.e., some of the layers whose computations are independent of each other.)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method/system for determining a number of computations, including inputs of a directed graph, that can be parallelized as a result of their lack of interdependency, as taught by Weber, into the disclosed invention of Woo/Sekiyama.
One of ordinary skill in the art would have been motivated to make this modification in order to “transparently accelerate neural network workloads by changing the default layer-by-layer processing to a depth-first approach, reducing the amount of data required by the computations and thus improving the performance of the available hardware caches” (Abstract Weber).
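For illustration only, counting the inputs of a stage that lack interdependencies can be sketched as below. This is not Weber's algorithm and is not taken from any cited reference: the function name independent_inputs, the dependency-map representation, and the sample graph are hypothetical, and an input is assumed to "lack interdependencies" when it has no dependency edge to or from another input of the same stage.

```python
def independent_inputs(stage_inputs, depends_on):
    """Return the inputs of one stage that have no dependency edge
    to or from any other input of the same stage.

    stage_inputs: list of input names belonging to the stage
    depends_on:   dict mapping a name to the names it depends on
    """
    stage = set(stage_inputs)
    involved = set()
    for node, deps in depends_on.items():
        if node not in stage:
            continue
        for dep in deps:
            # Only edges between two inputs of this stage count.
            if dep in stage:
                involved.add(node)
                involved.add(dep)
    return [name for name in stage_inputs if name not in involved]


# Hypothetical stage: x2 depends on x1; x3 depends only on an
# out-of-stage value "w", so it remains independent within the stage.
stage_inputs = ["x0", "x1", "x2", "x3"]
depends_on = {"x2": ["x1"], "x3": ["w"]}

free = independent_inputs(stage_inputs, depends_on)
count_of_independent_inputs = len(free)
```

Under these assumptions the count is the number of inputs that could be processed in parallel within the stage without waiting on one another.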

Regarding Claim 30
	Woo/Sekiyama/Weber teaches Claim 29
	Further Woo teaches, wherein the logic circuitry is configured to determine the batch size based on a fixed amount of shared memory available for execution of the two or more stages of the cascaded neural network. (Col 14-15 line 65-2 “a storage capacity, or threshold capacity, of on-chip memory may be 500 megabyte (MB). Circuit 100 can determine total on-chip memory usage based on an equation 1 [Total usage=(working set*N)+parameters] where a variable N of equation 1 is a batch size” Examiner notes the fixed amount of memory available for execution corresponds to the 500 MB available on chip memory, that is shared by each layer in the superlayer that will utilize the memory for computation. The batch size N is a function of the Total Usage which is restricted by the available memory on chip. Thus batch size is based on the fixed amount of memory.) 
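For illustration only, the relationship in Woo's quoted equation 1, Total usage = (working set * N) + parameters, can be solved for the largest batch size N that fits within a fixed capacity. This is a minimal sketch: the function name max_batch_size is hypothetical, the 10 MB per-element working set is a hypothetical value not found in the cited passage, and summing the layer A, B, and C parameter sizes quoted from Woo elsewhere in this action (25 MB, 125 MB, 50 MB) is an assumption made for the example.

```python
def max_batch_size(capacity_mb: float, working_set_mb: float, parameters_mb: float) -> int:
    """Largest integer N with (working_set_mb * N) + parameters_mb <= capacity_mb."""
    if working_set_mb <= 0:
        raise ValueError("working_set_mb must be positive")
    remaining = capacity_mb - parameters_mb
    if remaining < 0:
        return 0  # the parameters alone exceed the capacity
    return int(remaining // working_set_mb)


CAPACITY_MB = 500               # threshold capacity quoted from Woo
PARAMETERS_MB = 25 + 125 + 50   # quoted layer figures; summing them is an assumption
WORKING_SET_MB = 10             # hypothetical per-batch-element working set

n = max_batch_size(CAPACITY_MB, WORKING_SET_MB, PARAMETERS_MB)
total_usage = WORKING_SET_MB * n + PARAMETERS_MB
```

Because the capacity is fixed, any increase in the parameter footprint or the per-element working set lowers N, which is the sense in which the batch size is based on the fixed amount of memory.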

Regarding Claim 31
	Woo/Sekiyama/Weber teaches Claim 30
	Further Woo teaches, wherein the logic circuitry is configured to determine the batch size based on the fixed amount of shared memory available for execution of the two or more stages of the cascaded neural network (Examiner notes this was addressed in the rejection of claim 30) based on the maximum count (Col 9 line 60-62 “a total on-chip storage capacity associated with memory 102 and 104 may be limited to 20 storage units [maximum count]”) based on the size determined for each shared memory block allocations, (Col 9-10 line 63-3 “because a working set of two batch elements processed by layer B requires 16 storage units 204, processing of a third batch element would require 24 units of storage unit 204 and, thus, exceed the 20 storage unit capacity [fixed amount of shared memory]. So, in this example, a neural network may only support a particular maximum working set size that includes two batch elements, when processing each batch element requires at least 8 units of storage.” Examiner notes that because the capacity is restricted to 20 storage units, a batch size of three cannot be supported; instead, a maximum batch size of two is determined.) and based on the count of inputs. (Examiner notes that because the batch size is related to the count of inputs, the determining is based on the count of inputs as well)
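For illustration only, the arithmetic in the passage quoted from Woo (Col 9-10) can be checked directly. This is a minimal sketch: the function name max_batch_elements is hypothetical, and storage is assumed to scale linearly at 8 units per batch element, as the quoted figures (16 units for two elements, 24 for three) imply.

```python
def max_batch_elements(capacity_units: int, units_per_element: int) -> int:
    """Largest number of batch elements whose working set fits in capacity."""
    return capacity_units // units_per_element


CAPACITY_UNITS = 20     # on-chip capacity quoted from Woo
UNITS_PER_ELEMENT = 8   # storage units per batch element quoted from Woo

# Storage needed for one, two, and three batch elements.
usage = {n: n * UNITS_PER_ELEMENT for n in (1, 2, 3)}
# Whether each candidate batch size fits within the capacity.
fits = {n: units <= CAPACITY_UNITS for n, units in usage.items()}
largest = max_batch_elements(CAPACITY_UNITS, UNITS_PER_ELEMENT)
```

Two batch elements need 16 units and fit, while a third would need 24 units and exceed the 20-unit capacity, consistent with the quoted passage.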

Regarding Claim 38
	Woo/Sekiyama teaches Claim 26
	Woo/Sekiyama teaches, wherein determining a batch size per stage. (“maximum batch size that can be supported by on-chip memory resources can be determined based on a size of a working set. In particular, the maximum batch size supported by storage units 204 can be determined based, in part, on the largest working set of inputs and parameters that are processed by a given neural network layer.” Examiner notes that a maximum batch size is determined for the set of layers that represent a stage or superlayer.)
	Woo/Sekiyama does not explicitly teach, comprises determining a count of inputs for a stage of the two or more stages that lack interdependencies.
	Weber, however, when addressing issues related to searching a neural network for parallelism opportunities based on compute dependencies, teaches, comprises determining a count of inputs for a stage of the two or more stages that lack interdependencies. (Figure 5 and Section 3.1 ¶03 “The computation graph is the same but the operations are grouped according to the independent computation paths involving the normalization and non-linear operations, which are merged in the pooling layer. A stack, therefore, partitions independent computation paths in the DAG into parallelizable code blocks such that each such block’s intermediate data fits into the device caches.” Examiner notes that the computations that lack interdependencies are pushed to a stack; the size of the stack is proportional to the count of the inputs for a stage, i.e., some of the layers whose computations are independent of each other.)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method/system for determining a number of computations, including inputs of a directed graph, that can be parallelized as a result of their lack of interdependency, as taught by Weber, into the disclosed invention of Woo/Sekiyama.
One of ordinary skill in the art would have been motivated to make this modification in order to “transparently accelerate neural network workloads by changing the default layer-by-layer processing to a depth-first approach, reducing the amount of data required by the computations and thus improving the performance of the available hardware caches” (Abstract Weber).

Regarding Claim 39
	Woo/Sekiyama/Weber teaches Claim 29
	Further Woo teaches, wherein determining a batch size per stage comprises determining the batch size based on a fixed amount of shared memory available for execution of the two or more stages of the cascaded neural network. (Col 14-15 line 65-2 “a storage capacity, or threshold capacity, of on-chip memory may be 500 megabyte (MB). Circuit 100 can determine total on-chip memory usage based on an equation 1 [Total usage=(working set*N)+parameters] where a variable N of equation 1 is a batch size” Examiner notes the fixed amount of memory available for execution corresponds to the 500 MB available on chip memory, that is shared by each layer in the superlayer that will utilize the memory for computation. The batch size N is a function of the Total Usage which is restricted by the available memory on chip. Thus batch size is based on the fixed amount of memory.) 

Regarding Claim 40
	Woo/Sekiyama/Weber teaches Claim 30
	Further Woo teaches, wherein determining the batch size based on a fixed amount of shared memory available for execution of the two or more stages of the cascaded neural network comprises determining the batch size (Examiner notes this was addressed in the rejection of claim 39) based on the maximum count (Col 9 line 60-62 “a total on-chip storage capacity associated with memory 102 and 104 may be limited to 20 storage units [maximum count]”) the size determined for each shared memory block allocations (Col 9-10 line 63-3 “because a working set of two batch elements processed by layer B requires 16 storage units 204, processing of a third batch element would require 24 units of storage unit 204 and, thus, exceed the 20 storage unit capacity [fixed amount of shared memory]. So, in this example, a neural network may only support a particular maximum working set size that includes two batch elements, when processing each batch element requires at least 8 units of storage.” Examiner notes that because the capacity is restricted to 20 storage units, a batch size of three cannot be supported; instead, a maximum batch size of two is determined.) and the count of inputs (Examiner notes that because the batch size is related to the count of inputs, the determining is based on the count of inputs as well)

Regarding Claim 50
	Woo/Sekiyama teaches Claim 47
	Woo/Sekiyama teaches, wherein determining a batch size per stage. (“maximum batch size that can be supported by on-chip memory resources can be determined based on a size of a working set. In particular, the maximum batch size supported by storage units 204 can be determined based, in part, on the largest working set of inputs and parameters that are processed by a given neural network layer.” Examiner notes that a maximum batch size is determined for the set of layers that represent a stage or superlayer.)
	Woo/Sekiyama does not explicitly teach, comprises determining a count of inputs for a stage of the two or more stages that lack interdependencies.
	Weber, however, when addressing issues related to searching a neural network for parallelism opportunities based on compute dependencies, teaches, comprises determining a count of inputs for a stage of the two or more stages that lack interdependencies. (Figure 5 and Section 3.1 ¶03 “The computation graph is the same but the operations are grouped according to the independent computation paths involving the normalization and non-linear operations, which are merged in the pooling layer. A stack, therefore, partitions independent computation paths in the DAG into parallelizable code blocks such that each such block’s intermediate data fits into the device caches.” Examiner notes that the computations that lack interdependencies are pushed to a stack; the size of the stack is proportional to the count of the inputs for a stage, i.e., some of the layers whose computations are independent of each other.)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a method/system for determining a number of computations, including inputs of a directed graph, that can be parallelized as a result of their lack of interdependency, as taught by Weber, into the disclosed invention of Woo/Sekiyama.
One of ordinary skill in the art would have been motivated to make this modification in order to “transparently accelerate neural network workloads by changing the default layer-by-layer processing to a depth-first approach, reducing the amount of data required by the computations and thus improving the performance of the available hardware caches” (Abstract Weber).


Conclusion
Prior art
Zhang et al. “Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks” used loop tiling and transformation to identify optimal configurations for best performance under a restricted bandwidth.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JOHNATHAN R GERMICK whose telephone number is (571)272-8363. The examiner can normally be reached on Monday-Friday 7:30 am – 4:00 pm (EST).
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at telephone number (571)272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://portal.uspto.gov/external/portal. Should you have questions about access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
	
/J.R.G./Examiner, Art Unit 2122
/KAKALI CHAKI/Supervisory Patent Examiner, Art Unit 2122