DETAILED ACTION: 
1.	This action is in response to the preliminary amendments filed 9 January 2020 for application 16/738038, filed on 9 January 2020. Currently, claims 1-17 and 22-24 are pending. Claims 18-21 and 25 have been canceled. All references in the IDS have been considered. It is noted that certified copies of the Korean applications (KR10-2019-0007583 and KR10-2019-0088529) to which the instant application claims priority were filed on 20 February 2020; however, an English translation of each certified copy is not currently on file as required.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
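Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.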
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are:
“At least one resource” in claim 24.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-10, 12-17, and 22-24 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Yufei Ma (“Hardware Acceleration of Deep Convolutional Neural Networks on FPGA”, PhD Dissertation, Arizona State University, December 2018, pp. 1-153), hereinafter referred to as Ma.

In regards to claim 1, Ma teaches a neural network system for processing a neural network model comprising an operation processing graph that comprises a plurality of operations, the neural network system comprising: an operation processor comprising an internal memory storing a first module input feature map, ([p. 43, Section 3.5.1, Figure 1.1, Figure 2.5, Figure 3.10, Figure 4.2], The coarse-grained dataflow is shown in Figure 3.11 at feature map row level for stride = 1 and stride = 2. The data flow in Figure 3.11(a) is the same as Figure 3.10, where more clock cycles of operation is shown after cycle 8. In Figure 3.11(b), the dataflow with stride = 2 and zero padding = 3 is shown, which follows the same pattern as the case with stride = 1. The buffer storage pattern is adjusted according to different stride and padding settings. Three rows of zeros are added to the buffer due to the north zero padding of 3. With stride = 2, every two rows of pixels are continuously distributed across Poy buffer banks. … Since the data movement within a register array or a feature map row is different for different settings of stride and zero padding, various BUF2PE data buses are needed for each dataflow, and the set of data buses are called data router., wherein a framework (operation processor) for an FPGA-based implementation of a neural network (e.g., CNN) represents its functional/processing operations using a DAG (an operation processing graph) including the processing of feature maps from one layer to the next in which the feature map output from a previous layer is stored in an internal memory (on-chip buffers/BRAM) for routing/processing on processing elements (MAC units with the operation processor) and wherein the general representation of data flow (e.g., Figure 4.2) is also an operation processing graph.)
wherein the operation processor is configured to: obtain a first branch output feature map by performing a first operation among the plurality of operations, based on the stored first module input feature map; ([p. 61, Section 4.3.1, Figure 1.1, Figure 2.5, Figure 3.10, Figure In conventional CNN algorithms, different layers are connected in sequence, which allows for a straightforward layer-by-layer serial computation. The recent CNN algorithms (e.g. ResNet He et al. (2016a)) are DAGs, with combinations of serial and parallel branches. A reconfigurable layer-by-layer execution schedule is designed to handle the different combinations of stacked layers and the DAG as shown in Figure 4.2. Therefore, the present mapping of a DAG onto an FPGA still results in a serial computation of the layers., wherein the data flow/processing flow in the neural network implementation framework spans a set of branches over different layers (serial or parallel) in the neural network topology (e.g., Figure 1.1) such that the feature map outputs from any particular preceding layer/processing node are passed through various memory elements but wherein the branches include, in a more general sense, the processing over successive sub-components within any given layer (batch norm, Eltwise, pooling, etc. – Figures 3.13, 4.2).)  and obtain a second branch output feature map by performing a second operation among the plurality of operations after the first operation is performed, based on the stored first module input feature map, ([p. 44, Section 3.5.1, Figure 1.1, Figure 2.5, Figure 3.10, Figure 3.11, Figure 4.2], After Nkx × Nky cycles, we complete one kernel window sliding (Loop-1) and move to the next input feature map with the same dataflow until the last one as shown in Figure 3.11. 
After Nkx × Nky × Nif cycles, both Loop-1 and Loop-2 are completed and we obtain P ox × P oy × P of final output pixels., wherein FPGA-based processing elements generate output at each computational step in the neural network processing (including the output of any layer or intermediate outputs in the processing within a given layer) according to the storage of the feature map elements in the on-chip/BRAM/internal memory buffer (e.g., Figures 3.10 and 3.11) such as received as the output from a previous layer (or previous processing component) and wherein this process thereby generates an output feature and wherein the internal memory maintains storage of the first module input feature map while the first operation is performed.  ([Figure 1.1, Figure 2.5, Figure 3.10, Figure 3.11, Figure 4.2],
wherein, as previously noted, the elements of the feature maps are stored into on-chip memory (internal memory/BRAM) for FPGA processing of those elements.)
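For illustration only, the claim 1 limitations mapped above (two operations performed in sequence on the same stored input feature map, with the input retained in internal memory) may be sketched as follows. This is a minimal sketch under stated assumptions: the buffer name, the two stand-in operations, and the pixel values are hypothetical and are not taken from Ma or from the instant application.

```python
# Illustrative sketch only; all names and values are hypothetical.

def scale_pixels(feature_map, weight):
    """Stand-in for the first operation: scale every pixel by one weight."""
    return [[pixel * weight for pixel in row] for row in feature_map]


def max_per_row(feature_map):
    """Stand-in for the second operation: keep the maximum pixel of each row."""
    return [[max(row)] for row in feature_map]


# "Internal memory" modeled as a dictionary of named buffers.
internal_memory = {"module_input": [[1, 2], [3, 4]]}

# The first operation reads the stored input; the input stays resident.
first_branch_output = scale_pixels(internal_memory["module_input"], 2)

# The second operation runs afterwards on the same stored input.
second_branch_output = max_per_row(internal_memory["module_input"])
```

Both stand-in operations build new lists, so the stored input is unchanged after the first operation and remains available to the second.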

In regards to claim 2, the rejection of claim 1 is incorporated and Ma further teaches the neural network system further comprising a post processor configured to partition the operation processing graph into a plurality of modules, wherein each of the plurality of modules comprises a plurality of branches configured to receive one input feature map via one node.
([p. 58, Section 4.1, p. 61, Section 4.3.1, Figure 1.1, Figure 2.5, Figure 3.10, Figure 3.11, Figure 4.2], The various dimensional parameters of the CNN algorithm and the accelerator design variables, e.g. loop unrolling and tiling sizes as shown in Figure 4.1 (described in detail in Section 4.2), can be tuned by the user to balance the performance and required hardware resources. Then, a layer-by-layer execution schedule (see Figure 4.2(a) and Figure 4.2(b)) is generated from the CNN graph representation. The execution schedule is translated into the global control logic on the FPGA, and it also determines the order of the reads and writes of certain kernel weights or pixels from different layers that are stored in external memory. The associated read and write addresses are generated and sorted to control the transactions between external and on-chip memories., A reconfigurable layer-by-layer execution schedule is designed to handle the different combinations of stacked layers and the DAG as shown in Figure 4.2. Therefore, the present mapping of a DAG onto an FPGA still results in a serial computation of the layers., wherein a (RTL) compiler (a component of the post processor) maps the DAG 

In regards to claim 3, the rejection of claim 2 is incorporated and Ma further teaches wherein the plurality of branches comprise a first branch comprising the first operation, and a second branch comprising the second operation, and wherein the operation processor is further configured to: obtain the first branch output feature map by performing the first operation comprised in the first branch, based on the stored first module input feature map; and obtain the second branch output feature map by performing the second operation comprised in the second branch, based on the stored first module input feature map.   ([p. 44, Section 3.5.1, Figure 1.1, Figure 2.5, Figure 3.10, Figure 3.11, Figure 4.2], After Nkx × Nky cycles, we complete one kernel window sliding (Loop-1) and move to the next input feature map with the same dataflow until the last one as shown in Figure 3.11. After Nkx × Nky × Nif cycles, both Loop-1 and Loop-2 are completed and we obtain P ox × P oy × P of final output pixels., wherein, FPGA-based processing elements generate output at each computational step in the neural network processing (including the output of any layer or intermediate outputs in the processing within a given layer) according to the storage of the feature map elements in the on-chip/BRAM buffer (e.g., Figures 3.10 and 3.11) such as received as the output from a previous layer (or previous processing component) and wherein this process thereby generates an output feature map that may undergo further processing such as in a 

In regards to claim 4, the rejection of claim 2 is incorporated and Ma further teaches wherein the post processor is further configured to obtain a processing order of the plurality of branches, based on memory areas for processing of each of the plurality of branches.  ([p. 58, Section 4.1, p. 61, Section 4.3.1, p. 63, Section 4.3.1, Figure 1.1, Figure 2.5, Figure 3.10, Figure 3.11, Figure 4.2], The various dimensional parameters of the CNN algorithm and the accelerator design variables, e.g. loop unrolling and tiling sizes as shown in Figure 4.1 (described in detail in Section 4.2), can be tuned by the user to balance the performance and required hardware resources. Then, a layer-by-layer execution schedule (see Figure 4.2(a) and Figure 4.2(b)) is generated from the CNN graph representation. The execution schedule is translated into the global control logic on the FPGA, and it also determines the order of the reads and writes of certain kernel weights or pixels from different layers that are stored in external memory. The associated read and write addresses are generated and sorted to control the transactions between external and on-chip memories., A reconfigurable layer-by-layer execution schedule is designed to handle the different combinations of stacked layers and the DAG as shown in Figure 4.2. Therefore, the present mapping of a DAG onto an FPGA still results in a serial computation of the layers., The example DAG shown in Figure 1.1 has six clusters, numbered 1 through 6. The Conv1 (1), Pooling (2) and FC (6) layers in Figure 1.1 are individual key-layers (i.e. clusters with only a key-layer) whereas cluster 5 has one key-layer (Conv4) and three affiliated-layers (Batchnorm, Eltwise and ReLu)…. The order of computation of the clusters is set before compilation, and the only rule is to ensure that all the predecessors of any key-layer is executed prior to that key-layer., wherein a (RTL) compiler (a component of the post processor) maps the DAG (CNN/ResNET graph representations) into a set of (FPGA) execution modules for implementing the data and execution flow across the various components/branches of the DAG (layer-to-layer processing, sub-component processing such as ReLU, Pool, etc.) and wherein the compiler also determines/optimizes/obtains an order for that data flow and corresponding processing steps (e.g., loop unrolling) and associated memory areas allocated for each corresponding process but wherein the compiler also (and alternatively) obtains the ordering of layer clusters with, for example, the topology shown in Figure 1.1.)
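For illustration only, the ordering rule quoted above (every predecessor of a key-layer is executed before that key-layer) corresponds to a topological ordering of the graph. The following is a minimal sketch, assuming a hypothetical six-cluster graph with one skip edge; the cluster names and edges are assumptions and are not a reproduction of Ma's Figure 1.1, and the standard-library sorter is swapped in for whatever scheduling Ma's compiler actually performs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical cluster graph: each key maps a cluster to the set of
# clusters whose outputs it consumes (its predecessors).
predecessors = {
    "conv1": set(),
    "pool": {"conv1"},
    "conv2": {"pool"},
    "conv3": {"conv2"},
    "conv4_eltwise": {"pool", "conv3"},  # joins the main path and the skip edge
    "fc": {"conv4_eltwise"},
}


def execution_order(graph):
    """Return an order in which every predecessor precedes its consumer."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(predecessors)
```

For this particular graph only one valid order exists, so the result is deterministic; a graph with independent parallel branches would admit several valid orders.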

In regards to claim 5, the rejection of claim 4 is incorporated and Ma further teaches wherein the plurality of branches comprises a first branch that is processed using a first memory area, and a second branch that is processing using a second memory area larger than the first memory area, and wherein the post processor is further configured to obtain the processing order of the plurality of branches such that the second branch is processed earlier than the first branch.  ([pp. 27-28, Section 3.2.2, Figure 3.6, Figure 3.11, Figure 4.4], A partial sum (psum) is the intermediate result of the inner product operation that needs to be accumulated over several cycles to obtain one final output data. Therefore, partial sums need to be stored in memory for the next few cycles and sometimes have to be moved between PEs…. If the loop tile cannot include all data for Loop-1 and Loop-2, partial sums from one tile need to be stored in on-chip or off-chip memory until it is consumed by another tile as in (9.6) – (9.9) inside Figure 3.6. 
In this case, the partial sums need to be stored in on-chip buffers ((9.6) inside Figure 3.6) or even in external memory ((9.7) inside Figure 3.6)., wherein, the neural network processing framework stores/accumulates partial sums in the on-chip memory (a processing memory area) while executing the data/processing flow of the neural network such that the storage of the partial sums decreases the unused internal memory (i.e., fixed internal memory is consumed) from one execution process/branch (a second branch) relative to a subsequent execution process/branch (first branch), wherein, in a more general sense, different memory allocations (e.g., for weights and feature maps) associated with different layers or deep neural network elements (e.g., Figure 1.1, Figure 4.2) result in differential unused memory space sizes resulting from the mapping of the associated topology onto the FPGA-processing flow (that do not exclude the recited differential memory areas), and wherein the accumulations of data in the output buffers during the processing of the on-chip data over successive cycles is being interpreted as also generally decreasing the available memory area as the output buffers are populated.)
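For illustration only, the claim 5 limitation (the branch requiring the larger memory area is processed earlier) may be sketched as a descending sort on memory area. This is a minimal sketch of the claim language, not of any procedure disclosed in Ma; the branch names and sizes are hypothetical.

```python
def order_branches_by_memory(branch_memory):
    """Return branch names sorted so that larger memory areas come first."""
    ranked = sorted(branch_memory.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _area in ranked]


# Hypothetical memory areas, in KiB, needed to process each branch.
branch_memory_kib = {"first_branch": 64, "second_branch": 256}

processing_order = order_branches_by_memory(branch_memory_kib)
```

Here the second branch, which needs the larger area, is placed ahead of the first branch.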

In regards to claim 6, the rejection of claim 2 is incorporated and Ma further teaches further comprising an external memory connected to the operation processor through direct memory access (DMA),  ([p. 66, Section 4.4.2], The DMA engine is used to communicate data between DRAM and on-chip BRAMs., wherein a DMA engine manages the interchange of data between external and internal memory.) wherein the plurality of modules comprises a first module and a second module, wherein the operation processing graph comprises a skip connection operation connected between a first operation node comprised in the first module and a second operation node comprised in the second module, and wherein the post processor is further configured to exclude the skip connection operation from the operation processing graph such that the skip connection operation is processed using the external memory.   ([p. 43, Section 3.5.1, p. 77, Section 4.5.4, p. 78, Section 4.5.5, Figure 3.11, Figure 4.2, Figure 4.3, Figure 4.4], Therefore, the BUF2PE bus in Figure 3.11(b) can be applied for conv1 in ResNet with kernel size = 7 × 7, stride = 2 and zero padding = 3., The Eltwise layer performs element-wise addition to connect two branches of layers in ResNet CNNs as shown in Figure 1.1…. Eltwise is performed after its previous layer in the same branch has stored all the results into the output buffers. Then, the pixels from the other branch are read from DRAM and written into the input pixel buffers. 
Subsequently, the pixels from the two branches are element-wise added by the adders and finally stored back into the output pixel buffers, as illustrated in Figure 4.9., If the inputs of one layer is from Concat, the compiler generates DMA descriptors that control DMA to read multiple layers of the Concat from different DRAM addresses as the inputs., wherein the neural network processing framework implements ResNets (neural networks with skip connections – Figure 1.1) such that different layers or cluster layers (different operation nodes) which do not include the skip connections are processed individually/separately with the result stored in the external memory for subsequent use in a concatenation operation over the externally stored data (i.e., once transferred by DMA into on-chip memory) that combines the results from the disparate layers/nodes/cluster layers (concat function as well as the Eltwise function).) 
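For illustration only, the claim 6 limitation of excluding a skip connection from the operation processing graph, so that it is handled through external memory, may be sketched as separating flagged edges from the remaining graph. This is a minimal sketch; the edge representation, the node names, and the `is_skip` flag are assumptions and do not appear in Ma or in the claims.

```python
def exclude_skip_connections(edges):
    """Split (source, destination, is_skip) edges into kept and excluded lists."""
    kept = [(src, dst) for src, dst, is_skip in edges if not is_skip]
    excluded = [(src, dst) for src, dst, is_skip in edges if is_skip]
    return kept, excluded


# Hypothetical residual block: conv_a feeds conv_b, and also skips ahead
# to the element-wise addition.
edges = [
    ("conv_a", "conv_b", False),
    ("conv_b", "eltwise", False),
    ("conv_a", "eltwise", True),  # skip connection
]

graph_edges, external_memory_edges = exclude_skip_connections(edges)
```

The edges left in `graph_edges` would be partitioned into modules, while each pair in `external_memory_edges` marks data to be written to and later read back from external memory.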

In regards to claim 7, the rejection of claim 1 is incorporated and Ma further teaches wherein the operation processing graph comprises a plurality of modules, and wherein the operation processor is further configured to: obtain a first module output feature map by performing a third operation comprised in a first module among the plurality of modules, based on the stored first module input feature map; store the obtained first module output feature map in the internal memory; and obtain a second module output feature map by performing a fourth operation comprised in a second module among the plurality of modules, based on the stored first module output feature map.  ([p. 44, Section 3.5.1, p. 78, Section 4.5.5,  Figure 1.1, Figure 3.9, Figure 4.2, Figure 4.4b, Figure 4.9], After Nkx × Nky cycles, we complete one kernel window sliding (Loop-1) and move to the next input feature map with the same dataflow until the last one as shown in Figure 3.11., 
If the inputs of one layer is from Concat, the compiler generates DMA descriptors that control DMA to read multiple layers of the Concat from different DRAM addresses as the inputs., wherein the neural network processing framework processes one input feature map at a time based upon the mapping of existing pixel/feature map data (i.e., including previously output feature maps such as from a previous layer) to the on-chip memory such that the execution of any operation (convolution, pooling, ReLU, etc) resulting from this mapping/transference of feature map inputs is the creation of an output feature map that is stored in either on-chip or external memory (but, if stored in external memory it is transferred to internal memory when required for subsequent processing) and wherein the concatenation/eltwise operations are another example of generating an output feature map based on previous stored input/output feature maps.) 

In regards to claim 8, the rejection of claim 1 is incorporated and Ma further teaches wherein the operation processor is further configured to: obtain a second feature map by performing a third operation among the plurality of operations, based on a first feature map; store the obtained second feature map in the internal memory; and obtain a third feature map by performing a fourth operation among the plurality of operations, based on the stored second feature map stored.  ([p. 44, Section 3.5.1, p. 78, Section 4.5.5,  Figure 1.1, Figure 3.9, Figure 4.2, Figure 4.4b, Figure 4.9], After Nkx × Nky cycles, we complete one kernel window sliding (Loop-1) and move to the next input feature map with the same dataflow until the last one as shown in Figure 3.11., If the inputs of one layer is from Concat, the compiler generates DMA descriptors that control DMA to read multiple layers of the Concat from different DRAM addresses as the inputs., wherein the neural network processing framework processes one input feature map at a time based upon the mapping of existing pixel/feature map data (i.e., including previously output feature maps such as from a previous layer) to the on-chip memory such that the execution of any operation (convolution, pooling, ReLU, etc) resulting from this mapping/transference of feature map inputs is the creation of an output/second feature map that is stored in either on-chip/internal or external memory (but, if stored in external memory it is transferred to internal memory for subsequent processing) and wherein the concatenation/eltwise operations are another example of generating a new/output feature map based on previous stored input/output feature maps.) 

In regards to claim 9, the rejection of claim 1 is incorporated and Ma further teaches wherein the operation processing graph comprises a plurality of branches sharing a first input node and a first output node, wherein the plurality of branches comprise a last branch that is processed last among the plurality of branches, and wherein the operation processor is further configured to: obtain a third branch output feature map by performing a third operation comprised in the last branch, based on the stored first module input feature map; and overwrite, with the obtained third branch output feature map, an area of the internal memory, in which the first module input feature map is stored.  ([p. 44, Section 3.5.1, p. 78, Section 4.5.5,  Figure 1.1, Figure 3.13, Figure 4.2, Figure 4.4, Figure 4.7], After Nkx × Nky cycles, we complete one kernel window sliding (Loop-1) and move to the next input feature map with the same dataflow until the last one as shown in Figure 3.11., If the inputs of one layer is from Concat, the compiler generates DMA descriptors that control DMA to read multiple layers of the Concat from different DRAM addresses as the inputs., wherein the neural network processing framework processes one input feature map at a time based upon the mapping of existing pixel/feature map data (i.e., including previously output feature maps such as from a previous layer) to the on-chip memory such that the execution of any operation (convolution, pooling, ReLU, etc) resulting from this mapping/transference of feature map inputs is the creation of an output feature map (including processing by a last branch such as concatenation, pooling, batchnorm, or eltwise that is associated with the formation of an output feature map that is used for subsequent processing) that is stored in either on-chip or external memory (but, if stored in external memory it is transferred to internal memory when needed for subsequent processing) such that the on-chip memory that contains the input/predecessor feature map has been overwritten either during the course of the generation of the output feature map (e.g., Figure 4.4 shows that the addresses for both the output and the input pixel buffers start from address 0) or during the course of reading and writing a feature map to and from DRAM (i.e., overwritten but with an intermediate step involving external memory access).)
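For illustration only, the claim 9 limitation (the output of the last-processed branch overwrites the internal memory area holding the module input) may be sketched as an in-place write into the input buffer. This is a minimal sketch; the buffer name and the doubling operation are hypothetical stand-ins and are not taken from Ma.

```python
# Illustrative sketch only; buffer names and the operation are hypothetical.

def last_branch_operation(feature_map):
    """Stand-in for the operation of the branch that is processed last."""
    return [2 * pixel for pixel in feature_map]


internal_memory = {"module_input": [1, 2, 3, 4]}
input_area = internal_memory["module_input"]

# No later branch needs the input, so its area is reused for the output:
# slice assignment replaces the contents without allocating a new buffer.
input_area[:] = last_branch_operation(input_area)
```

After the assignment, the same list object that held the input now holds the last branch's output.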

In regards to claim 10, the rejection of claim 1 is incorporated and Ma further teaches wherein the operation processing graph comprises a plurality of branches sharing a first input node and a first output node, wherein each of the plurality of branches receives the first module input feature map, wherein the plurality of branches comprises a first branch comprising the first operation, and a second branch comprising the second operation, and wherein the operation processor is further configured to: obtain the first branch output feature map by performing the first operation comprised in the first branch, using a first remaining area of the internal memory that excludes a first area of the internal memory, in which the first module input feature map is stored; and store the obtained first branch output feature map in a second area of the internal memory; obtain the second branch output feature map by performing the second operation comprised in the second branch, using a second remaining area of the internal memory that excludes the first area and the second area in which the first branch output feature map is stored; and store the obtained second branch output feature map in a third area of the internal memory.  ([p. 46, Section 3.5.2, p. 72, Section 4.5.1, p. 77, Section 4.5.4, Figure 2.4, Figure 3.12, Figure 4.2, Figure 4.3, Figure 5.5], We further serialize the P ox × P of parallel outputs to be P ox× #OUTBUF using multiplexers with neighboring output feature maps stacked in one output buffer, as illustrated in Figure 3.12., There are P ox × P oy × P of parallel outputs from the MAC units, and they are serialized into P oy consecutive clock cycles to reduce the required number of bias adders and the data width of output buffers.  
The P ox × P of outputs are further serialized to be P ox× #OUTBUF using multiplexers with output feature maps stacked in the output buffer as shown in Figure 4.7., Eltwise is performed after its previous layer in the same branch has stored all the results into the output buffers. Then, the pixels from the other branch are read from DRAM and written into the input pixel buffers. Subsequently, the pixels from the two branches are element-wise added by the adders and finally stored back into the output pixel buffers, as illustrated in Figure 4.9. The output buffers are implemented as dual-port RAMs so that the adder results can be written back to the output buffers at their addends’ original locations without using additional buffers., wherein the data flow/processing flow in the neural network includes the processing of an input feature map across multiple modules (either through successive applications of CNN functions applied according to particular pixel subsets of a feature map according to the memory-optimized partition of those functions across the set of processing elements or across successive layers/layer clusters such as for Resnets) such that a distinct output buffer or part of an output buffer is allocated to each successive branch/processing cycle (e.g., Figures 3.12 and 5.5 with the overall processing area being interpreted as the total amount of available/unused on-chip memory) and also (and alternatively) for Eltwise processing in which an output buffer (second area in internal memory) contains an output feature map for a current layer while an input buffer contains a previous (input) feature map (first area) to enable the Eltwise adder operation (i.e., this operation is based upon the available on-chip memory accessed by the adders in a dual buffer memory configuration) in which a third area of internal memory, in this case, is being interpreted as corresponding to any subsequent access of the results of the Eltwise operation (such as with any 
intermediate DRAM write/read step including for forming inputs/populating input buffers into succeeding layers or ResNet blocks – Figure 2.4).)
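For illustration only, the Eltwise behavior quoted above (results written back to the output buffers at the addends' original locations, without additional buffers) may be sketched as an in-place element-wise addition. This is a minimal sketch; plain Python lists stand in for the input and output pixel buffers, and the values are hypothetical.

```python
def eltwise_add_in_place(output_buffer, input_buffer):
    """Add each input pixel to the output pixel stored at the same address."""
    if len(output_buffer) != len(input_buffer):
        raise ValueError("buffers must hold the same number of pixels")
    for address, pixel in enumerate(input_buffer):
        output_buffer[address] += pixel  # written back at the addend's location
    return output_buffer


# Hypothetical contents: the current branch's results sit in the output
# buffer; the other branch's pixels were loaded into the input buffer.
output_pixels = [1, 2, 3]
input_pixels = [10, 20, 30]

result = eltwise_add_in_place(output_pixels, input_pixels)
```

The function returns the same buffer object it was given, so no third buffer is allocated for the sum.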

Claim 12 is also rejected because it is merely a method implementation of the same subject matter as claim 1, which is taught by Ma.

Claim 13/12 is also rejected because it is merely a method implementation of the same subject matter as claim 2/1, which is taught by Ma.



Claim 15/14 is also rejected because it is merely a method implementation of the same subject matter as claim 5/4, which is taught by Ma.

In regards to claim 16, the rejection of claim 13 is incorporated and Ma further teaches wherein the plurality of modules comprises a first module and a second module, wherein the operation processing graph comprises a skip connection operation connected between a first operation node comprised in the first module and a second operation node comprised in the second module, wherein the method further comprises excluding the skip connection operation from the operation processing graph, and wherein the partitioning of the operation processing graph comprises partitioning the operation processing graph, from which the skip connection operation is excluded, into the plurality of modules.  ([p. 43, Section 3.5.1, p. 77, Section 4.5.4, p. 78, Section 4.5.5, Figure 3.11, Figure 4.2, Figure 4.3, Figure 4.4], Therefore, the BUF2PE bus in Figure 3.11(b) can be applied for conv1 in ResNet with kernel size = 7 × 7, stride = 2 and zero padding = 3., The Eltwise layer performs element-wise addition to connect two branches of layers in ResNet CNNs as shown in Figure 1.1…. Eltwise is performed after its previous layer in the same branch has stored all the results into the output buffers. Then, the pixels from the other branch are read from DRAM and written into the input pixel buffers.
Subsequently, the pixels from the two branches are element-wise added by the adders and finally stored back into the output pixel buffers, as illustrated in Figure 4.9., If the inputs of one layer is from Concat, the compiler generates DMA descriptors that control DMA to read multiple layers of the Concat from different DRAM addresses as the inputs., wherein the neural network processing framework implements ResNets (neural networks with skip connections – Figure 1.1) such that different layers or cluster layers (different operation nodes) which do not include the skip connections are processed individually/separately with the result stored in the external memory for subsequent use in a concatenation operation over the externally stored data (i.e., once transferred by DMA into on-chip memory) that combines the results from the disparate layers/nodes/cluster layers (concat function as well as the Eltwise function).) 

In regards to claim 17, the rejection of claim 16 is incorporated and Ma further teaches further comprising processing the skip connection operation via an external memory connected to the operation processor through direct memory access (DMA).  ([p. 66, Section 4.4.2], The DMA engine is used to communicate data between DRAM and on-chip BRAMs., wherein a DMA engine manages the interchange of data between external and internal memory with the ResNet layers processed using the DMA as noted previously.)

In regards to claim 22, the rejection of claim 12 is incorporated and Ma further teaches wherein the operation processing graph comprises a first module, and wherein the method further comprises: obtaining an amount of memory for performing first operations comprised in the first module, among the plurality of operations; comparing the obtained amount of memory with an amount of free memory of the internal memory; and establishing a use policy of the internal memory, based on the amount of memory being compared with the amount of free memory.  ([pp. 20-27, Section 3.2.1, p. 33, Section 3.2.5, p. 37, Section 3.4.4, Figure 3.6, Figure 3.7], On-chip memory of FPGAs is not always large enough to store the entire data of deep CNN algorithms. Therefore, it is reasonable to use denser external DRAMs to store the weights and the intermediate pixel results of all layers… The number of external memory accesses primarily relies on the size of on-chip buffers, which is determined by the loop tiling variables T*., Therefore, if the tile size or the on-chip buffer can fully cover either all input pixels or all weights of one layer, the minimum DRAM access can be achieved as (10.8) inside Figure 3.7. By computing Loop-3 first, weights stored in buffer are reused and #DRAM wt is reduced as in (10.1) and (10.5) inside Figure 3.7. Similarly, by computing Loop-4 first, pixels can be reused to reduce #DRAM px as in (10.3) and (10.6) inside Figure 3.7. However, computing Loop-3 or Loop-4 first may postpone the computation of Loop-1 or Loop-2, which would lead to a large number of partial sums., Both pixel and weight buffers need to be large enough to cover the data in one tiling block for all the convolution layers. 
This is expressed as:…, wherein the data flow/processing flow is configured/optimized based on an analysis of the memory requirements for processing components (i.e., any FPGA-based sub-process computation module) such that given constraints on the on-chip memory (input/output) buffer sizes (interpreted as being free memory/memory available for each respective processing cycle), an execution protocol is derived that optimizes the processing according to those memory limitations (e.g., Figures 3.6 and 3.7 for minimizing external memory accesses).)   

In regards to claim 23, the rejection of claim 12 is incorporated and Ma further teaches obtaining an output feature map by performing at least one operation among the plurality of operations, based on an input feature map; identifying whether to store the obtained output feature map in the internal memory, based on an amount of data of the obtained output feature map and an amount of free memory of the internal memory; based on the output feature map being identified to be stored in the internal memory, storing the output feature map in the internal memory; and based on the output feature map being identified to be not stored in the internal memory, storing the output feature map in an external memory connected to the operation processor through direct memory access (DMA).  ([pp. 20-27, Section 3.2.1, p. 33, Section 3.2.5, p. 37, Section 3.4.4, Figure 3.6, Figure 3.7], On-chip memory of FPGAs is not always large enough to store the entire data of deep CNN algorithms. Therefore, it is reasonable to use denser external DRAMs to store the weights and the intermediate pixel results of all layers… The number of external memory accesses primarily relies on the size of on-chip buffers, which is determined by the loop tiling variables T*., Therefore, if the tile size or the on-chip buffer can fully cover either all input pixels or all weights of one layer, the minimum DRAM access can be achieved as (10.8) inside Figure 3.7. By computing Loop-3 first, weights stored in buffer are reused and #DRAM wt is reduced as in (10.1) and (10.5) inside Figure 3.7. Similarly, by computing Loop-4 first, pixels can be reused to reduce #DRAM px as in (10.3) and (10.6) inside Figure 3.7. However, computing Loop-3 or Loop-4 first may postpone the computation of Loop-1 or Loop-2, which would lead to a large number of partial sums., Both pixel and weight buffers need to be large enough to cover the data in one tiling block for all the convolution layers. 
This is expressed as:…, wherein the data flow/processing flow is configured/optimized based on an analysis of the memory requirements for processing components (i.e., any FPGA-based sub-process computation module) such that given constraints on the on-chip memory (input/output) buffer sizes (interpreted as being free memory/memory available for each respective processing cycle), an execution protocol is derived that optimizes the processing according to those memory limitations (e.g., Figures 3.6 and 3.7 for minimizing external memory accesses) so that this optimization analysis determines 

In regards to claim 24, Ma teaches A neural network device for processing an operation processing graph comprising a plurality of operations, the neural network device comprising: an internal memory storing a first feature map; and at least one resource configured to: obtain a second feature map by performing a first operation among the plurality of operations, based on the stored first feature map; store the obtained second feature map in the internal memory; and obtain a third feature map by performing a second operation among the plurality of operations, based on the stored second feature map.  ([pp. 43-44, Section 3.5.1, p. 77, Section 4.5.4, p. 78, Section 4.5.5, Figure 1.1, Figure 2.5, Figure 3.10, Figure 4.2],The coarse-grained dataflow is shown in Figure 3.11 at feature map row level for stride = 1 and stride = 2. The data flow in Figure 3.11(a) is the same as Figure 3.10, where more clock cycles of operation is shown after cycle 8. In Figure 3.11(b), the dataflow with stride = 2 and zero padding = 3 is shown, which follows the same pattern as the case with stride = 1. The buffer storage pattern is adjusted according to different stride and padding settings. … Since the data movement within a register array or a feature map row is different for different settings of stride and zero padding, various BUF2PE data buses are needed for each dataflow, and the set of data buses are called data router…. After Nkx × Nky cycles, we complete one kernel window sliding (Loop-1) and move to the next input feature map with the same dataflow until the last one as shown in Figure 3.11., Eltwise is performed after its previous layer in the same branch has stored all the results into the output buffers. Then, the pixels from the other branch are read from DRAM and written into the input pixel buffers. Subsequently, the pixels from the two branches are element-wise added by the adders and finally stored back into the output pixel buffers, as illustrated in Figure 4.9. 
The output buffers are implemented as dual-port RAMs so that the adder results can be written back to the output buffers at their addends’ original locations without using additional buffers., If the inputs of one layer is from Concat, the compiler generates DMA descriptors that control DMA to read multiple layers of the Concat from different DRAM addresses as the inputs., wherein a framework (operation processor) for an FPGA-based implementation of a neural network (e.g., CNN) represents its functional/processing operations using a DAG (an operation processing graph) including the (convolutional and other) processing of feature maps from one layer to the next in which the feature map output from a previous layer is stored in an internal memory (on-chip buffers/BRAM) for routing/processing on processing elements (MAC units with the operation processor) to generate at each cycle a new/output (second) feature map that is stored in internal memory (output buffer) such that the new/output/second feature map is used for additional processing without any intervening movement to and from external memory (e.g., ReLU) as with or involving a movement to and from external memory (e.g., concat or Eltwise processing for ResNets).) As noted above the “at least one resource configured to” in the claims is being interpreted as a generic placeholder without the recitation of sufficient accompanying structure to perform the function; a review of the specification shows that the following appears to be the corresponding structure described in the specification: “[0032] The operation processor 200 may include an operation resource including various operation processing devices such as a central processing unit (CPU), a graphics processing unit (GPU), an application processor (AP), a digital signal processor (DSP), a field-programmable gate array (FPGA), a neural network


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Ma in view of Aimar et al. (“NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps”, https://arxiv.org/pdf/1706.01406.pdf, arXiv:1706.01406v2 [cs.CV], 6 Mar 2018, pp. 1-13), hereinafter referred to as Aimar.

In regards to claim 11, the rejection of claim 1 is incorporated and Ma further teaches further comprising dynamic random access memory (DRAM) connected to the operation processor through direct memory access (DMA), wherein the internal memory comprises static random access memory (SRAM).  ([pp. 17-18, Section 2.3, Figure 2.5], For them, the block memory (BRAM) on the FPGA chip, which is normally smaller than 8 MByte, is insufficient to store all the data, requiring gigabytes of external off-chip memory (DRAM). Therefore, a typical CNN accelerator consists of three levels of hierarchy: 1) external memory, 2) on-chip buffers, and 3) registers and processing engines (PEs) as shown in Figure 2.5. , wherein the neural network processing framework uses a DMA engine to transfer data between external DRAM memory and on-chip/internal block memory (BRAM) residing on the FPGA.)
However, Ma does not explicitly disclose static … (SRAM). Ma teaches the use of BRAM rather than, explicitly, SRAM.
However, Aimar, in the analogous environment of configuring neural network operations for processing on FPGAs, teaches further comprising dynamic random access memory (DRAM) connected to the operation processor through direct memory access (DMA), wherein the internal memory comprises static random access memory (SRAM) ([Abstract, pp. 2-3, Section III, p. 4, Section IIIB, Figure 3], We propose a flexible and efficient CNN accelerator architecture called NullHop that implements SOA CNNs useful for low-power and low-latency application scenarios., The input feature maps and the kernel values for the current convolutional layer are stored in two independent SRAM blocks…. The output feature maps produced by the current layer are streamed off-chip to the external memory. They are then streamed back to the accelerator SRAM when the accelerator has finished processing the current layer., The module contains multiple SRAM banks and can start decoding the data from these banks, while the input feature maps are still being loaded. The IDP maintains a pointer to the beginning of each row of the image stored inside the SRAM and uses these row starting addresses to decode the pixels in a sequential manner…, wherein the data is exchanged (via DMA) between SRAM on-chip memory and DRAM external memory in which the SRAM data includes feature maps transferred from DRAM, processed to form an output (including an output in the form of a feature map) and sent into DRAM such that the SRAM includes the data necessary to process a current layer with the neural network processing/accelerator framework configured to process the data through the successive layers of the CNN.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Ma to incorporate the teachings of Aimar for the processing and data flow of a neural network to comprise dynamic random access memory (DRAM) connected to the operation processor through direct memory access (DMA) in which the internal memory comprises static random access memory (SRAM). The modification would have been obvious because one of ordinary skill would have been motivated to achieve superior throughput and power performance efficiency in a flexible FPGA-based implementation of a CNN architecture through optimized allocation of resources over DRAM and SRAM memories particularly when the feature maps are sparse (Aimar, [Abstract, p. 11, Section VIIB, Table XI]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Mingyu Gao (“Scalable Near-Data Processing Systems for Data-Intensive Applications”, PhD Thesis, Stanford University, 2018, pp. 1-182) teaches an FPGA-based framework for configuring 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT LEWIS KULP whose telephone number is (571)272-7983. The examiner can normally be reached M, Th, F 8-5:30; Tu 8-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ROBERT LEWIS KULP/Examiner, Art Unit 2124
/BRIAN M SMITH/Primary Examiner, Art Unit 2122