DETAILED ACTION
1.	This action is in response to amendments filed 20 June 2022 for application 16/691130, filed 21 November 2019. Currently, claims 1-3, 5-8, and 10-11 are pending.  Claims 4 and 9 have been canceled. All references in the IDS have been reviewed. 

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed 20 June 2022 have been fully considered but they are not persuasive. 

The Applicants Specifically Argue

Claims 1-11 are rejected under 35 U.S.C. 101 on the grounds that the claims are directed to an abstract idea, and because the claims as a whole, considering all claim elements both individually and in combination, do not amount to significantly more than the abstract idea. In an effort to overcome the above rejection, …Amended claim 1 recites limitations that cannot be interpreted as being able to be performed in the human mind, or with pen and paper.…Moreover, even if one contends that amended claim 1 could somehow be interpreted as being directed to an abstract idea, amended claim 1 recites significantly more than an abstract idea, and is integrated into a practical application (see M.P.E.P. 2106.04(d)). Specifically, the amended claim 1 recites "performing a forward calculation of a neural network on global data to obtain output data of the forward calculation; storing the output data of the forward calculation in a global memory unit, wherein intermediate data is not stored in the global memory unit; performing [[a]]the forward calculation of [[a]]the neural network on the global data again, to obtain the intermediate data for a reverse calculation of the neural network; storing the intermediate data in a buffer unit; reading the intermediate data from the buffer unit; and performing the reverse calculation of the neural network on the intermediate data to obtain a result of the reverse calculation". …Although the number of calculations is increased, the intermediate data are not required to be stored in a global memory in the forward calculation. Since the amount of the global data is smaller than the amount of the intermediate data of the forward calculation, the number of reading the global memory can be reduced in the reverse calculation, thereby reducing the computational time cost, and increasing the data processing speed (see paragraph [0056] of the specification as filed). 
Amended claim 1 therefore recites "significantly more" than an abstract idea, and is integrated into a practical application, under Step 2B of the test for subject matter eligibility (see M.P.E.P. 2106.04(d)). 

 Examiner Response
The Examiner respectfully disagrees. The assertion that the Examiner found the claims to be directed to a judicial exception in the mental processes group is incorrect; as explained clearly in the 20 April 2022 NOFA, the judicial exception of relevance in the rejection under 35 USC 101 is associated with mathematical calculations/concepts. As set forth in the 20 April 2022 NOFA and maintained in the current Office Action, the amended limitations still do not integrate the judicial exception into a practical application. Specifically, relative to the independent claims, the forward and reverse calculations are mathematical steps. In addition, the claims (at Step 2A, Prong 2) recite the selective retrieval and storing of data in various types of memory, which is a data gathering step. The mathematical steps and the data gathering are performed using generic computer elements recited at a high level of generality, which amounts to no more than mere instructions to apply the exception using a generic computer component. This extra-solution activity does not amount to significantly more than the judicial exception (Step 2B) because the mathematical steps are performed by generic computer components and the data gathering is acknowledged to be well-understood, routine, conventional (WURC) activity (see, e.g., court recognized WURC examples in MPEP 2106.05(d)(II)(i)). Similarly, the neural network that performs the calculations is recited at a high level of generality that simply links to a field of use. The claim thus recites computing components only at a high level of generality such that it amounts to no more than mere instructions to apply the exception using generic computer components. It is further noted that the storing of data generated from neural network forward calculations is well-known and understood (for example, see Rhu et al. (“vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design”, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1-13), in particular with respect to ([p. 2, Section 1], To achieve this goal, vDNN exploits the data dependencies of allocated data structures, particularly the intermediate feature maps that account for the majority of memory usage (Section II-C), and either releases or moves these intermediate data between GPU and CPU memory. Specifically, vDNN either 1) aggressively releases these feature maps from the GPU memory if no further reuse exists, or 2) offloads (and later prefetches) to (from) CPU memory if further reuse does exist but is not immediately required.), and also see Wang et al. (“Accelerating Recurrent Neural Networks: A Memory-Efficient Approach”, IEEE Transactions On Very Large Scale Integration (VLSI) Systems, VOL. 25, NO. 10, October 2017, pp. 2763-2775, Algorithm 1, Figure 9), in particular with respect to ([pp. 2763-2764, Section 1], Moreover, RNNs have extensive application prospects when augmented with an external memory [8]–[10]. However, a typical RNN architecture usually requires large memory space and frequent data exchanges, making it hard to be implemented on embedded devices…With the hybrid compression method and the scalable well-optimized hardware architecture, the implementation results demonstrate that the proposed design has higher flexibility and hardware efficiency compared with existing RNN accelerators.)).
The Applicant Further Argues:
For at least the following reasons, Applicant respectfully submits that amended claim 1 is not anticipated by Rhu. As disclosed in Rhu, as discussed in Section II-C, deep networks have to keep track of a large number of the intermediate feature maps (Xs) that are extracted during forward propagation. Once a given layer(n)'s forward computation is complete, however, layer(n)'s X is not reused until the GPU comes back to the same layer(n)'s corresponding backward computation. … Further, as noted in the present specification, during the reverse calculation, the intermediate data required for the reverse calculation and so on are read from the global memory to directly perform the reverse calculation. …At least for the above reasons, Rhu does not disclose or teach at least the features (i) "wherein intermediate data is not stored in the global memory"; and (ii) "performing [[a]]the forward calculation of [[a]]the neural network on the global data again, to obtain the intermediate data for a reverse calculation of the neural network; storing the intermediate data in a buffer unit; reading the intermediate data from the buffer unit" as recited in amended claim 1. 

Examiner Response
Applicant’s arguments directed to Rhu with respect to claims 1-3, 5-8, and 10-11 have been considered but are moot because the new ground of rejection, made in view of Gruslys and necessitated by the amendments, does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.



Claims 1-3, 5-8, and 10-11 are rejected under 35 U.S.C. 101 because the claims are directed to an abstract idea, and because the claims as a whole, considering all claim elements both individually and in combination, do not amount to significantly more than the abstract idea, see Alice Corporation Pty. Ltd. v. CLS Bank International, et al, 573 U.S. (2014). In determining whether the claims are subject matter eligible, the Examiner applies the 2019 USPTO Patent Eligibility Guidelines. (2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50, Jan. 7, 2019.)
Step 1: Is the claim to a process, machine, manufacture, or composition of matter? Yes: claim 1 recites a method, which is a process. Claims 6 and 11 recite a machine/device/system and a product, respectively.
Step 2A, prong one: Does claim 1 recite an abstract idea, law of nature or natural phenomenon? Yes: the limitations of “performing a forward calculation of a … on global data to obtain output data of the forward calculation,” “performing a forward calculation of a … on global data again to obtain intermediate data for a reverse calculation of the …”, and “performing the reverse calculation of the … on the intermediate data to obtain a result of the reverse calculation”, as drafted, are mathematical steps of performing a first (forward) mathematical calculation on data to generate results/data that are used in a second (reverse) mathematical calculation. Under the broadest reasonable interpretation, the claim recites the judicial exception of an abstract idea in the mathematical concepts group. Therefore, claim 1 recites an abstract idea.
Step 2A, prong two: Does the claim recite additional elements that integrate the judicial exception into a practical application? No: the judicial exception is not integrated into a practical application. The claim recites “circuits for implementing functions” and a “global memory unit”, which are generic computer elements, as well as the functionality “intermediate data is not stored in the global memory unit”, “storing the output data in a global memory unit”, “storing the intermediate data in a buffer unit”, and “reading the intermediate data from the buffer unit”, which are functions of storing and reading data that are mere data gathering steps. However, the computer elements that perform those functions and the mathematical steps are recited at a high level of generality that does not impose a meaningful limitation on the judicial exception. The computer (including the buffer unit) is recited at such a high level of generality that it amounts to no more than mere instructions to apply the exception using a generic computer component.
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception? No: the recitation in the preamble is insufficient to transform a judicial exception to a patentable invention because the preamble elements are recited at a high level of generality that simply links to a field of use, see MPEP 2106.05(h). Similarly, the neural network that performs the calculations is recited at a high level of generality that simply links to a field of use, and the claimed extra-solution activity of data gathering is acknowledged to be well-understood, routine, conventional (WURC) activity (see, e.g., court recognized WURC examples in MPEP 2106.05(d)(II)(i)). The claim thus recites computing components only at a high level of generality such that it amounts to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. It is further noted that the storing of data generated from neural network forward calculations is well-known and understood (for example, see Rhu et al. (“vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design”, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1-13), in particular with respect to ([p. 2, Section 1], To achieve this goal, vDNN exploits the data dependencies of allocated data structures, particularly the intermediate feature maps that account for the majority of memory usage (Section II-C), and either releases or moves these intermediate data between GPU and CPU memory. Specifically, vDNN either 1) aggressively releases these feature maps from the GPU memory if no further reuse exists, or 2) offloads (and later prefetches) to (from) CPU memory if further reuse does exist but is not immediately required.), and also see Wang et al. (“Accelerating Recurrent Neural Networks: A Memory-Efficient Approach”, IEEE Transactions On Very Large Scale Integration (VLSI) Systems, VOL. 25, NO. 10, October 2017, pp. 2763-2775, Algorithm 1, Figure 9), in particular with respect to ([pp. 2763-2764, Section 1], Moreover, RNNs have extensive application prospects when augmented with an external memory [8]–[10]. However, a typical RNN architecture usually requires large memory space and frequent data exchanges, making it hard to be implemented on embedded devices…With the hybrid compression method and the scalable well-optimized hardware architecture, the implementation results demonstrate that the proposed design has higher flexibility and hardware efficiency compared with existing RNN accelerators.)).
Taken alone, the additional elements do not amount to significantly more than the above-identified judicial exception (the abstract idea). Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements taken individually. There is no indication that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide conventional computer implementation.
For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under §101. This rejection applies equally to independent claims 6 and 11, which recite a system/device and a computer product, respectively. It is noted that claims 6 and 11 recite further additional elements, in the following limitations, that are insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g): processors (claim 6), a storage device to store executable programs (claim 6), and a non-transitory computer readable storage medium comprising instructions that are executed by processors (claim 11).
In addition, claims 2 and 7 each recite additional elements to be addressed at Step 2A, Prong 2 and at Step 2B as follows. Claim 2 and claim 7 each recite the additional elements of “storing the result of the reverse calculation”, which is a data gathering operation, and “a global memory unit”, which is a generic computing component. The function of storing the data in a global memory unit is recited at a high level of generality and is no more than mere instructions to apply the exception using a generic computer and, thereby, does not impose a meaningful limit on the judicial exception. These elements are insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity (generic computer system, processing resources).
In addition, claims 3 and 8 each recite additional elements to be addressed at Step 2A, Prong 2 and at Step 2B as follows. Claim 3 and claim 8 each recite the additional elements of wherein the “neural network is a recurrent neural network (RNN)”, which is recited at a high level of generality that simply links to a field of use, and “output data of a previous forward calculation, a weight of the output data of the previous forward calculation, input data of the forward calculation, and a weight of the input data of the forward calculation”, which are further details of the data handled in the data gathering operation and therefore do not impose a meaningful limit on the judicial exception. It is further noted that the storing of data generated from RNN forward calculations is well-known and understood (for example, see Wang et al. (“Accelerating Recurrent Neural Networks: A Memory-Efficient Approach”, IEEE Transactions On Very Large Scale Integration (VLSI) Systems, VOL. 25, NO. 10, October 2017, pp. 2763-2775, Algorithm 1, Figure 9), in particular with respect to ([pp. 2763-2764, Section 1], Moreover, RNNs have extensive application prospects when augmented with an external memory [8]–[10]. However, a typical RNN architecture usually requires large memory space and frequent data exchanges, making it hard to be implemented on embedded devices…With the hybrid compression method and the scalable well-optimized hardware architecture, the implementation results demonstrate that the proposed design has higher flexibility and hardware efficiency compared with existing RNN accelerators.), and, also for example, see Diamos et al. (“Persistent RNNs: Stashing Recurrent Weights On-Chip”, Proceedings of the 33rd International Conference on Machine Learning, Vol. 48, 2016. pp. 1-10) with respect to ([p. 1, Section 1, pp. 2-3, Section 3.1, p. 3, Section 4.1, p. 4, Section 4.2], We exploit the largest source of on-chip memory on the GPU—the collective register files of 6144 hardware thread contexts on a TitanX GPU—to cache the RNN parameters and reuse them over multiple timesteps during training., At each level, it describes a collection of processor cores and the associated on-chip memory/cache capacities with four parameters (computational bandwidth, memory capacity, memory bandwidth, and memory latency). We attack the cost of inter-processor synchronization with an optimized assembly level barrier implementation, demonstrating that such barriers implemented in software can reduce latency by approximately 10x compared to relying on repeated kernel launches., Synchronization between GPU processors cores is typically achieved implicitly between dependent kernel calls in both CUDA and OpenCL development frameworks. However, this mechanism for synchronization between timesteps requires launching a new kernel that forces the weights to be reloaded from off-chip memory.)). These elements are insufficient to transform a judicial exception to a patentable invention because the recited elements are considered insignificant extra-solution activity (generic computer system, processing resources).
In addition, claims 5 and 10 each recite additional elements to be addressed at Step 2A, Prong 2 and at Step 2B as follows. Claim 5 and claim 10 each recite the additional elements of “the buffer unit is a register or a cache”, which are further details of the data gathering operation, in which each of the “buffer unit”, “register”, and “cache” is a generic computer component recited at a high level of generality that is no more than mere instructions to apply the exception using a generic computer and, thereby, does not impose a meaningful limit on the judicial exception. It is further noted that the storing of data generated from neural network forward calculations is well-known and understood (for example, see Diamos et al. (“Persistent RNNs: Stashing Recurrent Weights On-Chip”, Proceedings of the 33rd International Conference on Machine Learning, Vol. 48, 2016. pp. 1-10) with respect to ([p. 1, Section 1, pp. 2-3, Section 3.1, p. 3, Section 4.1, p. 4, Section 4.2], We exploit the largest source of on-chip memory on the GPU—the collective register files of 6144 hardware thread contexts on a TitanX GPU—to cache the RNN parameters and reuse them over multiple timesteps during training., At each level, it describes a collection of processor cores and the associated on-chip memory/cache capacities with four parameters (computational bandwidth, memory capacity, memory bandwidth, and memory latency). We attack the cost of inter-processor synchronization with an optimized assembly level barrier implementation, demonstrating that such barriers implemented in software can reduce latency by approximately 10x compared to relying on repeated kernel launches., Synchronization between GPU processors cores is typically achieved implicitly between dependent kernel calls in both CUDA and OpenCL development frameworks. However, this mechanism for synchronization between timesteps requires launching a new kernel that forces the weights to be reloaded from off-chip memory.)).
In summary, as shown in the analysis above, claims 1-3, 5-8, and 10-11 do not provide any additional elements that, when considered individually or as an ordered combination, amount to significantly more than the abstract idea identified. Therefore, as a whole, claims 1-3, 5-8, and 10-11 do not recite what the courts have identified as “significantly more”. In particular, there is no indication that the combination of elements improves the functioning of a computer or improves another technology when the claims are considered individually or as an ordered combination.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 6, and 11 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Gruslys et al. (“Memory-Efficient Backpropagation Through Time”, Advances in Neural Information Processing Systems 29, 2016, pp. 1-9), hereinafter referred to as Gruslys.

In regard to claim 1, Gruslys teaches A data processing method, implemented by circuits for implementing functions, comprising: performing a forward calculation of a neural network on global data to obtain output data of the forward calculation; storing the output data of the forward calculation in a global memory unit, wherein intermediate data is not stored in the global memory unit; ([p. 2, Section 2, p. 3, Section 3.1, Figure 1, Figure 4] Definition 3. The internal state of the RNN core for a given time-point is all the necessary information required to backpropagate gradients over that time step once an input vector, a gradient with respect to the output vector, and a gradient with respect to the output hidden state is supplied. We define it to also include an output hidden state., Our approach is illustrated in Figure 1. Once we start forward-propagating steps at time t = t0, at any given point y > t0 we can choose to put the current hidden state into memory (step 1). This step has the cost of y forward operations…. Once the state is put into memory at time y = D(t, m), we can reduce the problem into two parts by using a divide-and-conquer approach: running the same algorithm on the t > y side of the sequence while using m − 1 of the remaining memory slots at the cost of C(t − y, m − 1) (step 2), and then reusing m memory slots when backpropagating on the t ≤ y side at the cost of C(y, m) (step 3)., wherein a forward calculation is performed across a deep neural network corresponding to an RNN (unfolded over time to accommodate BPTT training) to generate an output (at each layer/time), such as a hidden state, that is stored in an internal memory (interpreted to be global memory) during the forward propagation, while the remaining parameters of the internal state of the RNN at any layer/time (interpreted to correspond to the updateable RNN parameters excluding the hidden state) are not retained in that memory.) 
performing the forward calculation of the neural network on the global data again to obtain intermediate data for a reverse calculation of the neural network; storing the intermediate data in a buffer unit; ([p. 3, Section 3, p. 3, Section 3.1, Figure 3, Figure 4] Every time when the state of the network at time t has to be restored, the algorithm would simply re-evaluate the state by forward-propagating inputs starting from the beginning until time t. As backpropagation happens in the reverse temporal order, results from the previous forward steps can not be reused (as there is no memory to store them). This would require repeating t forward steps before backpropagating gradients one step backwards (we only remember inputs and the initial state)…. When the memory is somewhat limited (but not very scarce) we may store only hidden RNN states at all time points. When errors have to be backpropagated from time t to t − 1, an internal RNN core state can be re-evaluated by executing another forward operation taking the previous hidden state as an input. The backward operation can follow immediately., Our approach is illustrated in Figure 1. Once we start forward-propagating steps at time t = t0, at any given point y > t0 we can choose to put the current hidden state into memory (step 1). This step has the cost of y forward operations…. Once the state is put into memory at time y = D(t, m), we can reduce the problem into two parts by using a divide-and-conquer approach: running the same algorithm on the t > y side of the sequence while using m − 1 of the remaining memory slots at the cost of C(t − y, m − 1) (step 2), and then reusing m memory slots when backpropagating on the t ≤ y side at the cost of C(y, m) (step 3)., wherein, in the backpropagation/reverse step, the full internal state of the given RNN layer/time upon which the backprop is acting is regenerated by repeating the set of forward calculations, such that this regeneration includes the hidden state/output data stored in internal/global memory and such that the internal states are stored in memory temporarily (buffer), with that information used to determine successively the full internal states up to and including that of the current layer/time that is undergoing backprop, as well as for use in the backprop step itself.) 
reading the intermediate data from the buffer unit; and performing the reverse calculation of the neural network on the intermediate data to obtain a result of the reverse calculation ([p. 3, Section 3, p. 3, Section 3.1, Figure 3, Figure 4] Every time when the state of the network at time t has to be restored, the algorithm would simply re-evaluate the state by forward-propagating inputs starting from the beginning until time t. As backpropagation happens in the reverse temporal order, results from the previous forward steps can not be reused (as there is no memory to store them). This would require repeating t forward steps before backpropagating gradients one step backwards (we only remember inputs and the initial state)…. When the memory is somewhat limited (but not very scarce) we may store only hidden RNN states at all time points. When errors have to be backpropagated from time t to t − 1, an internal RNN core state can be re-evaluated by executing another forward operation taking the previous hidden state as an input. The backward operation can follow immediately., Our approach is illustrated in Figure 1. Once we start forward-propagating steps at time t = t0, at any given point y > t0 we can choose to put the current hidden state into memory (step 1). This step has the cost of y forward operations…. Once the state is put into memory at time y = D(t, m), we can reduce the problem into two parts by using a divide-and-conquer approach: running the same algorithm on the t > y side of the sequence while using m − 1 of the remaining memory slots at the cost of C(t − y, m − 1) (step 2), and then reusing m memory slots when backpropagating on the t ≤ y side at the cost of C(y, m) (step 3)., wherein, in response to the completion of the forward computation to regenerate the (full) internal state of the RNN, this information is accessed/read during the reverse computation to perform the backprop computation and propagate its results, including gradients and any other information, to a next layer/time in the backward direction, such that this computation makes use of both the regenerated intermediate results (data buffer) and the hidden memory states (saved in internal memory).) 
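
For illustration only, and without characterizing the claims or the disclosure of Gruslys beyond the passages quoted above, the hidden-state-only strategy described at [p. 3, Section 3.1] (store only hidden states, re-evaluate the internal core state with another forward operation, then run the backward operation immediately) may be sketched in Python as follows. The scalar tanh recurrence, the function names, the variable buffer_a, and the choice of the final hidden state as the loss are assumptions made for this sketch; none of them is taken from Gruslys or from the application.

```python
import math


def forward_store_hidden(w, u, xs, h0):
    """Forward pass of a toy scalar RNN, h_t = tanh(w*h_{t-1} + u*x_t).

    Only the hidden states h_0..h_T are kept; the pre-activations
    (the "internal state" in this toy model) are discarded.
    """
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs


def backward_with_recompute(w, u, xs, hs, grad_hT):
    """Backward pass that recomputes each step's pre-activation on demand."""
    grad_w = 0.0
    grad_u = 0.0
    grad_h = grad_hT
    for t in range(len(xs), 0, -1):
        # Re-run one forward step from the stored previous hidden state.
        # buffer_a is a short-lived local value (hypothetical stand-in
        # for intermediate data held in a buffer rather than in the
        # long-lived store).
        buffer_a = w * hs[t - 1] + u * xs[t - 1]
        grad_a = grad_h * (1.0 - math.tanh(buffer_a) ** 2)
        grad_w += grad_a * hs[t - 1]
        grad_u += grad_a * xs[t - 1]
        grad_h = grad_a * w  # gradient passed to the previous time step
    return grad_w, grad_u


if __name__ == "__main__":
    w, u, xs, h0 = 0.5, -0.3, [1.0, 0.2, -0.7], 0.1
    hs = forward_store_hidden(w, u, xs, h0)
    print(backward_with_recompute(w, u, xs, hs, 1.0))
```

In this sketch the stored list hs grows by one value per time step, while the recomputed pre-activation exists only for the duration of one backward step; the gradients can be checked against finite differences of the forward function.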

Claim 6 is also rejected because it is just a system/device implementation of the same subject matter of claim 1, which can be found in Gruslys. It is noted that claim 6 also recites a processor that executes program code, which is also found in Gruslys (e.g., [pp. 5-6, Section 3.3, p. 7, Section 3.6, Figure 6]).

Claim 11 is also rejected because it is just a computer-readable medium (CRM) implementation of the same subject matter of claim 1, which can be found in Gruslys. It is noted that claim 11 also recites a processor that executes program code, which is also found in Gruslys (e.g., [pp. 5-6, Section 3.3, p. 7, Section 3.6, Figure 6]).
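
As a further illustration, the divide-and-conquer cost described in the passage of Gruslys quoted in the rejection of claim 1 (y forward operations, plus C(t − y, m − 1), plus C(y, m)) may be sketched as a small dynamic program. This is a sketch under stated assumptions and not a reproduction of the reference: the function name cost and the two boundary cases are assumptions. The case m = 1 follows the quoted statement that t forward steps must be repeated before each single backward step; the case t = 1 is assumed to cost one forward operation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def cost(t, m):
    """Forward operations needed to backpropagate t steps with m memory slots.

    Recurrence taken from the quoted passage:
        C(t, m) = min over y of  y + C(t - y, m - 1) + C(y, m)
    Boundary cases below are assumptions made for this sketch.
    """
    if t == 1:
        return 1  # assumed: a single step needs one forward operation
    if m == 1:
        # assumed: restart from the beginning for every backward step,
        # i.e. t + (t - 1) + ... + 1 forward operations
        return t * (t + 1) // 2
    # Try every position y at which the hidden state could be stored.
    return min(y + cost(t - y, m - 1) + cost(y, m) for y in range(1, t))


if __name__ == "__main__":
    for slots in (1, 2, 3):
        print(slots, cost(10, slots))
```

Under these assumptions, adding memory slots never increases the number of forward operations, which mirrors the trade-off between recomputation and storage discussed in the quoted passage.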

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 2, 3, 7, and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Gruslys in view of Wang et al. (“Accelerating Recurrent Neural Networks: A Memory-Efficient Approach”, IEEE Transactions On Very Large Scale Integration (VLSI) Systems, VOL. 25, NO. 10, October 2017, pp. 2763-2775), hereinafter referred to as Wang.

In regard to claim 2, the rejection of claim 1 is incorporated. Gruslys does not further teach further comprising: storing the result of the reverse calculation in a global memory unit. Although Gruslys teaches the propagation of results from the backprop step to earlier layers/times, Gruslys does not clearly specify where those values (e.g., weights, gradients) are stored.
However, Wang, in the analogous environment of memory-efficient implementation and training of neural networks, teaches storing the result of the reverse calculation in a global memory unit. ([p. 2766, Section IIID, p. 2770, Section VC, p. 2771, Section VD, Figure 8, Figure 9, Algorithm 1], During back propagation, the gradients (∂ L/∂Wt ) are computed with the quantized weights and the original nonlinear functions. The original weights Wt will then be updated with the accumulated gradients (∂ L/∂Wt )., In this paper, a strategy for reading wi and hi is proposed to minimize the number of memory accesses. Specifically, we define two counters c0 and c1 representing that we are processing the submatrix in c0th column (1 ≤ c0 ≤ p) and c1th row (1 ≤ c1 ≤ p) with the chessboard division. According to (16), all we need to fetch from memory now are w((r+c1−c0)%r)+1 and hc0 , which are used to compute W((r+c1−c0)%r)+1 · hc0 when c0 ≤ c1 or Sr W((r+c1−c0)%r)+1 · hc0 otherwise…. Due to the serial nature of memory accesses, fetching each element in vector wi from memory leads to high power consumption and reduces the overall system throughput. Instead, we can concatenate the r elements in vector wi and store only the concatenated value, with which only one data read access is needed to process one sub-MV., wherein weights that are computed/updated during a backward pass (Algorithm 1) are stored in WRAM (interpreted as being a global memory) that is then subsequently read/fetched (Figures 8 and 9 with the weight matrix organized/partitioned into a circulant representation) in order to perform subsequent feedforward operations (Algorithm 1, equations 1-6).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gruslys to incorporate the teachings of Wang to store the result of the reverse calculation in a global memory unit. The modification would be obvious because one of ordinary skill would be motivated to achieve improved memory efficiency with negligible loss in performance in implementing neural networks on embedded devices with insufficient memory to accommodate the weight matrices updated during backpropagation by minimizing the number of memory accesses required to read through judicious compression of the weight matrices stored externally to on-chip memory, especially when the neural network is an RNN. (Wang, [Abstract, p. 2763, Section 1, p. 2773, Section VII]).

In regards to claim 3, the rejection of claim 1 is incorporated and Gruslys further teaches wherein the neural network is a recurrent neural network (RNN), and the global data comprises: output data of a previous forward calculation, …. ([p. 2, Section 2, p. 3, Section 3.1, Figure 1, Figure 4], Definition 3. The internal state of the RNN core for a given time-point is all the necessary information required to backpropagate gradients over that time step once an input vector, a gradient with respect to the output vector, and a gradient with respect to the output hidden state is supplied. We define it to also include an output hidden state., Our approach is illustrated in Figure 1. Once we start forward-propagating steps at time t = t0, at any given point y > t0 we can choose to put the current hidden state into memory (step 1). This step has the cost of y forward operations…. Once the state is put into memory at time y = D(t, m), we can reduce the problem into two parts by using a divide-and-conquer approach: running the same algorithm on the t > y side of the sequence while using m − 1 of the remaining memory slots at the cost of C(t − y, m − 1) (step 2), and then reusing m memory slots when backpropagating on the t ≤ y side at the cost of C(y, m) (step 3)., wherein, as previously pointed out, the memory efficient method for training the RNN retains the hidden states (output data) of the forward (propagation) calculation in an internal (global) memory.)
However, Gruslys does not explicitly disclose a weight of the output data of the previous forward calculation, input data of the forward calculation, and a weight of the input data of the forward calculation. Although Gruslys teaches the propagation of results to earlier layers/times from the backprop/step, it does not clearly specify where those values (e.g., weights, gradients) are stored and does not clearly disclose where input data for the forward calculation is stored.
However, Wang, in the analogous environment of memory-efficient implementation and training of neural networks, teaches wherein the neural network is a recurrent neural network (RNN), and the global data comprises: output data of a previous forward calculation, a weight of the output data of the previous forward calculation, input data of the forward calculation, and a weight of the input data of the forward calculation. ([p. 2764, Section 1A, p. 2770, Section VC, p. 2771, Section VD, p. 2771, Section VE, Figure 6, Figure 8, Algorithm 1], Considering a sequence learning task and the input sequence S = (s1, s2, ..., sT), where st is the input of the network at time step t ∈ {1, ..., T}, a typical LSTM layer is described as follows: <equations 1-6>, In this paper, a strategy for reading wi and hi is proposed to minimize the number of memory accesses. Specifically, we define two counters c0 and c1 representing that we are processing the submatrix in c0th column (1 ≤ c0 ≤ p) and c1th row (1 ≤ c1 ≤ p) with the chessboard division. According to (16), all we need to fetch from memory now are w((r+c1−c0)%r)+1 and hc0 , which are used to compute W((r+c1−c0)%r)+1 · hc0 when c0 ≤ c1 or Sr W((r+c1−c0)%r)+1 · hc0 otherwise., Due to the serial nature of memory accesses, fetching each element in vector wi from memory leads to high power consumption and reduces the overall system throughput. Instead, we can concatenate the r elements in vector wi and store only the concatenated value, with which only one data read access is needed to process one sub-MV., Specifically, we allocate two uniform memory blocks, denoted by HRAM0 and HRAM1, respectively, to store ht and ht−1. At each time step t, ht−1 is fetched from one memory block and the corresponding computation result of ht will be saved to the other one. The pseudocode of the proposed ping-pong memory structure is given in Algorithm 2., wherein the memory-efficient neural network implementation is (like Gruslys) also an RNN, wherein the weights that are stored in memory (WRAM – as previously noted) include weights of results from both a previous pass/layer (e.g., U applied to h_t-1 in equations 1-4) as well as weights applied to input data for a current pass (e.g., W applied to x_t), wherein output data in the form of the hidden states for a previous forward calculation (h_t-1) is also stored in HRAM (interpreted as also being global) and wherein the input data for the forward calculation (x_t, s_t) is similarly being read into the GPU (internal memory accessible by the GPU which is also a global memory) from external memory (Algorithm 1).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gruslys to incorporate the teachings of Wang for the global data to comprise output data of a previous forward calculation, a weight of the output data of the previous forward calculation, input data of the forward calculation, and a weight of the input data of the forward calculation when the neural network is a recurrent neural network (RNN). The modification would be obvious because one of ordinary skill would be motivated to achieve improved memory efficiency with negligible loss in performance in implementing neural networks on embedded devices with insufficient memory to accommodate the weight matrices and hidden states updated during backpropagation by minimizing the number of memory accesses required to read through judicious compression of the weight matrices and hidden state stored externally to on-chip memory, especially when the neural network is an RNN. (Wang, [Abstract, p. 2763, Section 1, p. 2773, Section VII]).
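For illustration only, the ping-pong memory arrangement quoted from Wang in the rejection of claim 3 above (two uniform blocks, HRAM0 and HRAM1, where h_t-1 is read from one block and h_t is written to the other at each time step) might be sketched as follows. This is a minimal sketch and is not Wang's Algorithm 2: the function names rnn_step, run_ping_pong, and run_reference, the scalar weights u and w, and the elementwise tanh update are all assumptions introduced for this example.

```python
import math


def rnn_step(h_prev, x_t, u, w):
    """Hypothetical toy recurrent step: h_t = tanh(u * h_prev + w * x_t), elementwise."""
    return [math.tanh(u * h + w * x) for h, x in zip(h_prev, x_t)]


def run_ping_pong(inputs, h0, u, w):
    """Run the recurrence using two fixed blocks that swap read/write roles each step."""
    # Two equally sized blocks standing in for HRAM0 and HRAM1 (assumed layout).
    hram = [list(h0), [0.0] * len(h0)]
    read_idx = 0  # block currently holding h_{t-1}
    for x_t in inputs:
        write_idx = 1 - read_idx  # the other block receives h_t
        hram[write_idx][:] = rnn_step(hram[read_idx], x_t, u, w)
        read_idx = write_idx  # roles swap for the next time step
    return hram[read_idx]


def run_reference(inputs, h0, u, w):
    """Plain recurrence that allocates a new hidden-state list at every step."""
    h = list(h0)
    for x_t in inputs:
        h = rnn_step(h, x_t, u, w)
    return h
```

Under these assumptions, only two hidden-state blocks are ever held regardless of sequence length, and the final state matches the plain recurrence.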

Claim 7/6 is also rejected because it is just a system/device implementation of the same subject matter of claim 2/1 which can be found in Gruslys and Wang.

Claim 8/6 is also rejected because it is just a system/device implementation of the same subject matter of claim 3/1 which can be found in Gruslys and Wang.

Claims 5 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Gruslys in view of Diamos et al. (“Persistent RNNs: Stashing Recurrent Weights On-Chip”, Proceedings of the 33rd International Conference on Machine Learning, Vol. 48, 2016. pp. 1-10), hereinafter referred to as Diamos.

In regards to claim 5, the rejection of claim 1 is incorporated and Gruslys does not further teach wherein the buffer unit is a register or a cache. Gruslys does not specify whether the internal memory takes the form of a register or a cache.
However, Diamos, in the analogous environment of memory-efficient implementation and training of neural networks, teaches wherein the buffer unit is a register or a cache. ([p. 1, Section 1, pp. 2-3, Section 3.1, p. 3, Section 4.1, p. 4, Section 4.2], We exploit the largest source of on-chip memory on the GPU—the collective register files of 6144 hardware thread contexts on a TitanX GPU—to cache the RNN parameters and reuse them over multiple timesteps during training., At each level, it describes a collection of processor cores and the associated on-chip memory/cache capacities with four parameters (computational bandwidth, memory capacity, memory bandwidth, and memory latency)…. We first generate the MBSP model for the target processor or family of processors that we plan to execute our RNN…. These changes balance the computational, communication, synchronization, and memory capacity requirements of the RNN such that no one resource becomes a significant bottleneck. It does so by exploiting the reuse of RNN weights over multiple timesteps to avoid repeatedly loading weights from DRAM, and taking into account the significantly higher cost of synchronization and off-chip memory accesses as compared to floating-point math operations., Our implementation first loads the weight matrix into registers., Each thread in the TitanX GPU has access to approximately 1KB of memory that can be read at high enough bandwidth to saturate the floating point datapath. 
Out of this, we dedicate 896 bytes to store recurrent weights as shown in Figure 2, and the rest for intermediate computations., wherein a memory-efficient neural network implementation framework optimizes the usage of the on-chip (GPU) memory by stashing/caching (on-chip cache memory) parameters of that neural network such that these parameters are persisted/reused (in registers) during training (i.e., they include both forward and backward calculations) and wherein it is noted that intermediate values/calculations are also stored in the on-chip memory.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Gruslys to incorporate the teachings of Diamos for the buffer unit to be a register or a cache. The modification would be obvious because one of ordinary skill would be motivated to achieve improved memory efficiency, scalability, and throughput in training deep neural networks by caching neural network parameters used to train that neural network on the on-chip GPU memory for persistence and re-use, by modeling the memory and processing constraints to mitigate computational bottlenecks (Diamos, [Abstract, p. 3, Section 3.1, p. 8, Section 6, Figure 5]).

Claim 10/6 is also rejected because it is just a system/device implementation of the same subject matter of claim 5/1 which can be found in Gruslys and Diamos.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Li et al. (“E-RNN Design Optimization for Efficient Recurrent Neural Networks in FPGAs”, https://arxiv.org/pdf/1812.07106.pdf, arXiv:1812.07106v1 [cs.CV] 12 Dec 2018, pp. 1-12) teach a memory efficient implementation of an RNN on an FPGA which specifies various methods of controlling the storage and movement of RNN parameters used in training around various internal (on-chip) memories.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT LEWIS KULP whose telephone number is (571)272-7983. The examiner can normally be reached M, Th, F 8-5:30; Tu 8-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ROBERT LEWIS KULP/Examiner, Art Unit 2124
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126