DETAILED ACTION
Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
2.	Claims 1-31 are presented for examination.
Claim Rejections - 35 USC § 101
3.	35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

3.1	Claims 1-31 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception/not new (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more. 
Step 2A- Prong One
The claim(s) recite(s) a data processing system for training a neural network, comprising: “for each of at least some of a plurality of training iterations for training the neural network: perform a series of operations on at least part of the training data to calculate output values for the model of the neural network running on the respective set of processing units”, which includes “calculating first output values for the first model of the neural network from training data; calculating second output values for the second model of the neural network from the training data”; “exchange over the at least one interconnect, with the other of the first and second set of processing units, data indicating a state of each of the models running on each of the sets of processing units; determine based on the received data indicating the state, a set of output values calculated for the model of the neural network running on the other set of processing units; evaluate a loss function for the respective training iteration, said loss function including a measure of dissimilarity between the output values calculated for the model of the neural network running on the respective set of processing units and the determined set of output values for the model of the neural network running on the other set of processing units, wherein the measure of dissimilarity is weighted in the evaluation of the loss function in accordance with a parameter”. Under the broadest reasonable interpretation, these limitations could reasonably fall under a mathematical concept. Therefore, the claims, as constructed, are directed to an abstract idea carried out by use of generic computer components.
Step 2A Prong Two
This judicial exception is not integrated into a practical application because the additional limitations, such as “a first set of one or more processing units”, “a second set of one or more processing units”, and “at least one data storage”, either alone or in combination, do not add significantly more to the judicial exception. They are mere instructions to apply the exception using generic computer components that perform well-known, routine, and conventional activities previously known in the industry, and are not sufficient to amount to significantly more than the judicial exception (see MPEP 2106.05(d)(i-iv)). The additional limitations of “exchange over the at least one interconnect, with the other of the first and second set of processing units, data indicating a state of each of the models running on each of the sets of processing units”; “update model parameters of the neural network using the respective evaluated loss function; and update the parameter for use in subsequent ones of the training iterations” merely amount to post-solution activities and do not add anything meaningful to the recited abstract idea, and thus the claims are not patent eligible under 35 USC 101. It is further noted that to transform an abstract idea, law of nature, or natural phenomenon into "a patent-eligible application", the claim must recite more than simply the judicial exception "while adding the words 'apply it'"; the additional limitations here could clearly amount to post-solution activities.
Step 2B
The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As previously discussed above with reference to the integration of the abstract idea into a practical application, the additional elements of using computer components amount to no more than mere instructions to perform the abstract idea, and thus the claims are not patent eligible under 35 U.S.C. 101, as constructed.
3.2	Dependent claims 2-18, 20-30 merely include limitations pertaining to further mathematical computation similar to that already recited by the independent claims and already addressed above, and thus are likewise not patent eligible under 35 USC 101.
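For orientation only, the loss evaluation recited in the independent claims (quoted under Step 2A, Prong One above) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the applicant's implementation: the use of cross-entropy as the task loss, Kullback-Leibler divergence as the measure of dissimilarity, and the names `combined_loss` and `weight` are all hypothetical choices introduced here; the claims do not specify them.

```python
import math


def softmax(scores):
    """Convert raw output values into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Task loss for one example (assumed here; not recited in the claims)."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """One possible measure of dissimilarity between two output distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def combined_loss(own_scores, other_scores, label, weight):
    """Task loss plus a dissimilarity term scaled by the parameter `weight`.

    `own_scores` are output values of the model on this set of processing
    units; `other_scores` are the output values determined for the model on
    the other set. Both names are hypothetical.
    """
    own = softmax(own_scores)
    other = softmax(other_scores)
    return cross_entropy(own, label) + weight * kl_divergence(other, own)
```

With identical outputs from both models the dissimilarity term vanishes and the result reduces to the task loss alone; a larger `weight` penalizes disagreement between the two models more heavily.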
Claim Rejections - 35 USC § 103
4.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

5.	Claims 1-12, 15, 18-23, 25-27, 30-31 are rejected under 35 U.S.C. 103 as being unpatentable over Farber et al. (PARALLEL NEURAL NETWORK TRAINING ON MULTI-SPERT, 1999 (8 pages)), in view of Hegde et al. (Parallel and Distributed Deep Learning, 8 pages (2016)).
	5.1	In considering claims 1, 19, and 31, Farber et al. teaches a data processing system for training a neural network (see title), the data processing system comprising: 
	a first set of one or more processing units (see page 3, fig.1 multiple Spert system), a second set of one or more processing units (see page 3, fig.1 multiple Spert system), at least one data storage, and at least one interconnect between the first set of one or more processing units, the second set of processing units and the at least one data storage, wherein the first set of one or more processing units is configured to run a first model of the neural network and the second set of one or more processing units is configured to run a second model of the neural network (see fig.1, see page 2-3, 2.2 Multi-Spert Hardware To construct a Multi-Spert system, we use commercial SBus expander boxes to enable a moderate number of Spert-II boards to be connected to the same Sun workstation host, as shown in Figure 1. The result is a shared bus master/slave architecture. The Sun host acts as master, controlling all data transfer and synchronization. Multi-Spert required little additional hardware and system software development over that for Spert-II, but the system has two main limitations. First, SBus only supports transactions with one slave device at a time, so broadcasts from the host have to be repeated to each Spert-II board. Second, a board cannot overlap host communication and T0 computation, so the host must ensure the board's T0 processor is idle before initiating a data transfer), wherein the at least one data storage is configured to provide over the at least one interconnect, training data to the first set of one or more processing units and the second set of one more processing units (see fig.1 page 2-3, 2 Multi-Spert Architecture 2.1 Spert-II Board A Spert-II board comprises a 40 MHz T0 vector microprocessor with 8 MB of memory mounted on a double-slot SBus card1 . T0 is capable of sustaining 320 million multiply-accumulate operations per second, with 16-bit fixed-point multiplies and 32-bit fixed-point accumulates. 
The T0 processor has a byteserial port that allows the host to access T0 memory at a peak bandwidth of 30 MB/s. The Spert-II board contains a field programmable gate array (FPGA) that transparently maps SBus read and write transactions to T0 serial port transactions. The current FPGA design supports data transfer rates of up to 10 MB/s for writes from the host and 4 MB/s for reads; 2.2 Multi-Spert Hardware To construct a Multi-Spert system, we use commercial SBus expander boxes to enable a moderate number of Spert-II boards to be connected to the same Sun workstation host, as shown in Figure 1. The result is a shared bus master/slave architecture. The Sun host acts as master, controlling all data transfer and synchronization. Multi-Spert required little additional hardware and system software development over that for Spert-II, but the system has two main limitations. First, SBus only supports transactions with one slave device at a time, so broadcasts from the host have to be repeated to each Spert-II board. Second, a board cannot overlap host communication and T0 computation, so the host must ensure the board's T0 processor is idle before initiating a data transfer), wherein each of the first and second set of processing units is configured to, for each of at least some of a plurality of training iterations for training the neural network: perform a series of operations on at least part of the training data to calculate output values for the model of the neural network running on the respective set of processing units (see page 5 (see 3.2), 3.2 Network Parallel Training The NP strategy splits the network across all the boards, as shown in Figure 2 (b). Compared with the PP strategy, the NP strategy has the disadvantage that every pattern must be sent to every board. In our application, we use 3-layer networks with many more hidden units than output units. We take advantage of this in our NP implementation by only communicating the final output activations. 
Hidden unit computations and all weight updates are entirely local to each node. Because the connection weights are now distributed across the slaves, the maximum network size scales with the number of boards, enabling us to train very large networks. As in the PP implementation, we overlap communication of patterns to each node with computation on other nodes, but for NP we use a bunch size of 12 patterns. In addition, we overlap the calculation of output errors on the host. When a slave finishes computing the output activations for a bunch, we immediately send the error values for the previous bunch along with the input values for the next bunch. This means each bunch is calculating errors based on the last but one bunch's weight updates. We found this delayed weight update did not affect convergence); exchange over the at least one interconnect, with the other of the first and second set of processing units, data indicating a state of each of the models running on each of the sets of processing units (see page 4-5 (3.1-3.2), 3.1 Pattern Parallel Training The PP strategy replicates the entire network on each node and presents different patterns in parallel, as illustrated in Figure 2 (a). The PP version does not impose any constraints on the network topology, as long as the weight updates of different patterns may be accumulated independently and combined. After each bunch of patterns, nodes synchronize by exchanging weight update information. Smaller bunch sizes require more frequent weight communication, but larger bunch sizes can affect training convergence. Because the network is replicated on each board, the size of trainable networks is limited to that which will fit into a single board's memory. 3.2 Network Parallel Training The NP strategy splits the network across all the boards, as shown in Figure 2 (b). Compared with the PP strategy, the NP strategy has the disadvantage that every pattern must be sent to every board. 
In our application, we use 3-layer networks with many more hidden units than output units. We take advantage of this in our NP implementation by only communicating the final output activations. Hidden unit computations and all weight updates are entirely local to each node. Because the connection weights are now distributed across the slaves, the maximum network size scales with the number of boards, enabling us to train very large networks); determine based on the received data indicating the state, a set of output values calculated for the model of the neural network running on the other set of processing units (see fig.1-2, 3.2 Network Parallel Training The NP strategy splits the network across all the boards, as shown in Figure 2 (b). Compared with the PP strategy, the NP strategy has the disadvantage that every pattern must be sent to every board. In our application, we use 3-layer networks with many more hidden units than output units. We take advantage of this in our NP implementation by only communicating the final output activations. Hidden unit computations and all weight updates are entirely local to each node. Because the connection weights are now distributed across the slaves, the maximum network size scales with the number of boards, enabling us to train very large networks. As in the PP implementation, we overlap communication of patterns to each node with computation on other nodes, but for NP we use a bunch size of 12 patterns. In addition, we overlap the calculation of output errors on the host. When a slave finishes computing the output activations for a bunch, we immediately send the error values for the previous bunch along with the input values for the next bunch. This means each bunch is calculating errors based on the last but one bunch's weight updates. We found this delayed weight update did not affect convergence). Farber et al. 
further provides for evaluating training performance (see section 4), updating model parameters, and ascertaining an error (“the dissimilarity”) between the output values calculated by the first and second set of processing units (see section 2.1 and 3.1-3.2). However, Farber et al. does not expressly state that said evaluation is a loss function evaluation. 
Hegde et al. teaches the step to evaluate a loss function for the respective training iteration (see section 4 on page 4, The goal of a learning algorithm is to minimize the loss function in a systematic manner. In the case of neural networks, the total-loss function is a separable and differentiable function of the model parameters. We need to come up with a way to iteratively update these parameters so that the value of the total-loss function reduces. One can visualize the total-loss function as consisting of a bunch of peaks and valleys and the goal is to get to the deepest valley [3]. One of the most popular ways to achieve this is to use a greedy approach: by following a direction opposite to the gradient of the loss function, since this is the direction which is most promising, locally (so to speak). The loss function in the case of neural networks is normally a separable function (i.e. it is average of loss functions for individual data points). So, in order to make the most optimal decision, we need to compute the gradient of the loss for all the images in the data-set with respect to all the parameters of the model. However, doing this is computationally expensive because of the sheer number of images on which we train these neural networks [4]. Therefore, it is necessary to use stochastic gradient descent, which computes the gradient of loss functions of a representative subset of the original data-set. This is repeated for many subsets of the original data-set until all images have been used up. This is called an epoch and the subset of data used for parameter update is called a mini-batch. Let the weights of the model be w. Here is the gradient descent update), said loss function including a metric measuring the dissimilarity between the output values calculated by the first and second set of processing units, wherein the metric is weighted in the evaluation of the loss function in accordance with a parameter (see section 4, 4. 
Stochastic Gradient Descent The goal of a learning algorithm is to minimize the loss function in a systematic manner. In the case of neural networks, the total-loss function is a separable and differentiable function of the model parameters. We need to come up with a way to iteratively update these parameters so that the value of the total-loss function reduces. One can visualize the total-loss function as consisting of a bunch of peaks and valleys and the goal is to get to the deepest valley [3]. One of the most popular ways to achieve this is to use a greedy approach: by following a direction opposite to the gradient of the loss function, since this is the direction which is most promising, locally (so to speak). The loss function in the case of neural networks is normally a separable function (i.e. it is average of loss functions for individual data points). So, in order to make the most optimal decision, we need to compute the gradient of the loss for all the images in the data-set with respect to all the parameters of the model. However, doing this is computationally expensive because of the sheer number of images on which we train these neural networks [4]. Therefore, it is necessary to use stochastic gradient descent, which computes the gradient of loss functions of a representative subset of the original data-set. This is repeated for many subsets of the original data-set until all images have been used up. This is called an epoch and the subset of data used for parameter update is called a mini-batch. Let the weights of the model be $w$. Here is the gradient descent update: $w \leftarrow w - \alpha \nabla_w L_{\text{total}}$ (1), where $L_{\text{total}} = \frac{1}{n}\sum_{i=1}^{n} L_i$ and $\nabla_w L_{\text{total}}$ is the gradient of the total loss function with respect to the weights. In the neural network that we trained, we used the logistic loss function for each image, given by: $L_i = -f_{y_i} + \log \sum_j e^{f_j}$ (2), where $f_j$ means the $j$-th element of the vector of class scores $f$. 
In stochastic gradient descent, we have the following weight update rule: $w \leftarrow w - \alpha \nabla_w L_{\text{minibatch}}$ (3). Here, $L_{\text{minibatch}} = \frac{1}{m}\sum_{i \in M} L_i$); update model parameters of the neural network using the respective evaluated loss function (see page 2, Backpropagation is used to update the parameters of these kernels (also called weights). So both forward and backward propagation is computationally intensive. Page 4, The goal of a learning algorithm is to minimize the loss function in a systematic manner. In the case of neural networks, the total-loss function is a separable and differentiable function of the model parameters. We need to come up with a way to iteratively update these parameters so that the value of the total-loss function reduces. One can visualize the total-loss function as consisting of a bunch of peaks and valleys and the goal is to get to the deepest valley. Therefore, it is necessary to use stochastic gradient descent, which computes the gradient of loss functions of a representative subset of the original data-set. This is repeated for many subsets of the original data-set until all images have been used up. This is called an epoch and the subset of data used for parameter update is called a mini-batch. Let the weights of the model be $w$. Here is the gradient descent update: $w \leftarrow w - \alpha \nabla_w L_{\text{total}}$ (1), where $L_{\text{total}} = \frac{1}{n}\sum_{i=1}^{n} L_i$ and $\nabla_w L_{\text{total}}$ is the gradient of the total loss function with respect to the weights. In the neural network that we trained, we used the logistic loss function for each image, given by: $L_i = -f_{y_i} + \log \sum_j e^{f_j}$ (2), where $f_j$ means the $j$-th element of the vector of class scores $f$. In stochastic gradient descent, we have the following weight update rule:); and update the parameter for use in subsequent ones of the training iterations (see page 2, Backpropagation is used to update the parameters of these kernels (also called weights). So both forward and backward propagation is computationally intensive. 
Page 4, The goal of a learning algorithm is to minimize the loss function in a systematic manner. In the case of neural networks, the total-loss function is a separable and differentiable function of the model parameters. We need to come up with a way to iteratively update these parameters so that the value of the total-loss function reduces. One can visualize the total-loss function as consisting of a bunch of peaks and valleys and the goal is to get to the deepest valley. Therefore, it is necessary to use stochastic gradient descent, which computes the gradient of loss functions of a representative subset of the original data-set. This is repeated for many subsets of the original data-set until all images have been used up. This is called an epoch and the subset of data used for parameter update is called a mini-batch. Let the weights of the model be $w$. Here is the gradient descent update: $w \leftarrow w - \alpha \nabla_w L_{\text{total}}$ (1), where $L_{\text{total}} = \frac{1}{n}\sum_{i=1}^{n} L_i$ and $\nabla_w L_{\text{total}}$ is the gradient of the total loss function with respect to the weights. In the neural network that we trained, we used the logistic loss function for each image, given by: $L_i = -f_{y_i} + \log \sum_j e^{f_j}$ (2), where $f_j$ means the $j$-th element of the vector of class scores $f$. In stochastic gradient descent, we have the following weight update rule). 
Farber et al. and Hegde et al. are analogous art because they are from the same field of endeavor and the model analyzed by Hegde et al. is similar to that of Farber et al. Therefore, it would have been obvious to a person of ordinary skill in the art at the time of filing of the Applicant’s invention to combine the system of Hegde et al. with that of Farber et al. because Hegde et al. teaches the improvement of training times (see abstract) and performance of the whole network (page 2 right column).
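For reference, the mini-batch stochastic gradient descent update quoted above from Hegde et al. (the logistic loss per example and the weight update with learning rate α) can be sketched as follows. This is a minimal illustration only: the linear scorer f = Wx, the function names, and the toy data in the usage note are assumptions made here and do not appear in either reference.

```python
import math


def logistic_loss_and_grad(scores, label):
    """Per-example loss L_i = -f_{y_i} + log(sum_j exp(f_j)) and its gradient dL_i/df."""
    m = max(scores)  # shift by the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    loss = -scores[label] + m + math.log(total)
    grad = [e / total for e in exps]  # softmax probabilities
    grad[label] -= 1.0                # minus the one-hot target
    return loss, grad


def minibatch_step(weights, batch, lr):
    """One update w <- w - lr * grad(L_minibatch) for a linear scorer f = W x.

    `weights` is a list of rows (one per class); `batch` is a list of
    (feature_vector, label) pairs. Returns the new weights and the
    mini-batch loss measured before the update.
    """
    n_classes, n_features = len(weights), len(weights[0])
    grad_w = [[0.0] * n_features for _ in range(n_classes)]
    batch_loss = 0.0
    for x, y in batch:
        scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        loss, dscores = logistic_loss_and_grad(scores, y)
        batch_loss += loss / len(batch)
        for c in range(n_classes):
            for k in range(n_features):
                grad_w[c][k] += dscores[c] * x[k] / len(batch)
    new_weights = [
        [w - lr * g for w, g in zip(row, grow)]
        for row, grow in zip(weights, grad_w)
    ]
    return new_weights, batch_loss
```

Calling `minibatch_step` repeatedly on successive mini-batches until every example has been used once corresponds to one epoch in the terminology of the quoted passage.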
5.2	Regarding claim 2, the combined teachings of Farber et al. and Hegde et al. teach that the data indicating the state of each of the models running on the other of the sets of processing units comprises the determined set of output values, wherein the step of determining based on the received data indicating the state, the set of output values, comprises extracting those output values from the received data indicating the state (see Hegde et al. page 2, Backpropagation is used to update the parameters of these kernels (also called weights). Page 4, Therefore, it is necessary to use stochastic gradient descent, which computes the gradient of loss functions of a representative subset of the original data-set. This is repeated for many subsets of the original data-set until all images have been used up. This is called an epoch and the subset of data used for parameter update is called a mini-batch. Let the weights of the model be $w$. Here is the gradient descent update: $w \leftarrow w - \alpha \nabla_w L_{\text{total}}$ (1), where $L_{\text{total}} = \frac{1}{n}\sum_{i=1}^{n} L_i$ and $\nabla_w L_{\text{total}}$ is the gradient of the total loss function with respect to the weights. In the neural network that we trained, we used the logistic loss function for each image, given by: $L_i = -f_{y_i} + \log \sum_j e^{f_j}$ (2), where $f_j$ means the $j$-th element of the vector of class scores $f$. In stochastic gradient descent, we have the following weight update rule: $w \leftarrow w - \alpha \nabla_w L_{\text{minibatch}}$ (3). Here, $L_{\text{minibatch}} = \frac{1}{m}\sum_{i \in M} L_i$. Section 5.1 “For synchronous update, all loss gradients in a given mini-batch are computed using the same weights and full information of the average loss in a given mini-batch is used to update weights. The synchronization part comes because we wait till loss-gradients for all images in the mini-batch are computed.” Algorithm 3).
5.3	Regarding claim 3, the combined teachings of Farber et al. and Hegde et al. teach that the data indicating the state of each of the models running on the other of the sets of processing units comprises model parameters of the model running on the other of the sets of processing units, wherein the step of determining the set of output values comprises calculating those values using the received model parameters (see Hegde et al. page 2, Backpropagation is used to update the parameters of these kernels (also called weights). Page 4, Therefore, it is necessary to use stochastic gradient descent, which computes the gradient of loss functions of a representative subset of the original data-set. This is repeated for many subsets of the original data-set until all images have been used up. This is called an epoch and the subset of data used for parameter update is called a mini-batch. Let the weights of the model be $w$. Here is the gradient descent update: $w \leftarrow w - \alpha \nabla_w L_{\text{total}}$ (1), where $L_{\text{total}} = \frac{1}{n}\sum_{i=1}^{n} L_i$ and $\nabla_w L_{\text{total}}$ is the gradient of the total loss function with respect to the weights. In the neural network that we trained, we used the logistic loss function for each image, given by: $L_i = -f_{y_i} + \log \sum_j e^{f_j}$ (2), where $f_j$ means the $j$-th element of the vector of class scores $f$. In stochastic gradient descent, we have the following weight update rule: $w \leftarrow w - \alpha \nabla_w L_{\text{minibatch}}$ (3). Here, $L_{\text{minibatch}} = \frac{1}{m}\sum_{i \in M} L_i$. Section 5.1 “For synchronous update, all loss gradients in a given mini-batch are computed using the same weights and full information of the average loss in a given mini-batch is used to update weights. The synchronization part comes because we wait till loss-gradients for all images in the mini-batch are computed.” Algorithm 3). 
5.4	As per claims 4, 25, 30, the combined teachings of Farber et al. and Hegde et al. teach that wherein the training data provided by the at least one data storage over the interconnect comprises a same set of training data provided to the first set of one or more processing units and the second set of one or more processing units (see Hegde et al. page 2, parallel and distributed method at section 2, – Model parallelism: If the model is too big to be fit into a single machine, it can be split across multiple machines. For example, a single layer can be fit into the memory of a single machine and forward and backward propagation involves communication of output from one machine to another in a serial fashion. We resort to model parallelism only if the model cannot be fit into a single machine and not so much to fasten the training process). 
5.5	With regards to claims 5, 26, the combined teachings of Farber et al. and Hegde et al. teach that wherein the providing training data to the first set of one or more processing units and the second set of one or more processing units comprises providing different sets of training data to the first set of one or more processing units and the second set of one or more processing units (see Hegde et al. page 1 “This information of the structure of the data is stored in a distributed fashion. i.e. Information about the model is distributed across different layers in a neural network and in each layer, model information (weights) are distributed in different neurons”, page 2, parallel and distributed method at section 2, – Model parallelism: If the model is too big to be fit into a single machine, it can be split across multiple machines. For example, a single layer can be fit into the memory of a single machine and forward and backward propagation involves communication of output from one machine to another in a serial fashion. We resort to model parallelism only if the model cannot be fit into a single machine and not so much to fasten the training process, also algorithm 2, page 5 “For establishing the distribution of data, each machine passes a sample data of size $d$ to master machine to find the distribution of the data. To achieve this, we do a bit-torrent aggregate communication where in the first round, $k/2$ machines talk to $k/2$ other machines and pass $d$ message from one machine to another. In the next $k/4$ machines communicated $2d$ data between each other and so on. There are $\log(k)$ rounds of communication happen”). 
5.6	As per claims 6, 27, the combined teachings of Farber et al. and Hegde et al. teach that wherein the data indicating the state of each of the models running on the other of the sets of processing units comprises the determined set of output values, wherein at least one of the sets of processing units is configured to: receive over the at least one interconnect from the other of the set of processing units, at least part of the training data received from the at least one data storage by the other of the sets of processing units (see Farber page 4-5 (3.1-3.2), 3.1 Pattern Parallel Training The PP strategy replicates the entire network on each node and presents different patterns in parallel, as illustrated in Figure 2 (a). The PP version does not impose any constraints on the network topology, as long as the weight updates of different patterns may be accumulated independently and combined. After each bunch of patterns, nodes synchronize by exchanging weight update information. Smaller bunch sizes require more frequent weight communication, but larger bunch sizes can affect training convergence. Because the network is replicated on each board, the size of trainable networks is limited to that which will fit into a single board's memory. 3.2 Network Parallel Training The NP strategy splits the network across all the boards, as shown in Figure 2 (b). Compared with the PP strategy, the NP strategy has the disadvantage that every pattern must be sent to every board. In our application, we use 3-layer networks with many more hidden units than output units. We take advantage of this in our NP implementation by only communicating the final output activations. Hidden unit computations and all weight updates are entirely local to each node. 
Because the connection weights are now distributed across the slaves, the maximum network size scales with the number of boards, enabling us to train very large networks); and perform the calculating output values for the model of the neural network running on the respective set of processing units using the at least part of the training data received from the other of the set of processing units (see Hegde et al. page 2, Backpropagation is used to update the parameters of these kernels (also called weights). Page 4, Therefore, it is necessary to use stochastic gradient descent, which computes the gradient of loss functions of a representative subset of the original data-set. This is repeated for many subsets of the original data-set until all images have been used up. This is called an epoch and the subset of data used for parameter update is called a mini-batch. Let the weights of the model be $w$. Here is the gradient descent update: $w \leftarrow w - \alpha \nabla_w L_{\text{total}}$ (1), where $L_{\text{total}} = \frac{1}{n}\sum_{i=1}^{n} L_i$ and $\nabla_w L_{\text{total}}$ is the gradient of the total loss function with respect to the weights. In the neural network that we trained, we used the logistic loss function for each image, given by: $L_i = -f_{y_i} + \log \sum_j e^{f_j}$ (2), where $f_j$ means the $j$-th element of the vector of class scores $f$. In stochastic gradient descent, we have the following weight update rule: $w \leftarrow w - \alpha \nabla_w L_{\text{minibatch}}$ (3). Here, $L_{\text{minibatch}} = \frac{1}{m}\sum_{i \in M} L_i$. Section 5.1 “For synchronous update, all loss gradients in a given mini-batch are computed using the same weights and full information of the average loss in a given mini-batch is used to update weights. The synchronization part comes because we wait till loss-gradients for all images in the mini-batch are computed.” Algorithm 3). 
5.7	As per claim 7, the combined teachings of Farber et al. and Hegde et al. teach that wherein each of the first set of one or more processing units and the second set of one or more processing units comprises a cluster of processing units, each of the processing units being formed as part of a separate integrated circuit (see Farber fig.2, multi-Spert system and their associated nodes, 2.2 Multi-Spert Hardware To construct a Multi-Spert system, we use commercial SBus expander boxes to enable a moderate number of Spert-II boards to be connected to the same Sun workstation host, as shown in Figure 1. The result is a shared bus master/slave architecture. The Sun host acts as master, controlling all data transfer and synchronization. Multi-Spert required little additional hardware and system software development over that for Spert-II, but the system has two main limitations. First, SBus only supports transactions with one slave device at a time, so broadcasts from the host have to be repeated to each Spert-II board. Second, a board cannot overlap host communication and T0 computation, so the host must ensure the board's T0 processor is idle before initiating a data transfer; see further Hegde et al. fig.5 “Data parallelism for updating parameters”).
5.8	Regarding claim 8, the combined teachings of Farber et al. and Hegde et al. teach that wherein the updating of the parameter comprises at least one of the first and second set of processing units receiving an updated value for the parameter (see Hegde et al. page 6, SGD is computed on each machine locally. Therefore we will not have any communication cost for SGD. Once all the parameters are updated for each machine locally, we need to perform an All-to-One communication to send it to the driver machine where it will be averaged. For this, we will do a BitTorrent aggregate communication. The communication cost for this will be: $L(\frac{k}{2} + \frac{k}{4} + \dots) + \frac{kp}{B}(\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \dots) = O(Lk) + O(\frac{kp}{B})$. Communication cost for broadcasting parameters (One-to-All) computing average (All to one) in the last step is $O(kp)$ as $pk(\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \dots)$. Therefore the total communication cost is $O(Nk) + O(pk)$. Once the parameters have been aggregated and updated using data from all machine (this is why it is synchronous), parameters are broadcasted to all machines for the whole procedure to be repeated.).
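For illustration only, the aggregate-then-broadcast step quoted from Hegde et al. in paragraph 5.8 can be sketched as follows. This is a minimal sketch under stated assumptions: the function names `average_parameters` and `broadcast` and the example values are hypothetical, and the sketch does not model the BitTorrent-style communication pattern or its cost.

```python
def average_parameters(local_params):
    """Average the per-machine parameter vectors element-wise (All-to-One step)."""
    k = len(local_params)
    return [sum(values) / k for values in zip(*local_params)]


def broadcast(params, k):
    """Give each of the k machines its own copy of the averaged parameters (One-to-All step)."""
    return [list(params) for _ in range(k)]


# Hypothetical parameters after local SGD on three machines.
local = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
averaged = average_parameters(local)
copies = broadcast(averaged, len(local))
```

After the broadcast, every machine holds the same averaged parameter values, which corresponds to the "receiving an updated value" language discussed above.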
5.9	With regards to claims 9, 21, the combined teachings of Farber et al. and Hegde et al. teach that wherein the updating the parameter comprises at least one of the first and second set of processing units updating a value of the parameter to one of a set of values predefined before the training of the neural network (see Hegde et al. page 6, Once the parameters have been aggregated and updated using data from all machine (this is why it is synchronous), parameters are broadcasted to all machines for the whole procedure to be repeated).
5.10	As per claims 10, 22, the combined teachings of Farber et al. and Hegde et al. teach that wherein the updating the parameter is performed in dependence upon a learning rate for the neural network (see Hegde et al. figs. 2-4; section 3.2.2, Figure 4 is a graph of the ratio of CPU to GPU times for different matrix sizes).
5.11	As per claims 11, 23, the combined teachings of Farber et al. and Hegde et al. teach that wherein at least one of the first and second set of processing units is configured to calculate the updated parameter in dependence upon values calculated in dependence upon the training data and model parameters used for the respective training iteration (see Hegde et al. page 2, Backpropagation is used to update the parameters of these kernels (also called weights). Page 4, Therefore, it is necessary to use stochastic gradient descent, which computes the gradient of loss functions of a representative subset of the original data-set. This is repeated for many subsets of the original data-set until all images have been used up. This is called an epoch and the subset of data used for parameter update is called a mini-batch. Let the weights of the model be w. Here is the gradient descent update: $w \leftarrow w - \alpha \nabla_w L_{total}$ (1), where $L_{total} = \frac{1}{n} \sum_{i=1}^{n} L_i$ and $\nabla_w L_{total}$ is the gradient of the total loss function with respect to the weights. In the neural network that we trained, we used the logistic loss function for each image, given by: $L_i = -f_{y_i} + \log \sum_j e^{f_j}$ (2), where $f_j$ means the j-th element of the vector of class scores f. In stochastic gradient descent, we have the following weight update rule: $w \leftarrow w - \alpha \nabla_w L_{minibatch}$ (3). Here, $L_{minibatch} = \frac{1}{m} \sum_{i \in M} L_i$. Section 5.1 “For synchronous update, all loss gradients in a given mini-batch are computed using the same weights and full information of the average loss in a given mini-batch is used to update weights. The synchronization part comes because we wait till loss-gradients for all images in the mini-batch are computed.” Algorithm 3).
5.12	As per claim 12, the combined teachings of Farber et al. and Hegde et al. teach that wherein the values calculated in dependence upon the training data comprise an item selected from a list consisting of: the loss function; one or more gradients of the loss function; and a learning rate for a previous training iteration (see Hegde et al. page 2, Backpropagation is used to update the parameters of these kernels (also called weights). Page 4, Therefore, it is necessary to use stochastic gradient descent, which computes the gradient of loss functions of a representative subset of the original data-set. This is repeated for many subsets of the original data-set until all images have been used up. This is called an epoch and the subset of data used for parameter update is called a mini-batch. Let the weights of the model be w. Here is the gradient descent update: $w \leftarrow w - \alpha \nabla_w L_{total}$ (1), where $L_{total} = \frac{1}{n} \sum_{i=1}^{n} L_i$ and $\nabla_w L_{total}$ is the gradient of the total loss function with respect to the weights. In the neural network that we trained, we used the logistic loss function for each image, given by: $L_i = -f_{y_i} + \log \sum_j e^{f_j}$ (2), where $f_j$ means the j-th element of the vector of class scores f. In stochastic gradient descent, we have the following weight update rule: $w \leftarrow w - \alpha \nabla_w L_{minibatch}$ (3). Here, $L_{minibatch} = \frac{1}{m} \sum_{i \in M} L_i$. Section 5.1 “For synchronous update, all loss gradients in a given mini-batch are computed using the same weights and full information of the average loss in a given mini-batch is used to update weights. The synchronization part comes because we wait till loss-gradients for all images in the mini-batch are computed.” Algorithm 3).
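For illustration only, the mini-batch update rule of equations (1) through (3) quoted from Hegde et al. can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the toy model treats the weight vector w directly as the vector of class scores f, which is a simplification chosen for brevity and is not the network described in either reference.

```python
import math


def logistic_loss(scores, label):
    """Equation (2): L_i = -f_{y_i} + log(sum_j exp(f_j))."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(f - m) for f in scores))
    return -scores[label] + log_sum


def loss_gradient(scores, label):
    """Gradient of equation (2) with respect to the scores: softmax(f) - onehot(y)."""
    m = max(scores)
    exps = [math.exp(f - m) for f in scores]
    total = sum(exps)
    return [e / total - (1.0 if j == label else 0.0) for j, e in enumerate(exps)]


def sgd_step(w, labels, alpha):
    """Equation (3): w <- w - alpha * gradient of the mini-batch average loss.

    Assumption: the class scores are the weights themselves (toy model).
    Every per-example gradient is computed with the same weights w,
    mirroring the synchronous update quoted from Section 5.1.
    """
    m = len(labels)
    grad = [0.0] * len(w)
    for y in labels:
        g = loss_gradient(w, y)
        grad = [a + b / m for a, b in zip(grad, g)]
    return [wi - alpha * gi for wi, gi in zip(w, grad)]


w0 = [0.0, 0.0, 0.0]
batch = [0, 0, 1]  # hypothetical class labels of one mini-batch
before = sum(logistic_loss(w0, y) for y in batch) / len(batch)
w1 = sgd_step(w0, batch, alpha=0.5)
after = sum(logistic_loss(w1, y) for y in batch) / len(batch)
```

In this toy setting one step lowers the mini-batch average loss, and the learning rate α scales the size of the step.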
5.13	As per claim 15, the combined teachings of Farber et al. and Hegde et al. teach that wherein each of the processing units of the first and second sets of processing unit is configured to alternate between operating in a compute phase in which the respective processing unit performs calculations for training the neural network (see section 3.2, 3.2 Network Paral lel Training The NP strategy splits the network across all the boards, as shown in Figure 2 (b). Compared with the PP strategy, the NP strategy has the disadvantage that every pattern must be sent to every board. In our application, we use 3-layer networks with many more hidden units than output units. We take advantage of this in our NP implementation by only communicating the final output activations. Hidden unit computations and all weight updates are entirely local to each node. Because the connection weights are now distributed across the slaves, the maximum network size scales with the number of boards, enabling us to train very large networks. As in the PP implementation, we overlap communication of patterns to each node with computation on other nodes, but for NP we use a bunch size of 12 patterns. In addition, we overlap the calculation of output errors on the host. When a slave finishes computing the output activations for a bunch, we immediately send the error values for the previous bunch along with the input values for the next bunch. This means each bunch is calculating errors based on the last but one bunch's weight updates. We found this delayed weight update did not affect convergence.); and an exchange phase in which data for training the neural network is exchanged with others of the processing units, said data for training the neural network including the output values calculated by the first and second sets of processing units, wherein the step of exchanging, over the at least one interconnect, the output values is performed during one of the exchange phases (see Farber et al. 
page 4, We have implemented two different strategies for parallelizing backprop training, pattern parallel (PP) and network parallel (NP). 3.1 Pattern Parallel Training The PP strategy replicates the entire network on each node and presents different patterns in parallel, as illustrated in Figure 2 (a). The PP version does not impose any constraints on the network topology, as long as the weight updates of different patterns may be accumulated independently and combined. After each bunch of patterns, nodes synchronize by exchanging weight update information. Smaller bunch sizes require more frequent weight communication, but larger bunch sizes can affect training convergence. Because the network is replicated on each board, the size of trainable networks is limited to that which will fit into a single board's memory. In our PP implementation, we overlap the communication of patterns to one node with computation on other nodes. To reduce the buffer storage required for intermediate values, we only process 24 patterns at a time on each node. Updates are accumulated until the end of the bunch. We overlap reading back weight updates from one node with computation finishing on other nodes. All slaves then wait for the host to broadcast new weight values. See further Hegde page 5, Once the distribution has been established, we perform a One-to-All communication to send this information to each of the k machines. This is again a bit-torrent aggregate pattern. Therefore, the communication cost is similar to All-toOne communication explained about. Hegde Page 5, . In the next k 4 machines communicated 2d data between each other and so on. There are log(k) rounds of communication happening). Therefore, it would have been obvious to a person of skilled in the art at the time of filing of the Applicant’s invention to combine the system of Hegde et al. with that of Farber et al. because Hegde et al. 
teaches the improvement of training times (see abstract) and performance of the whole network (page 2 right column).
5.14	Regarding claim 18, the combined teachings of Farber et al. and Hegde et al. teach the host system comprising at least one processor (see Farber et al. fig.1) configured to: interface the first and second set of processing units with the at least one data storage and provide the training data to the first and second set of processing units from the at least one data storage (see Farber et al. page 4, We have implemented two different strategies for parallelizing backprop training, pattern parallel (PP) and network parallel (NP). 3.1 Pattern Parallel Training The PP strategy replicates the entire network on each node and presents different patterns in parallel, as illustrated in Figure 2 (a). The PP version does not impose any constraints on the network topology, as long as the weight updates of different patterns may be accumulated independently and combined. After each bunch of patterns, nodes synchronize by exchanging weight update information. Smaller bunch sizes require more frequent weight communication, but larger bunch sizes can affect training convergence. Because the network is replicated on each board, the size of trainable networks is limited to that which will fit into a single board's memory. In our PP implementation, we overlap the communication of patterns to one node with computation on other nodes. To reduce the buffer storage required for intermediate values, we only process 24 patterns at a time on each node. Updates are accumulated until the end of the bunch. We overlap reading back weight updates from one node with computation finishing on other nodes. All slaves then wait for the host to broadcast new weight values.).
As per claim 20, the combined teachings of Farber et al. and Hegde et al. teach the step of determining, based on the data indicating the state of the first model, fourth output values calculated for the first model (see Farber fig.1, page 2 “2.1 Spert-II Board A Spert-II board comprises a 40 MHz T0 vector microprocessor with 8 MB of memory mounted on a double-slot SBus card1 . T0 is capable of sustaining 320 million multiply-accumulate operations per second, with 16-bit fixed-point multiplies and 32-bit  fixed-point accumulates. The T0 processor has a byteserial port that allows the host to access T0 memory at a peak bandwidth of 30 MB/s. The Spert-II board contains a  field programmable gate array (FPGA) that transparently maps SBus read and write transactions to T0 serial port transactions. The current FPGA design supports data transfer rates of up to 10 MB/s for writes from the host and 4 MB/s for reads. For a realistic training run using our optimized training code, a single board achieves the performance listed in Table 1. The second column contains results for online training, where weights are updated after every pattern presentation. The third and fourth columns contain results for bunch mode training, where weights are updated only once after several patterns, or a bunch, have been presented. Bunch mode allows more frequent weight updates than a fully on-line, or batch mode, training where weights are updated only once after all patterns have been presented. Compared to the matrix-vector operations used in online training, bunch mode allows use of more efficient matrix-matrix operations2 . Performance is approximately doubled and reaches over 100 MCUPS. Note that a small bunch size of 12 is sufficient to reap most of the performance benefits. Experimental results show that moderate bunch sizes of about 100 patterns do not significantly change training convergence or accuracy. 
For larger bunch sizes of several thousand patterns, we've found that the total data set size must be roughly 100 times larger than the bunch size to achieve satisfactory convergence and final classification performance. Practical bunch size can also be limited by other factors, such as buffer memory size and numeric over ow for accumulated errors. The online training performances presented here are somewhat lower than those we reported earlier for a single board1 . The routines measured in this paper store weights to 32-bit precision but use only the most significant 16 bits for forward pass and error backpropagation, whereas previously1 we used 16- bit weights throughout. Although 16-bit weights have frame-level classification equivalent to single precision floating-point, we found that word-level utterance recognition performance suffered in some cases. Training with 32-bit weight updates yields word-level recognition equivalent to single precision floating-point.”); evaluating an additional loss function for the first training iteration, the additional loss function including a measure of dissimilarity between the first output values and the fourth output values, weighted in accordance with the first parameter; and updating the model parameters of the neural network using the additional loss function (see Hegde et al. section 4 on page 4, The goal of a learning algorithm is to minimize the loss function in a systematic manner. In the case of neural networks, the total-loss function is a separable and differentiable function of the model parameters. We need to come up with a way to iteratively update these parameters so that the value of the total-loss function reduces. One can visualize the total-loss function as consisting of a bunch of peaks and valleys and the goal is to get to the deepest valley [3]. 
One of the most popular ways to achieve this is to use a greedy approach: by following a direction opposite to the gradient of the loss function, since this is the direction which is most promising, locally (so to speak). The loss function in the case of neural networks is normally a separable function (i.e. it is average of loss functions for individual data points). So, in order to make the most optimal decision, we need to compute the gradient of the loss for all the images in the data-set with respect to all the parameters of the model. However, doing this is computationally expensive because of the sheer number of images on which we train these neural networks [4]. Therefore, it is necessary to use stochastic gradient descent, which computes the gradient of loss functions of a representative subset of the original data-set. This is repeated for many subsets of the original data-set until all images have been used up. This is called an epoch and the subset of data used for parameter update is called a mini-batch. Let the weights of the model be w. Here is the gradient descent update; (see section 4, 4. Stochastic Gradient Descent The goal of a learning algorithm is to minimize the loss function in a systematic manner. In the case of neural networks, the total-loss function is a separable and differentiable function of the model parameters. We need to come up with a way to iteratively update these parameters so that the value of the total-loss function reduces. One can visualize the total-loss function as consisting of a bunch of peaks and valleys and the goal is to get to the deepest valley [3]. One of the most popular ways to achieve this is to use a greedy approach: by following a direction opposite to the gradient of the loss function, since this is the direction which is most promising, locally (so to speak). The loss function in the case of neural networks is normally a separable function (i.e. it is average of loss functions for individual data points). 
So, in order to make the most optimal decision, we need to compute the gradient of the loss for all the images in the data-set with respect to all the parameters of the model. However, doing this is computationally expensive because of the sheer number of images on which we train these neural networks [4]. Therefore, it is necessary to use stochastic gradient descent, which computes the gradient of loss functions of a representative subset of the original data-set. This is repeated for many subsets of the original data-set until all images have been used up. This is called an epoch and the subset of data used for parameter update is called a mini-batch. Let the weights of the model be w. Here is the gradient descent update: $w \leftarrow w - \alpha \nabla_w L_{total}$ (1), where $L_{total} = \frac{1}{n} \sum_{i=1}^{n} L_i$ and $\nabla_w L_{total}$ is the gradient of the total loss function with respect to the weights. In the neural network that we trained, we used the logistic loss function for each image, given by: $L_i = -f_{y_i} + \log \sum_j e^{f_j}$ (2), where $f_j$ means the j-th element of the vector of class scores f. In stochastic gradient descent, we have the following weight update rule: $w \leftarrow w - \alpha \nabla_w L_{minibatch}$ (3). Here, $L_{minibatch} = \frac{1}{m} \sum_{i \in M} L_i$); update model parameters of the neural network using the respective evaluated loss function (see page 2, Backpropagation is used to update the parameters of these kernels (also called weights). So both forward and backward propagation is computationally intensive. Page 4, The goal of a learning algorithm is to minimize the loss function in a systematic manner. In the case of neural networks, the total-loss function is a separable and differentiable function of the model parameters. We need to come up with a way to iteratively update these parameters so that the value of the total-loss function reduces. One can visualize the total-loss function as consisting of a bunch of peaks and valleys and the goal is to get to the deepest valley. 
Therefore, it is necessary to use stochastic gradient descent, which computes the gradient of loss functions of a representative subset of the original data-set. This is repeated for many subsets of the original data-set until all images have been used up. This is called an epoch and the subset of data used for parameter update is called a mini-batch. Let the weights of the model be w. Here is the gradient descent update: $w \leftarrow w - \alpha \nabla_w L_{total}$ (1), where $L_{total} = \frac{1}{n} \sum_{i=1}^{n} L_i$ and $\nabla_w L_{total}$ is the gradient of the total loss function with respect to the weights. In the neural network that we trained, we used the logistic loss function for each image, given by: $L_i = -f_{y_i} + \log \sum_j e^{f_j}$ (2), where $f_j$ means the j-th element of the vector of class scores f. In stochastic gradient descent, we have the following weight update rule:); and update the parameter for use in subsequent ones of the training iterations (see page 2, Backpropagation is used to update the parameters of these kernels (also called weights). So both forward and backward propagation is computationally intensive. Page 4, The goal of a learning algorithm is to minimize the loss function in a systematic manner. In the case of neural networks, the total-loss function is a separable and differentiable function of the model parameters. We need to come up with a way to iteratively update these parameters so that the value of the total-loss function reduces. One can visualize the total-loss function as consisting of a bunch of peaks and valleys and the goal is to get to the deepest valley. Therefore, it is necessary to use stochastic gradient descent, which computes the gradient of loss functions of a representative subset of the original data-set. This is repeated for many subsets of the original data-set until all images have been used up. This is called an epoch and the subset of data used for parameter update is called a mini-batch. Let the weights of the model be w. 
Here is the gradient descent update: $w \leftarrow w - \alpha \nabla_w L_{total}$ (1), where $L_{total} = \frac{1}{n} \sum_{i=1}^{n} L_i$ and $\nabla_w L_{total}$ is the gradient of the total loss function with respect to the weights. In the neural network that we trained, we used the logistic loss function for each image, given by: $L_i = -f_{y_i} + \log \sum_j e^{f_j}$ (2), where $f_j$ means the j-th element of the vector of class scores f. In stochastic gradient descent, we have the following weight update rule)). Therefore, it would have been obvious to a person of ordinary skill in the art at the time of filing of the Applicant’s invention to combine the system of Hegde et al. with that of Farber et al. because Hegde et al. teaches the improvement of training times (see abstract) and performance of the whole network (page 2 right column).
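For illustration only, the claimed loss function that includes a measure of dissimilarity weighted in accordance with a parameter can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the choice of a mean squared difference as the dissimilarity measure, and the example values are hypothetical and are not taken from the application or from either reference.

```python
def mean_squared_difference(a, b):
    """Dissimilarity between two equally sized lists of output values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(task_loss, own_outputs, peer_outputs, weight):
    """Task loss plus the dissimilarity term scaled by the weighting parameter."""
    return task_loss + weight * mean_squared_difference(own_outputs, peer_outputs)


# Hypothetical outputs of the model on this set of processing units and on the other set.
own = [0.5, 0.25, 0.25]
peer = [0.25, 0.5, 0.25]
loss_unweighted = combined_loss(1.0, own, peer, weight=0.0)
loss_weighted = combined_loss(1.0, own, peer, weight=2.0)
```

With a weight of zero the dissimilarity term drops out; raising the weight increases the penalty for disagreement between the two models.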
6.	Claims 13-14, 24, 28-29 are rejected under 35 U.S.C. 103 as being unpatentable over Farber et al. (PARALLEL NEURAL NETWORK TRAINING ON MULTI-SPERT, 1999 (8 pages)), in view of Hegde et al. (Parallel and Distributed Deep Learning, 8 pages (2016)), further in view of Rasmus-Vorrath et al. (WO 2020/037055 A1).
6.1	Regarding claims 13, 24, 28, Farber et al., as modified by Hegde et al., teaches most of the instant invention, including calculating the updated parameters (see Hegde page 2, Backpropagation is used to update the parameters of these kernels (also called weights); see Hegde page 4, The loss function in the case of neural networks is normally a separable function (i.e. it is average of loss functions for individual data points). So, in order to make the most optimal decision, we need to compute the gradient of the loss for all the images in the data-set with respect to all the parameters of the model. Page 7, Currently asynchronous update of parameters is done for each layer. When we use back-propagation to update weights of each layer, we move to the previous layer to update its parameters, only after completely updating parameters of the current layer. Therefore, parameter update is still in many ways, synchronous. Truly asynchronous weight update can be achieved if we do the back-propagation also in an asynchronous manner. See further Algorithms 1-4). However, Farber et al., as modified, does not expressly teach the use of a moving average. Rasmus-Vorrath et al. teaches the use of a moving average using previously determined parameter values for a plurality of previous training iterations (see para 256, 262-263, [0262] Training data includes values of model features based on historical data (rolling or otherwise) collected at the site. For example, training data may include the maximum photosensor values and/or the minimum IR sensor values of the historical readings of photosensors and infrared sensors at the site. In another example, training data may include model features based on calculations of rolling windows (e.g. a rolling mean, a rolling median, a rolling minimum, a rolling maximum, a rolling exponentially weighted moving average, and a rolling correlation, etc.) of historical readings of photosensors and infrared sensors collected at the site. 
Depending on the number and types of weather conditions covered by the training data, the training data might include data obtained over days, weeks, months, or years. [0263] In certain embodiments, the training data fed into a neural network model or other model includes model input features that are based on calculations of multiple rolling windows of historical sensor data such as described above. For example, the set of training data may include six rolling calculations of a rolling mean, a rolling median, a rolling minimum, a rolling maximum, a rolling exponentially weighted moving average, and a rolling correlation for multiple rolling windows of historical data of each of a maximum photosensor value and a minimum IR sensor value where the forecasted output is learned as a function of a time frame of history of these inputs. If the six (6) rolling calculations were used for five (5) rolling windows ranging in length from six (6) to ten (10) minutes for each of the maximum photosensor and minimum IR sensor values where the forecasted output is learned as a function of four (4) minutes of history, the set of input features in the training data is 240). 
Farber et al., Hegde et al., and Rasmus-Vorrath et al. are analogous art because they are from the same field of endeavor and the model analyzed by Rasmus-Vorrath et al. is similar to that of Farber et al. and Hegde et al. Therefore, it would have been obvious to a person of ordinary skill in the art at the time of filing of the Applicant’s invention to combine the system of Rasmus-Vorrath et al. with that of Farber et al. and Hegde et al. because Rasmus-Vorrath et al. teaches the improvement of performance of the system (page 289).
6.2	As per claims 14, 29, the combined teachings of Farber et al., Hegde et al., and Rasmus-Vorrath et al. teach that wherein the moving average is an exponential moving average (see Rasmus-Vorrath para 256, 262-263, [0262] Training data includes values of model features based on historical data (rolling or otherwise) collected at the site. For example, training data may include the maximum photosensor values and/or the minimum IR sensor values of the historical readings of photosensors and infrared sensors at the site. In another example, training data may include model features based on calculations of rolling windows (e.g. a rolling mean, a rolling median, a rolling minimum, a rolling maximum, a rolling exponentially weighted moving average, and a rolling correlation, etc.) of historical readings of photosensors and infrared sensors collected at the site. Depending on the number and types of weather conditions covered by the training data, the training data might include data obtained over days, weeks, months, or years. [0263] In certain embodiments, the training data fed into a neural network model or other model includes model input features that are based on calculations of multiple rolling windows of historical sensor data such as described above. For example, the set of training data may include six rolling calculations of a rolling mean, a rolling median, a rolling minimum, a rolling maximum, a rolling exponentially weighted moving average, and a rolling correlation for multiple rolling windows of historical data of each of a maximum photosensor value and a minimum IR sensor value where the forecasted output is learned as a function of a time frame of history of these inputs. 
If the six (6) rolling calculations were used for five (5) rolling windows ranging in length from six (6) to ten (10) minutes for each of the maximum photosensor and minimum IR sensor values where the forecasted output is learned as a function of four (4) minutes of history, the set of input features in the training data is 240).
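For illustration only, an exponential moving average over values from previous iterations can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the decay value, and the choice to seed the average with the first value are hypothetical and are not taken from Rasmus-Vorrath et al. or from the application.

```python
def exponential_moving_average(values, decay):
    """Return the running average ema_t = decay * ema_{t-1} + (1 - decay) * value_t.

    Assumption: the average is seeded with the first value in the sequence.
    """
    ema = values[0]
    history = [ema]
    for v in values[1:]:
        ema = decay * ema + (1.0 - decay) * v
        history.append(ema)
    return history


# Hypothetical parameter values determined in four previous iterations.
smoothed = exponential_moving_average([1.0, 0.0, 0.0, 1.0], decay=0.5)
```

Older values contribute with geometrically decreasing weight, which is what distinguishes an exponential moving average from a simple rolling mean.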
7.	Claims 16-17 are rejected under 35 U.S.C. 103 as being unpatentable over Farber et al. (PARALLEL NEURAL NETWORK TRAINING ON MULTI-SPERT, 1999 (8 pages)), in view of Hegde et al. (Parallel and Distributed Deep Learning, 8 pages (2016)), further in view of Nounagnon (Using Kullback-Leibler Divergence to Analyze the Performance of Collaborative Positioning, 169 pages).
7.1	Regarding claim 16, Farber et al., as modified by Hegde et al., teaches most of the instant invention; however, he does not expressly use the Kullback-Leibler divergence. Nounagnon provides the use of a Kullback-Leibler divergence between the values (see chapter 3, Kullback-Leibler Divergence (KLD) measures the distance between two distributions. It achieves this by comparing the shapes of two pdfs, one of which being the reference for accuracy based on the data at hand. There are two important features about KLD that we utilize in this chapter. Firstly, KLD encompasses all the statistical information that can be known about each distribution in its comparison. This means that its comparison is not restricted to the average behavior of the distributions (the first two moments). Secondly, it is a relative measure of accuracy. It achieves this by comparing distributions to a reference for accuracy. In this chapter, we use KLD to introduce a performance metric which outperforms a comparison of RMSE, especially when distributions of position error have a heavy-tail. A heavy-tail skews the first two moments, making them poor descriptors of the overall distribution. 4.7.2 The Kullback-Leibler Divergence for multivariate Skew-Normal Distributions Valle et al derived the multivariate Kullback-Leibler Divergence between the pdfs of two k-dimensional vectors such that: fX1(x) ∼ SNk(ξ1 , Ω1, η1 ) and fX2(x) ∼ SNk(ξ2 , Ω2, η2 ) [46] as follow: KLD(fX1||fX2) = KLD(fX01||fX02)+s 2 π (ξ1−ξ2 ) T Ω −1 2 δ1+E[log{2Φ(W1−1)}]−E[log{2Φ(W2−1)}] (4.61) where KLD(fX01||fX02) is the KLD for normal multivariate distributions, with fX01(x) ∼ SNk(ξ1 , Ω1, 0) and with fX02(x) ∼ SNk(ξ2 , Ω2, 0). The thorough derivation of the KLD between normal multivariate distributions KLD(fX01||fX02) was provided in Appendix 4.A.). 
Farber et al., Hegde et al., and Nounagnon are analogous art because they are from the same field of endeavor and the model analyzed by Nounagnon is similar to that of Farber et al. and Hegde et al. Therefore, it would have been obvious to a person of ordinary skill in the art at the time of filing of the Applicant’s invention to combine the system of Nounagnon with that of Farber et al. and Hegde et al. because Nounagnon teaches the improvement of accuracy (see page 2, So, we define a novel theoretical model to analyze the improvement in accuracy due to collaboration. Using this model, we introduce a variational analysis of collaborative positioning to determine factors that affect the improvement in accuracy due to collaboration. We derive range conditions when collaborative positioning starts to degrade the performance of standalone positioning. We derive and test criteria to determine on-the-fly (ahead of time) whether it is worth collaborating or not in order to improve accuracy).
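For illustration only, a Kullback-Leibler divergence between two sets of output values can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the standard discrete form of the divergence rather than the continuous skew-normal form quoted from Nounagnon, the function name and example distributions are hypothetical, and each input is assumed to be a probability distribution with q nonzero wherever p is nonzero.

```python
import math


def kl_divergence(p, q):
    """KLD(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions.

    Assumption: q_i > 0 wherever p_i > 0; terms with p_i == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical output distributions of two models for the same input.
p = [0.5, 0.5]
q = [0.25, 0.75]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
d_pp = kl_divergence(p, p)
```

The divergence is zero for identical distributions and is not symmetric, so the choice of which distribution serves as the reference matters.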
	7.2	As per claim 17, the combined teachings of Farber et al., Hegde et al., and Nounagnon teach that wherein the measure of the dissimilarity comprises a mean squared error between the output values calculated for the model of the neural network running on the respective set of processing units and the determined set of output values for the model of the neural network running on the other set of processing units (see Nounagnon chapter 3, section 3.1, Given a set of M position estimate vectors $x'_k$ with values $\{x'_1, x'_2, \dots, x'_M\}$, and a position vector x of a true position, RMSE is defined as: $RMSE = \sqrt{E[\delta_k^2]} = \sqrt{E[\|x'_k - x\|^2]}$ (3.1), where $\delta_k$ represents the k-th position error for position estimate $x'_k$. In practice, the expected value term in the equation above is estimated using the sample mean. RMSE is an absolute error. This means that it measures error in comparison to the truth (the true position). Hence, it is only useful when compared against another absolute error, or against a reference for accuracy (like the CRB). The latter comparison is called relative error assessment. In reference [81], Li et al address the advantages of relative measures of performance. The authors indicate that ’relative error reveals better the inherent error characteristics of an estimator rather than the absolute error’. It is common practice to either compare RMSE to another RMSE or its squared to the Cramer-Rao Bound. In an RMSE-to-CRB comparison, the efficiency of the estimator is evaluated; that is, its performance with respect to (w.r.t) the best performance achievable if the estimator is unbiased ([1], [21], [84], [85], [31], [86], [87], [12]). In an RMSE-to-RMSE comparison, the estimator with the smallest RMSE is deemed most accurate. RMSE is the square root of the mean of all squared position errors. 
Thus, a comparison of RMSEs only assesses performance based on the central information (mean and variance) of the error distributions. As a result, a comparison of RMSEs does not account well for end-tail and heavy-tailed distributions [67]. In fact, the presence of a single outlier can offset an RMSE estimate and lead to an erroneous assessment of performance.). 
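For illustration only, and not as part of the cited teachings, the quantities discussed above can be sketched in Python. The sketch assumes the sample-mean estimate of equation (3.1) and a plain element-wise mean squared error between two sets of output values. The function names and all numeric values are hypothetical and do not come from Nounagnon or from the claims.

```python
import math


def mean_squared_error(outputs_a, outputs_b):
    """Mean of the squared element-wise differences between two output lists."""
    if len(outputs_a) != len(outputs_b):
        raise ValueError("output lists must have the same length")
    return sum((a - b) ** 2 for a, b in zip(outputs_a, outputs_b)) / len(outputs_a)


def rmse(estimates, true_position):
    """Sample-mean estimate of eq. (3.1): sqrt of the mean squared Euclidean error."""
    squared_errors = [
        sum((e_i - t_i) ** 2 for e_i, t_i in zip(estimate, true_position))
        for estimate in estimates
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical output values of two models for the same training data.
first_model_outputs = [0.2, 0.7, 0.1]
second_model_outputs = [0.1, 0.8, 0.1]
dissimilarity = mean_squared_error(first_model_outputs, second_model_outputs)

# Hypothetical position estimates around a true position at the origin.
true_position = (0.0, 0.0)
estimates = [(3.0, 4.0), (0.0, 0.0)]
rmse_clean = rmse(estimates, true_position)

# Adding a single far-off estimate shifts the RMSE substantially,
# which is the outlier sensitivity noted in the quoted passage.
rmse_with_outlier = rmse(estimates + [(30.0, 40.0)], true_position)
```

In this sketch the mean squared error is simply the square of an RMSE-style quantity taken between two sets of outputs rather than against a true value. That is an interpretive assumption made for the example, not a statement found in the reference.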
Conclusion
8.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
	8.1	Langford et al. (WO2016037351A1) teach a computing system for training neural networks.
	8.2	Shattil (USPG_PUB No. 2020/0364545) teaches computational efficiency improvements for artificial neural networks.
9.	Claims 1-31 are rejected and this action is non-final. Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANDRE PIERRE-LOUIS whose telephone number is (571)272-8636. The examiner can normally be reached M-F 9:00 AM-5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamini S Shah, can be reached at 571-272-2279. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ANDRE PIERRE LOUIS/
Primary Patent Examiner, Art Unit 2146
August 13, 2022