DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Applicant’s claim for the benefit of a prior-filed application under 35 U.S.C. 119(e) or under 35 U.S.C. 120, 121, 365(c), or 386(c) is acknowledged.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 05/07/2018, 07/15/2019, 09/10/2019, 10/18/2019, and 01/06/2020 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
7. 	Claims 1-8 and 10-20 are rejected under 35 U.S.C. 103 as being unpatentable over Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467 (2016) in view of Malaya et al. (US 2019/0171420 A1, “Malaya”).
Regarding claim 1, Abadi teaches a computer-readable memory storing computer-executable instructions that when executed by a processor, cause the processor to perform a method, the method comprising: using a marker node to identify a subgraph of a neural network model to partition from the neural network model, the marker node located at a boundary of the subgraph (Abadi, pg. 7, sec. 4.2 Partial execution, fig. 6, “Two arguments to the Run call help define the exact subgraph of the computation graph that will be executed. First, the Run call accepts inputs, an optional mapping of name:port names to “fed” tensors values. Second, the Run call accepts output names, a list of output name[:port] specifications indicating which nodes should be executed, and, if the port portion is present in a name, that that particular output tensor value for the node should be returned to the client if the Run call completes successfully. The graph is transformed based on the values of inputs and outputs. Each node:port specified in inputs is replaced with a feed node, which will pick up the provided input tensor from specially-initialized entries in a Rendezvous object used for the Run call. Similarly, each output name with a port is connected to a special fetch node that arranges to save the output tensor and return it to the client when the Run call is complete.”); compiling the identified subgraph to a neural network accelerator to generate configuration information for the neural network accelerator (Abadi, pg. 7, sec. 4.3, “TensorFlow clients can control the placement of nodes on devices by providing partial constraints for a node about which devices it can execute on. For example, [‘]only place this node on a device of type GPU[’]… [w]ithin the confines of these constraints, the placement algorithm is responsible for choosing an assignment of nodes to devices that provides fast execution of the computation and also satisfies various constraints imposed by the devices themselves, such as limiting the total amount of memory needed on a device in order to execute its subset of graph nodes.”); and configuring the processor in communication with the neural network accelerator to evaluate the neural network model using the neural network accelerator to provide the accelerated version of the subgraph (Abadi, pg. 5, sec. 3.2.2 Cross-Device Communication, fig. 4, “Once the node placement has been computed, the graph is partitioned into a set of subgraphs, one per device. Any cross-device edge from x to y is removed and replaced by an edge from x to a new Send node in x’s subgraph and an edge from a corresponding Receive node to y in y’s subgraph… [a]t runtime, the implementations of the Send and Receive nodes coordinate to transfer data across devices. This allows us to isolate all communication inside Send and Receive implementations, which simplifies the rest of the runtime.”).
Abadi does not teach: configuring the neural network accelerator with the configuration information to provide an accelerated version of the subgraph.
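However, Malaya teaches: configuring the neural network accelerator with the configuration information to provide an accelerated version of the subgraph (Malaya, paras. 0030-0031, fig. 4, “The reconfigurable nature of FPGA devices is also well suited for supporting variable precision calculations. The neural network 400 includes neurons 401-420, which are implemented by configuring the configurable logic blocks of FPGA devices 121-123. Neural network 400 is illustrated as including twenty neurons 401-420 and can represent the entire neural network or a portion of a larger network. In some embodiments, a neural network can include any number of neurons, with some neural networks having up to 10^7 neurons or more.”).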
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Abadi’s computer-readable memory in view of Malaya to teach: configuring the neural network accelerator with the configuration information to provide an accelerated version of the subgraph. The motivation to do so would be to have a device that can support a wide range of numerical precisions for different machine learning applications (Malaya, para. 0013, “[T]he reconfigurable nature of field programmable gate array (FPGA) devices in a computing system allows the system to support a wide range of numerical precisions and to dynamically vary the precision for key computations at run time. This capability enables optimally efficient computation and provides a competitive advantage (via reductions in energy use, memory access, and computational cost achieved by reducing unnecessary precision) over conventionally operating devices.”).
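For illustration only, the cross-device partitioning described in the passage of Abadi quoted above for claim 1 (sec. 3.2.2) can be sketched as follows. This is a minimal sketch under stated assumptions and is not Abadi's implementation: the function name, the node names, the device names "cpu" and "accel", and the string labels for the Send and Receive nodes are all hypothetical, and each node is assumed to produce a single output tensor.

```python
from collections import defaultdict


def partition_with_send_recv(edges, placement):
    """Split a dataflow graph into one subgraph per device.

    edges: list of (src, dst) node-name pairs (hypothetical representation).
    placement: dict mapping node name -> device name.
    Returns a dict mapping device -> list of edges in that device's subgraph.
    """
    subgraphs = defaultdict(list)
    recv_for = {}  # (src, dst_device) -> name of the single Receive node
    for src, dst in edges:
        src_dev, dst_dev = placement[src], placement[dst]
        if src_dev == dst_dev:
            # Same-device edges are kept unchanged.
            subgraphs[src_dev].append((src, dst))
            continue
        key = (src, dst_dev)
        if key not in recv_for:
            # First cross-device use: add one Send/Receive pair.
            send = f"send({src}->{dst_dev})"
            recv_for[key] = f"recv({src}@{dst_dev})"
            subgraphs[src_dev].append((src, send))
        # Every consumer on the destination device reuses the same Receive.
        subgraphs[dst_dev].append((recv_for[key], dst))
    return dict(subgraphs)


# Hypothetical four-node graph split across two devices.
edges = [("x", "y"), ("x", "z"), ("w", "x")]
placement = {"w": "cpu", "x": "cpu", "y": "accel", "z": "accel"}
parts = partition_with_send_recv(edges, placement)
for device, device_edges in parts.items():
    print(device, device_edges)
```

In this sketch both consumers of x on the second device share one Receive node, so the value of x would cross the device boundary only once.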
Regarding claim 2, Abadi in view of Malaya teaches the computer-readable memory of claim 1, wherein the processor is configured to perform computations at a higher precision than the neural network accelerator (Abadi pg. 9, sec. 5.5 Lossy Compression, “Some machine learning algorithms, including those typically used for training neural networks…we often use lossy compression of higher precision internal representations when sending data between devices…[f]or example, we often insert special conversion nodes that convert 32-bit floating point representations into a 16-bit floating point representation.” Note: It is being interpreted that the 32-bit floating point representation represents the precision of the processor and the 16-bit floating point representation represents the precision of the hardware accelerator).
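For illustration only, the loss of precision that results from converting a 32-bit floating point value to a 16-bit floating point value can be sketched as follows. This sketch assumes IEEE 754 half precision as the 16-bit format, which the quoted passage does not specify; the helper names are hypothetical, and the Python standard library struct module is used only as a convenient stand-in for a conversion node.

```python
import struct


def to_float32(x):
    """Round a Python float to the nearest IEEE 754 32-bit value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 16-bit value (assumed format)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


value = 0.1
hi = to_float32(value)  # precision attributed to the processor
lo = to_float16(hi)     # precision attributed to the accelerator
print("32-bit:", hi)
print("16-bit:", lo)
print("conversion error:", abs(hi - lo))
```

The conversion is lossy: converting the 16-bit value again leaves it unchanged, but the information discarded from the 32-bit value cannot be recovered.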
Regarding claim 3, Abadi in view of Malaya teaches the computer-readable memory of claim 1, wherein the neural network model is specified using source code of a machine learning native framework(Abadi, pg. 2, fig. 1, fig. 2,  “We have open-sourced the TensorFlow API and a reference implementation under the Apache 2.0 license… [c]lients typically construct a computational graph using one of the supported frontend languages (C++ or Python). An example fragment to construct and then execute a TensorFlow graph using the Python front end is shown in Figure 1, and the resulting computation graph in Figure 2.”).
Regarding claim 4, Abadi in view of Malaya teaches the computer-readable memory of claim 1, wherein the marker node (Abadi, pg. 7, sec. 4.2 Partial execution, fig. 6, “Two arguments to the Run call help define the exact subgraph of the computation graph that will be executed. First, the Run call accepts inputs, an optional mapping of name:port names to “fed” tensors values. Second, the Run call accepts output names, a list of output name[:port] specifications indicating which nodes should be executed, and, if the port portion is present in a name, that that particular output tensor value for the node should be returned to the client if the Run call completes successfully. The graph is transformed based on the values of inputs and outputs. Each node:port specified in inputs is replaced with a feed node, which will pick up the provided input tensor from specially-initialized entries in a Rendezvous object used for the Run call. Similarly, each output name with a port is connected to a special fetch node that arranges to save the output tensor and return it to the client when the Run call is complete.”) reduces a precision of values passed to the identified subgraph during a fine-tune training mode (Abadi pg. 9, sec. 5.5 Lossy Compression, “Some machine learning algorithms, including those typically used for training neural networks…we often use lossy compression of higher precision internal representations when sending data between devices…[f]or example, we often insert special conversion nodes that convert 32-bit floating point representations into a 16-bit floating point representation.” Note: It is being interpreted that the 32-bit floating point representation represents the precision of the processor and the 16-bit floating point representation represents the precision of the hardware accelerator) implemented on the machine learning native framework executing on the processor (Abadi pg. 7, sec. 4.2 Partial Execution, fig. 6, “Often a client wants to execute just a subgraph of the entire execution graph. To support this, once the client has set up a computation graph in a Session, our Run method allows them to execute an arbitrary subgraph of the whole graph, and to inject arbitrary data along any edge in the graph, and to retrieve data flowing along any edge in the graph.” Note: It is being interpreted that the client is executing the machine learning framework (i.e., TensorFlow) on the processor).
Regarding claim 5, Abadi in view of Malaya teaches the computer-readable memory of claim 1, wherein the marker node (Abadi, pg. 7, sec. 4.2 Partial execution, fig. 6, “Two arguments to the Run call help define the exact subgraph of the computation graph that will be executed. First, the Run call accepts inputs, an optional mapping of name:port names to “fed” tensors values. Second, the Run call accepts output names, a list of output name[:port] specifications indicating which nodes should be executed, and, if the port portion is present in a name, that that particular output tensor value for the node should be returned to the client if the Run call completes successfully. The graph is transformed based on the values of inputs and outputs. Each node:port specified in inputs is replaced with a feed node, which will pick up the provided input tensor from specially-initialized entries in a Rendezvous object used for the Run call. Similarly, each output name with a port is connected to a special fetch node that arranges to save the output tensor and return it to the client when the Run call is complete.”) reduces a precision of values output from the identified subgraph during a fine-tune training mode (Abadi pg. 9, sec. 5.5 Lossy Compression, “Some machine learning algorithms, including those typically used for training neural networks…we often use lossy compression of higher precision internal representations when sending data between devices…[f]or example, we often insert special conversion nodes that convert 32-bit floating point representations into a 16-bit floating point representation.” Note: It is being interpreted that the 32-bit floating point representation represents the precision of the processor and the 16-bit floating point representation represents the precision of the hardware accelerator) implemented on the machine learning native framework executing on the processor (Abadi pg. 7, sec. 4.2 Partial Execution, fig. 6, “Often a client wants to execute just a subgraph of the entire execution graph. To support this, once the client has set up a computation graph in a Session, our Run method allows them to execute an arbitrary subgraph of the whole graph, and to inject arbitrary data along any edge in the graph, and to retrieve data flowing along any edge in the graph.” Note: It is being interpreted that the client is executing the machine learning framework (i.e., TensorFlow) on the processor).
Regarding claim 6, Abadi in view of Malaya teaches the computer-readable memory of claim 1, wherein the marker node passes values unchanged between the identified subgraph and the neural network model during an initial training mode (Abadi, pg. 7, sec. 4.2 Partial execution, fig. 6, Figure 6 shows an original [network model] graph on the left, and the b} and outputs=={f:0}. Since we only need to compute the output of node f [of the subgraph], we will not execute nodes d and e [of the neural network model], since they have no contribution to the output of f. Note: It is being interpreted that the marker node executing values only on the subgraph and not on the rest of the neural network model represents the marker node passing values unchanged between the identified subgraph and the neural network model) implemented on the machine learning native framework executing on the processor (Abadi pg. 7, sec. 4.2 Partial Execution, fig. 6, “Often a client wants to execute just a subgraph of the entire execution graph. To support this, once the client has set up a computation graph in a Session, our Run method allows them to execute an arbitrary subgraph of the whole graph, and to inject arbitrary data along any edge in the graph, and to retrieve data flowing along any edge in the graph.” Note: It is being interpreted that the client is executing the machine learning framework (i.e., TensorFlow) on the processor).
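For illustration only, the partial execution behavior relied upon above (executing only the nodes that contribute to a requested output, with fed inputs cutting off upstream computation) can be sketched as follows. This is a minimal sketch and not Abadi's implementation: the function name and the dependency map are hypothetical, and the example graph is an assumed topology rather than a reproduction of Abadi's Figure 6.

```python
def nodes_to_execute(deps, feeds, fetches):
    """Return the set of nodes that must run to produce the fetched outputs.

    deps: dict mapping node -> list of nodes whose outputs it consumes.
    feeds: set of nodes whose values are supplied by the caller.
    fetches: list of nodes whose outputs are requested.
    """
    needed, stack = set(), list(fetches)
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        if node in feeds:
            # A fed value replaces everything upstream of this node.
            continue
        stack.extend(deps.get(node, []))
    # Fed nodes supply values directly, so they are not computed.
    return needed - set(feeds)


# Hypothetical graph: a -> b -> c -> f, and b -> d -> e.
deps = {"b": ["a"], "c": ["b"], "d": ["b"], "e": ["d"], "f": ["c"]}
run = nodes_to_execute(deps, feeds={"b"}, fetches=["f"])
print(sorted(run))
```

In this assumed graph, nodes d and e are never visited because they do not contribute to f, and node a is skipped because the value of b is fed.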
Regarding claim 7, Abadi in view of Malaya teaches the computer-readable memory of claim 1, wherein the identified subgraph comprises a quantization node interposed between a first internal neural node of the subgraph and a second internal neural node of the subgraph, and the quantization node reduces a precision of values passed between the first internal neural node and the second internal neural node during a fine-tune training mode (Abadi pg. 9, sec. 5.5 Lossy Compression, “Some machine learning algorithms, including those typically used for training neural networks…we often use lossy compression of higher precision internal representations when sending data between devices…[f]or example, we often insert special conversion nodes that convert 32-bit floating point representations into a 16-bit floating point representation.” Note: It is being interpreted that the special conversion node is interposed between the first internal neural node and second internal node of the subgraph and the 32-bit floating point representation represents the precision of the processor and the 16-bit floating point representation represents the precision of the hardware accelerator) implemented on the machine learning native framework executing on the processor (Abadi pg. 7, sec. 4.2 Partial Execution, fig. 6, “Often a client wants to execute just a subgraph of the entire execution graph. To support this, once the client has set up a computation graph in a Session, our Run method allows them to execute an arbitrary subgraph of the whole graph, and to inject arbitrary data along any edge in the graph, and to retrieve data flowing along any edge in the graph.” Note: It is being interpreted that the client is executing the machine learning framework (i.e., TensorFlow) on his or her processor).
Regarding claim 8, Abadi in view of Malaya teaches the computer-readable memory of claim 1, wherein the marker node comprises metadata specifying a format for communicating values between the accelerated version of the subgraph and the neural network model executing on the processor in communication with the neural network accelerator (Abadi, pg. 5, sec. 3.2.2 Cross-Device Communication, fig. 4, “Once the node placement has been computed, the graph is partitioned into a set of subgraphs, one per device… [a]t runtime, the implementations of the Send and Receive nodes coordinate to transfer data across devices. This allows us to isolate all communication inside Send and Receive implementations, which simplifies the rest of the runtime. When we insert Send and Receive nodes, we canonicalize all users of a particular tensor on a particular device to use a single Receive node, rather than one Receive node per downstream user on a particular device. This ensures that the data for the needed tensor is only transmitted once between a source device → destination device pair” Note: It is being interpreted that the source device represents the neural network model executing on the processor and the destination device pair represents the neural network accelerator and the )
Regarding claim 10, Abadi teaches a method comprising: receiving source code specifying a neural network model (Abadi, pg. 2, fig. 1, fig. 2, “We have open-sourced the TensorFlow API and a reference implementation under the Apache 2.0 license… [c]lients typically construct a computational graph using one of the supported frontend languages (C++ or Python). An example fragment to construct and then execute a TensorFlow graph using the Python front end is shown in Figure 1, and the resulting computation graph in Figure 2.”), the source code comprising a programming interface marking a subgraph of the neural network model as targeted for hardware acceleration (Abadi, pg. 7, sec. 4.2 Partial execution, fig. 6, “Two arguments to the Run call help define the exact subgraph of the computation graph that will be executed. First, the Run call accepts inputs, an optional mapping of name:port names to “fed” tensors values. Second, the Run call accepts output names, a list of output name[:port] specifications indicating which nodes should be executed, and, if the port portion is present in a name, that that particular output tensor value for the node should be returned to the client if the Run call completes successfully. The graph is transformed based on the values of inputs and outputs. Each node:port specified in inputs is replaced with a feed node, which will pick up the provided input tensor from specially-initialized entries in a Rendezvous object used for the Run call. Similarly, each output name with a port is connected to a special fetch node that arranges to save the output tensor and return it to the client when the Run call is complete.”); compiling the subgraph to the hardware accelerator target to generate configuration information for the hardware accelerator (Abadi, pg. 7, sec. 4.3, “TensorFlow clients can control the placement of nodes on devices by providing partial constraints for a node about which devices it can execute on. For example, [‘]only place this node on a device of type GPU[’]… [w]ithin the confines of these constraints, the placement algorithm is responsible for choosing an assignment of nodes to devices that provides fast execution of the computation and also satisfies various constraints imposed by the devices themselves, such as limiting the total amount of memory needed on a device in order to execute its subset of graph nodes.”).
	Abadi does not teach: configuring the hardware accelerator to evaluate the subgraph of the neural network model, the hardware accelerator configured using the configuration information.
However, Malaya teaches: configuring the hardware accelerator to evaluate the subgraph of the neural network model, the hardware accelerator configured using the configuration information (Malaya, paras. 0030-0031, fig. 4, “The reconfigurable nature of FPGA devices is also well suited for supporting variable precision calculations. The neural network 400 includes neurons 401-420, which are implemented by configuring the configurable logic blocks of FPGA devices 121-123. Neural network 400 is illustrated as including twenty neurons 401-420 and can represent the entire neural network or a portion of a larger network. In some embodiments, a neural network can include any number of neurons, with some neural networks having up to 10^7 neurons or more.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Abadi’s method in view of Malaya to teach: configuring the hardware accelerator to evaluate the subgraph of the neural network model, the hardware accelerator configured using the configuration information. The motivation to do so would be to have a device that can support a wide range of numerical precisions for different machine learning applications (Malaya, para. 0013, “[T]he reconfigurable nature of field programmable gate array (FPGA) devices in a computing system allows the system to support a wide range of numerical precisions and to dynamically vary the precision for key computations at run time. This capability enables optimally efficient computation and provides a competitive advantage (via reductions in energy use, memory access, and computational cost achieved by reducing unnecessary precision) over conventionally operating devices.”).
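For illustration only, the constrained node placement described in the passage of Abadi quoted above for claim 10 (sec. 4.3) can be sketched as a greedy assignment that honors a device-type constraint and a per-device memory limit. This is a simplified sketch under stated assumptions and is not the placement algorithm of Abadi: the function name, the field names ("only_type", "mem", "est_time", "mem_limit"), the device names, and all numeric values are hypothetical.

```python
def place_nodes(nodes, devices):
    """Greedily assign each node to the fastest feasible device.

    nodes: list of dicts with 'name', 'mem', 'est_time' (per device type),
           and an optional 'only_type' constraint.
    devices: list of dicts with 'name', 'type', and 'mem_limit'.
    """
    free = {d["name"]: d["mem_limit"] for d in devices}
    placement = {}
    for node in nodes:
        # Keep devices that satisfy the type constraint and have memory left.
        candidates = [
            d for d in devices
            if node.get("only_type") in (None, d["type"])
            and free[d["name"]] >= node["mem"]
        ]
        if not candidates:
            raise ValueError(f"no feasible device for {node['name']}")
        # Pick the candidate with the lowest estimated execution time.
        best = min(candidates, key=lambda d: node["est_time"][d["type"]])
        placement[node["name"]] = best["name"]
        free[best["name"]] -= node["mem"]
    return placement


devices = [
    {"name": "cpu:0", "type": "CPU", "mem_limit": 100},
    {"name": "gpu:0", "type": "GPU", "mem_limit": 10},
]
nodes = [
    {"name": "matmul", "mem": 8, "only_type": "GPU",
     "est_time": {"CPU": 9.0, "GPU": 1.0}},
    {"name": "conv", "mem": 8, "est_time": {"CPU": 6.0, "GPU": 1.5}},
    {"name": "relu", "mem": 1, "est_time": {"CPU": 0.5, "GPU": 0.2}},
]
placement = place_nodes(nodes, devices)
print(placement)
```

In this example the second node would run faster on the GPU but is assigned to the CPU because the GPU's remaining memory is insufficient, which shows a device-imposed constraint overriding the speed preference.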
Regarding claim 11, Abadi in view of Malaya teaches the method of claim 10, wherein the programming interface is an application programming interface (API)( Abadi, pg. 2, fig. 1, fig. 2,  “We have open-sourced the TensorFlow API and a reference implementation under the Apache 2.0 license… [c]lients typically construct a computational graph using one of the supported frontend languages (C++ or Python). An example fragment to construct and then execute a TensorFlow graph using the Python front end is shown in Figure 1, and the resulting computation graph in Figure 2.”).
Regarding claim 12, Abadi in view of Malaya teaches the method of claim 10, further comprising: using a processor to train the neural network model to generate training data for the subgraph of the neural network model (Abadi pg. 11, sec. 7 Common Programming Idioms, fig. 7(top portion), “One simple technique for speeding up SGD is to parallelize the computation of the gradient for a mini-batch across mini-batch elements. For example, if we are using a mini-batch size of 1000 [training] elements, we can use 10 replicas of the model to each compute the gradient for 100 [training] elements, and then combine the gradients and apply updates to the parameters synchronously, in order to behave exactly as if we were running the sequential SGD algorithm with a batch size of 1000 [training] elements. In this case, the TensorFlow graph simply has many replicas of the portion of the graph that does the bulk of the ) the processor configured to perform computations at a higher precision than the hardware accelerator(Abadi pg. 9, sec. 5.5 Lossy Compression, “Some machine learning algorithms, including those typically used for training neural networks…we often use lossy compression of higher precision internal representations when sending data between devices… we often insert special conversion nodes that convert 32-bit floating point representations into a 16-bit floating point representation.” Note: It is being interpreted that the 32-bit floating point representation represents the precision of the processor and the 16-bit floating point representation represents the precision of the hardware accelerator). 
Regarding claim 13, Abadi in view of Malaya teaches the method of claim 12, wherein implementing code of the programming interface comprises a marker node at a boundary of the subgraph (Abadi, pg. 7, sec. 4.2 Partial execution, fig. 6, “Two arguments to the Run call help define the exact subgraph of the computation graph that will be executed. First, the Run call accepts inputs, an optional mapping of name:port names to “fed” tensors values. Second, the Run call accepts output names, a list of output name[:port] specifications indicating which nodes should be executed, and, if the port portion is present in a name, that that particular output tensor value for the node should be returned to the client if the Run call completes successfully. The graph is transformed based on the values of inputs and outputs. Each node:port specified in inputs is replaced with a feed node, which will pick up the provided input tensor from specially-initialized entries in a Rendezvous object used for the Run call. Similarly, each output name with a port is connected to a special fetch node that arranges to save the output tensor and return it to the client when the Run call is complete.”), and the marker node passes a value unchanged from the neural node model to the subgraph during a first phase of the training (Abadi, pg. 7, sec. 4.2 Partial execution, fig. 6, Figure 6 shows an original [network model] graph on the left, and the b} and outputs=={f:0}. Since we only need to compute the output of node f [of the subgraph], we will not execute nodes d and e [of the neural network model], since they have no contribution to the output of f. Note: It is being interpreted that the marker node executing values only on the subgraph and not on the rest of the neural network model represents the marker node passing values unchanged between the identified subgraph and the neural network model).
Regarding claim 14, Abadi in view of Malaya teaches the method of claim 12, wherein implementing code of the programming interface comprises a marker node at a boundary of the subgraph (Abadi, pg. 7, sec. 4.2 Partial execution, fig. 6, “Two arguments to the Run call help define the exact subgraph of the computation graph that will be executed. First, the Run call accepts inputs, an optional mapping of name:port names to “fed” tensors values. Second, the Run call accepts output names, a list of output name[:port] specifications indicating which nodes should be executed, and, if the port portion is present in a name, that that particular output tensor value for the node should be returned to the client if the Run call completes successfully. The graph is transformed based on the values of inputs and outputs. Each node:port specified in inputs is replaced with a feed node, which will pick up the provided input tensor from specially-initialized entries in a Rendezvous object used for the Run call. Similarly, each output name with a port is connected to a special fetch node that arranges to save the output tensor and return it to the client when the Run call is complete.”), and the marker node (Abadi, pg. 7, sec. 4.2 Partial execution, fig. 6, “Two arguments to the Run call help define the exact subgraph of the computation graph that will be executed. First, the Run call accepts inputs, an optional mapping of name:port names to “fed” tensors values. Second, the Run call accepts output names, a list of output name[:port] specifications indicating which nodes should be executed, and, if the port portion is present in a name, that that particular output tensor value for the node should be returned to the client if the Run call completes successfully. The graph is transformed based on the values of inputs and outputs. Each node:port specified in inputs is replaced with a feed node, which will pick up the provided input tensor from specially-initialized entries in a Rendezvous object used for the Run call. Similarly, each output name with a port is connected to a special fetch node that arranges to save the output tensor and return it to the client when the Run call is complete.”) converts a value from the higher precision of the processor to the lower precision of the hardware accelerator (Abadi pg. 9, sec. 5.5 Lossy Compression, “Some machine learning algorithms, including those typically used for training neural networks…we often use lossy compression of higher precision internal representations when sending data between devices…[f]or example, we often insert special conversion nodes that convert 32-bit floating point representations into a 16-bit floating point representation.” Note: It is being interpreted that the 32-bit floating point representation represents the precision of the processor and the 16-bit floating point representation represents the precision of the hardware accelerator) when the value is passed from the neural node model to the subgraph during a second phase of the training (Malaya para. 0050-0052, fig. 2, fig. 4, fig. 6, “At block 617 the adjustment logic 230 reconfigures each of the computational units in the neural network 400 to use the next number representation as determined for the computational unit at block 609...[a]t block 607, each computational unit performs calculations using the current number representation for the computational unit. The output values 426(2) and 427(2) for the second iteration (i=2) are generated based on these calculations by the computational units in network 400.” Note: It is being interpreted that the adjustment logic 230 represents the neural node model and each of the computational units in the neural network 400 represents a node in the subgraph).
Regarding claim 15, Abadi in view of Malaya teaches the method of claim 12, wherein implementing code of the programming interface comprises a quantization node between a first internal neural node and a second internal neural node of the subgraph, and the quantization node converts a value from the higher precision of the processor to the lower precision of the hardware accelerator(Abadi pg. 9, sec. 5.5 Lossy Compression, “Some machine learning algorithms, including those typically used for training neural networks…we often use lossy compression of higher precision internal representations when sending data between devices…[f]or example, we often insert special conversion nodes that convert 32-bit floating point representations  into a 16-bit floating point representation.” Note: It is being interpreted that the special conversion node is interposed between the first internal neural node  and second internal node of the subgraph and the 32-bit floating point representation represents the precision of the processor and the 16-bit floating point representation represents the precision of the hardware accelerator) when the value is generated by the first internal neural node and passed to second internal neural node during a second phase of the training(Malaya para. 0050-0052, fig. 2, fig. 4, fig. 6,  “At block 617 the adjustment logic 230 reconfigures each of the computational units in the neural network 400 to use the next number representation as determined for the computational unit at block 609...[a]t block 607, each computational unit performs calculations using the current number representation for the computational unit. The output values 426(2) and 427(2) for the second iteration (i=2) are generated based on these calculations by the computational units in network 400.” Note: It is being interpreted that each of the computational units in the neural network 400 represents a node in the subgraph).
using a processor to evaluate the neural network model and the subgraph of the neural network model before the hardware accelerator is configured using the configuration information (Abadi pgs. 4-5, sec. 3.2.1 Node Placement, “Given a computation graph, one of the main responsibilities of the TensorFlow implementation is to map the computation onto the set of available devices… one input to the placement algorithm is a cost model, which contains estimates of the sizes (in bytes) of the input and output tensors for each graph node, along with estimates of the computation time required for each node when presented with its input tensors. This cost model is either statically estimated based on heuristics associated with different operation types, or is measured based on an actual set of placement decisions for earlier executions of the graph. The placement algorithm first runs a simulated execution of the graph. The simulation is described below and ends up picking a device for each node in the graph using greedy heuristics. The node to device placement generated by this simulation is also used as the placement for the real execution.”), the processor configured to convert computations of the subgraph from a higher precision of the processor to a lower precision of the hardware accelerator (Abadi pg. 9, sec. 5.5 Lossy Compression, “Some machine learning algorithms, including those typically used for training neural networks…we often use lossy compression of higher precision internal representations when sending data between devices… we often insert special conversion nodes that convert 32-bit floating point representations into a 16-bit floating point representation.” Note: It is being interpreted that the 32-bit floating point representation represents the precision of the processor and the 16-bit floating point representation represents the precision of the hardware accelerator) during the evaluation by the processor (Abadi pg. 5, sec. 3.2.1 Node Placement, “This heuristic takes into account the estimated or measured execution time of the operation on
Regarding claim 17, Abadi teaches a system, comprising: a neural network server in communication with a neural network accelerator, the neural network server comprising: at least one processor(Abadi pg. 4, sec. 3 Implementation, fig. 3(distributed system structure), “The main components in a TensorFlow system are the client, which uses the Session interface to communicate with the master, and one or more worker processes, with each worker process responsible for arbitrating access to one or more computational devices (such as CPU cores or GPU cards) and for executing graph nodes on those devices as instructed by the master… [t]he distributed implementation…support[s]… an environment where the client, the master, and the workers can all be in different processes on different machines. In our distributed environment, these different tasks are containers in jobs managed by a cluster scheduling system.” Note: It is being interpreted that in the distributed implementation the client represents the neural network server and the workers represent the neural network accelerators), the at least one processor configured to perform computations at a higher precision than the neural network accelerator(Abadi pg. 9, sec. 
5.5 Lossy Compression, “Some machine learning algorithms, including those typically used for training neural networks…we often use lossy compression of higher precision internal representations when sending data between devices… we often insert special conversion nodes that convert 32-bit floating point representations into a 16-bit floating point representation.” Note: It is being interpreted that the 32-bit floating point representation represents the precision of the processor and the 16-bit floating point representation represents the precision of the hardware accelerator), cause the neural network server to perform a method, the instructions comprising: instructions to compile a neural network model for execution on the system, wherein the neural network model is specified using source code comprising an application programming interface (API)(Abadi, pg. 2, fig. 1, fig. 2,  “We have open-sourced the TensorFlow API and a reference implementation under the Apache 2.0 license… [c]lients typically construct a computational graph using one of the supported frontend languages (C++ or Python). An example fragment to construct and then execute a TensorFlow graph using the Python front end is shown in Figure 1, and the resulting computation graph in Figure 2.”) marking a subgraph of the neural network model as targeted for the neural network accelerator(Abadi, pg. 7, sec. 4.3, “TensorFlow clients can control the placement of nodes on devices by providing partial constraints for a node about which devices it can execute on. For example, [‘]only place this node on a device of type GPU[’]… [w]ithin the confines of these constraints, the placement algorithm is responsible for choosing an assignment of nodes to devices that provides fast execution of the computation an also satisfies various constraints imposed by the devices themselves, such as limiting the total amount of memory needed on a device in order to execute its subset of graph nodes.”).
Abadi does not teach: and a computer-readable memory storing computer-executable instructions that when executed by the at least one processor; and an output of compilation is configuration data for configuring the neural network accelerator; and instructions to configure the neural network accelerator to evaluate the neural network model; and wherein the neural network accelerator comprises: configurable logic that is configurable using at least the generated configuration data, the configurable logic comprising a plurality of regions, a respective region configured to perform an operation of a respective node of the subgraph; and memory comprising a plurality of memory elements, wherein a respective memory element is locally accessible by a respective region of the configurable logic.
However Malaya teaches: and a computer-readable memory storing computer-executable instructions that when executed by the at least one processor(Malaya, para. 0057, Certain embodiments may be implemented as a computer program product that may include instructions stored on a non-transitory computer-readable medium. These instructions may be used to program a general purpose or special-purpose processor to perform the described operations.”); and an output of compilation is configuration data for configuring the neural network accelerator(Malaya, paras. 0014-0015, fig. 1, “[T]he host device 110 provides an interface through which a user can define configurations and specify tasks to be executed in the FPGA devices 121-123…[t]he host device 110 includes programming logic 111 that programs the FPGA devices 121-123 according to the configurations specified by the user.”); and instructions to configure the neural network accelerator to evaluate the neural network model(Malaya, para. 0042, fig. 1, fig. 2, fig. 4, fig. 6, At block 603, “[T]he computational units 210 and 220 are coupled with each other and with other neurons in a neural network 400. Accordingly, the computational units 210 and 220 are programmed to function as neurons (e.g., neurons 401 and 405) in the neural network 400. The neurons 401-420 in neural network 400 are arranged in multiple layers 421-423, with each layer including a subset of the neurons 401-420. The set of computational units 210 and 220 are thus connected to each other and to other neurons in the network 400 to generate a set of output values 426(i) and 427(i) based on inputs 424(i) and 425(i) for each iteration i.”); and wherein the neural network accelerator comprises: configurable logic that is configurable using at least the generated configuration data, the configurable logic comprising a plurality of regions(Malaya, paras. 0014-0017, fig. 1, fig.2,   FIG. 2 illustrates a functional block diagram for a
set of computational units 210 and 220 implemented in one or more of the FPGAs 121-123, according to an embodiment. The components illustrated in FIG. 2 can represent circuitry contained within one FPGA device, or may be distributed over multiple FPGA devices. Each of the computational units 210 and 220 are circuits implemented by configuring one or more configurable logic blocks (CLBs) of one or more of the FPGA devices 121-123. Each of the computational units 210 and 220 has at least one input and at least one output. The computational units 210 and 220 each generate output values based on one or more input values.”), a respective region configured to perform an operation of a respective node of the subgraph (Malaya, para. 0031, fig. 1, fig. 2, fig. 4, “The neural network 400 includes neurons 401-420, which are implemented by configuring the configurable logic blocks of FPGA devices 121-123. Neural network 400 is illustrated as including twenty neurons 401-420 and can represent the entire neural network or a portion of a larger network. In some embodiments, a neural network can include any number of neurons, with some neural networks having up to 107 neurons or more. The neurons in a neural network can be connected according to a variety of different topologies… [i]n one embodiment, each of the neurons 401-420 is implemented by a computational unit. For example, two of the connected neurons 401 and 405 may be implemented by computational units 210 and 220, respectively, with the output 203(i) connected to one or more neurons in subsequent layers.”); and memory comprising a plurality of memory elements, wherein a respective memory element is locally accessible by a respective region of the configurable logic (Malaya, para. 0023, fig. 2, In general, the LUT [lookup table] 130 can be used to store any information that can be used for determining which number representations 202(i) to use in the computational units 210 and 220. 
In one embodiment, the LUT [lookup table] 130 stores values correlating each of the computational units 210 and 220 with one or more corresponding number representations. The LUT [lookup table] 130 may also store values indicating how to adjust the number representations 202(i) in response to particular changes in the output 203(i) and/or the power consumption 205(i). The LUT [lookup table] 130 also stores the number representations themselves, indicating the number of bits to use for each field for each of the representations.”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Abadi’s system in view of Malaya to teach: and a computer-readable memory storing computer-executable instructions that when executed by the at least one processor; and an output of compilation is configuration data for configuring the neural network accelerator; and instructions to configure the neural network accelerator to evaluate the neural network model; and wherein the neural network accelerator comprises: configurable logic that is configurable using at least the generated configuration data, the configurable logic comprising a plurality of regions, a respective region configured to perform an operation of a respective node of the subgraph; and memory comprising a plurality of memory elements, wherein a respective memory element is locally accessible by a respective region of the configurable logic. The motivation to do so would be to have a device that can support a wide range of numerical precisions for different machine learning applications (Malaya, para. 0013, “[T]he reconfigurable nature of field programmable gate array (FPGA) devices in a computing system allows the system to support a wide range of numerical 
Regarding claim 18, Abadi in view of Malaya teaches the system of claim 17, wherein a boundary of the subgraph is marked using marker nodes (Abadi, pg. 7, sec. 4.2 Partial Execution, fig. 6, “Two arguments to the Run call help define the exact subgraph of the computation graph that will be executed. First, the Run call accepts inputs, an optional mapping of name:port names to “fed” tensors values. Second, the Run call accepts output names, a list of output name[:port] specifications indicating which nodes should be executed.”) and compiling the neural network model comprises identifying all of the marker nodes at the boundary of the subgraph (Abadi, pg. 7, sec. 4.2 Partial Execution, fig. 6, “The graph is transformed based on the values of inputs and outputs. Each node:port specified in inputs is replaced with a feed node… Similarly, each output name with a port is connected to a special fetch node… Finally, once the graph has been rewritten with the insertion of these special feed and fetch nodes, the set of nodes to execute can be determined by starting at each of the nodes named by any output and working backwards in the graph using the graph dependencies to determine the full set of nodes that must be executed in the rewritten graph in order to compute the outputs.”).
Regarding claim 19, Abadi in view of Malaya teaches the system of claim 17, wherein compiling the neural network model comprises assigning training data of respective neural nodes of the subgraph to respective memory elements of the neural network accelerator (Abadi, pg. 8, sec. 4.5 Input Operations, “Although input data can be provided to a computation via feed nodes, another common mechanism used for training large-scale machine learning… In configurations where the client process is separate from the worker process, if the data were fed, it typically would require an extra network hop (from the storage system to the client and then from the client to the worker vs. directly from the storage system to the[] worker when using an input node).” Note: It is being interpreted that the worker represents the neural network accelerator).
Regarding claim 20, Abadi in view of Malaya teaches the system of claim 17, wherein the training data is generated by using calculations of multiple precisions for the computations of the subgraph during training of the neural network model (Malaya, para. 0054, fig. 1, fig. 4, fig. 6,  “The process 600 thus repeats blocks 607-619 for each of multiple iterations to generate outputs 426(i) and 427(i) for each iteration i, while adjusting the number precision used for performing computations in the computational units functioning as neurons 401-420 in the network 400. At each iteration of block 609, the adjustment logic 230 determines a new number representation [i.e., precision] based on the output accuracy, power consumption, values in LUT 130, or other signals.”).
8.	Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Abadi et al. "Tensorflow: Large-scale machine learning on heterogeneous distributed systems." arXiv preprint arXiv:1603.04467 (2016) in view of Malaya et al. (US 2019/0171420 A1, “Malaya”) and in view of Zhuang et al. “Towards Effective Low-bitwidth Convolutional Neural Networks.” arXiv preprint arXiv:1711.00205v2 (2017).

However, Zhuang teaches: wherein the configuration information comprises training data, and the training data of the subgraph is generated using higher precision computations during early training and lower precision computations during later training (Zhuang, pg. 3, sec. 3.2 Two-stage optimization, “To reduce the difficulty of training, we devise a two-stage optimization procedure: at the first stage, we only quantize the weights of the network while setting the activations to be full precision. After the converge (or after certain number of iterations) of this model, we further apply the quantization function on the activations as well….”).
Accordingly, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Abadi’s method in view of Malaya and in view of Zhuang to teach: wherein the configuration information comprises training data, and the training data of the subgraph is generated using higher precision computations during early training and lower precision computations during later training. The motivation to do so would be to improve the accuracy of a low-precision neural network (Zhuang, pg. 1, Abstract, “Optimizing a low-precision network is very challenging since the training process can easily get trapped in a poor local minima, which results in substantial accuracy loss. To mitigate this problem…we propose to use a two-stage optimization strategy to progressively find good local minima. Specially, we propose to first optimize a net with quantized weights and then quantized activations. This is in contrast to the traditional methods which optimize them simultaneously.”).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Pradelle, Benoît, et al. "Polyhedral optimization of tensorflow computation graphs." Programming and Performance Visualization Tools. Springer, Cham, 2017 (details polyhedral optimizations that allow the data dependencies of a graph to be defined exactly, enabling such things as automatic parallelization).
Lane, Nicholas D., et al. "Deepx: A software accelerator for low-power deep learning inference on mobile devices." 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN). IEEE, 2016 (details a software accelerator that is able to decompose deep neural networks into different neural subgraphs based on the underlying hardware).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ADAM CLARK STANDKE whose telephone number is (571) 270-1806.  The examiner can normally be reached on 8:30-6:30 M-F. 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

/ADAM C STANDKE/Examiner, Art Unit 2122
/ERIC NILSSON/Primary Examiner, Art Unit 2122