Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims
Claims 1-30 are pending in the present application. Claims 1, 2, 19, 24, and 25 have been amended.

Response to Arguments
Applicant's arguments filed 4/04/2022 have been fully considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.


Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
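The table below identifies, for each means-plus-function limitation, the portions of the specification that the examiner has interpreted as disclosing the corresponding structure. Each entry lists the claim number, the claim limitation, and the supporting disclosure.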
Claim 19
means for deriving a simplified version of the directed graph
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” Fig. 2. 201
[0019] “The simplified version of the directed graph may be a down-sampled version of the directed graph. The down-sampling can involve reducing the resolution of the individual elements associated with the edges and vertices of the directed graph. For example, with specific reference to an ANN with convolutional and fully connected layers, the weight and filter values could be rounded off to reduce the number of bits required to represent each value. The simplification can be conducted at the graph, sector, layer, or element level.” Fig. 3, [0031], “Fig. 3 provides an illustration of one approach for executing step 201 from Fig. 2. Two sets of axes 300 and 310 illustrate one approach for deriving simplified version of direct graph 212 from directed graph 211. The x-axis of both sets of axes is "i" which is a counter variable for representing the elements of a tensor used in the execution of a directed graph. In this example, the tensor is a set of weights in a layer of an ANN represented by the directed graph. In a modern ANN, the number of weights can be quite large, for example, the tensor may include a million elements. The y-axis of graph 300 illustrates the value of the weight associated with counter "i". In this example, simplified version of directed graph 212 is obtained by down-sampling weight tensor 301 using polynomial interpolation. In this approach, polynomial 311 is derived to produce a function F(i) that will give an approximation of the value of weight wi. The polynomial can be represented by a set of coefficients equal to one plus the order of the polynomial. A computation utilizing weight tensor 301 can thereby be greatly simplified by transforming the computation into the polynomial space, and operating on the inputs to the weight layer using the much smaller coefficient tensor 312. 
Aside from the overhead associated with deriving the polynomial and transforming to and from the coefficient space, the simplified version of the directed graph will be less computationally intensive due to the reduced number of multiplications that need to take to execute the layer associated with weight tensor 301 in the directed graph and coefficient tensor 312 in the simplified version of the directed graph.”
Claim 19
means for applying a pilot input tensor to the simplified version of the directed graph
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” Fig. 2. 202
[0018] “The application of an input to the directed graph can be conceptualized as the provisioning of values to the origin vertices of the graph. For example, with reference to Fig. 1, applying input tensor X to directed graph 100 involves obtaining the values of the elements of tensor X from memory and making them available to the hardware that will conduct the calculations associated with the first set of edges of directed graph 100.” [0032], “Once the simplified version of the directed graph is obtained, a pilot tensor is applied to the simplified version as described above with reference to step 202. The pilot tensor and simplified version of the directed graph are used to obtain relevant information regarding how the actual directed graph will respond when a live input tensor is applied to the directed graph. As such, the pilot input tensor can in some cases be identical to the live input tensor. However, the pilot input tensor can also be modified if needed to operate with the simplified version of the directed graph, or to further simplify execution of the simplified version of the directed graph. For example, the pilot input tensor could have a lower rank or dimensionality than the live input tensor if the simplified version of the directed graph was not compatible with the rank or dimensionality of the live input tensor. The pilot input tensor could also be a down sampled or otherwise simplified version of the live input tensor. For example, the pilot input tensor could be a version of the live input tensor in which the data structures used to store the values of the tensor have been replaced with more simplified structures.”
Claim 19
means for obtaining a collection of execution data during the application of the pilot input tensor to the simplified version of the directed graph
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” Fig. 2. 203
[0022] “Data flow diagram 210 represents the pilot input tensor X being applied to the simplified version of the directed graph 212 to produce execution data 213. The execution data 213 is represented as a markup of the simplified version of the directed graph wherein highlighted portions are identified as having a near negligible contribution to the output tensor. However, the execution data can take on numerous other forms.” Fig. 4 [0033], “When the pilot input tensor is applied to the simplified version of the directed graph, execution data is obtained that will be later used to condition the execution of the directed graph. The data is generally obtained during execution of the directed graph, but can be separate and distinct from the actual values that are produced to obtain the output of the directed graph. For example, the execution data can be a set of execution data values such as the outputs of each hidden layer in an ANN. However, the execution data values can also be derived from those values via a comparison or other computation. The execution data values can represent, or can be used to derive, an approximation of the relative importance of the computation from which they were generated on the overall execution of the directed graph. For example, the execution data values could each uniquely correspond with a set of vertices in the directed graph, each vertex in the set of vertices could produce a contribution to the inference tensor produced by the directed graph, and each execution data value could be proportional in magnitude to the contribution to the inference tensor of each vertex. The execution data values can correspond to any aspect of the directed graph and can represent the importance of that aspect of the directed graph in any number of ways. In specific approaches, the relative importance will be represented by set levels such as high, medium, or low. However, the relative importance could be represented by a numerical value that is proportional to an impact on the inference tensor of the corresponding aspect of the directed graph. The proportionality may be linear or logarithmic.”
[0038] “Fig. 4 provides a conceptual data flow diagram for how the execution data and markup can be generated during the execution of the directed graph. As illustrated, different edges of the directed graph will be associated with different calculations 405 and 406. The two illustrated calculations are two matrix multiplications that could represent the multiplication of a set of weights with an input from a prior layer for purposes of generating a data element for the next layer in an artificially neural network. In the basic example illustrated in Fig. 4, the output of these calculations are compared to a threshold value Z. If the threshold is exceeded, the calculation is considered of high priority. If the threshold is not exceeded, the calculation is considered of low priority. In this example, the execution data is the determination made by this calculation. The execution data can then be used to contribute to a markup of the directed graph as illustrated by the different shading levels in markup 404.”
Claim 19
means for applying a live input tensor to the directed graph
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” Fig. 2. 205
[0018] “The application of an input to the directed graph can be conceptualized as the provisioning of values to the origin vertices of the graph. For example, with reference to Fig. 1, applying input tensor X to directed graph 100 involves obtaining the values of the elements of tensor X from memory and making them available to the hardware that will conduct the calculations associated with the first set of edges of directed graph 100.”
[0032] “Once the simplified version of the directed graph is obtained, a pilot tensor is applied to the simplified version as described above with reference to step 202. The pilot tensor and simplified version of the directed graph are used to obtain relevant information regarding how the actual directed graph will respond when a live input tensor is applied to the directed graph. As such, the pilot input tensor can in some cases be identical to the live input tensor. However, the pilot input tensor can also be modified if needed to operate with the simplified version of the directed graph, or to further simplify execution of the simplified version of the directed graph. For example, the pilot input tensor could have a lower rank or dimensionality than the live input tensor if the simplified version of the directed graph was not compatible with the rank or dimensionality of the live input tensor. The pilot input tensor could also be a down sampled or otherwise simplified version of the live input tensor. For example, the pilot input tensor could be a version of the live input tensor in which the data structures used to store the values of the tensor have been replaced with more simplified structures.”
Claim 19
means for conditioning the execution of the directed graph, during the application of the live input tensor to the directed graph, using the collection of execution data
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” Fig. 2. 204-205
[0035], “The execution data can be utilized to produce a markup of the simplified version of the directed graph which tags the directed graph with different levels of priority such as high, medium, or low. These priority values could then be stored in association with different portions of the directed graph. The different levels of priority can describe how much of a contribution to the output tensor the various portions of the directed graph contributed. The markup can have fixed gradations or can be a heat map with smooth transitions across the graph to indicate the various levels of priority. The priority values for each edge or vertex can be calculated in real time as the directed graph is executing calculations associated with that edge or vertex. For example, the magnitude of a specific computation can be used as a proxy for the priority of that computation, and the execution data can be saved as soon as the computation has been carried out. However, the values can also be updated continuously as the graph continues to carry out the overall computation. Such approaches are beneficial where downstream calculations effectively negate the perceived impact of upstream calculations. As such, the magnitude of downstream calculations can be fed back to impact the stored execution data from prior computations along the same path through the directed graph. The effect of this feedback can be tailored based on how many layers in the directed graph have passed between the value that is being updated and the newly obtained value.” [0036], “The execution data can also be used to generate specific instructions for a later execution of the directed graph. 
For example, in the same way that the execution data can be used to generate a tag to indicate that a specific edge of the directed graph is of "low" priority, the execution data can also be used to generate an instruction to reduce the fidelity of the calculations associated with that edge of the directed graph, or to suppress the calculations associated with that edge of the directed graph. Specific approaches for conditioning the execution of the directed graph are discussed in more detail below. Many of these approaches can be triggered by reading the priority information from a tag, and triggering some form of conditional computation based off that tag. However, approaches in which the execution data is the instruction itself short circuits this intermediate lookup step by directly generating the instruction for how a portion of the directed graph should be executed at a later time.” [0039], “The execution data can be used to condition the execution of the directed graph in numerous ways. In general, the approaches used to simplify the directed graph for purposes of generating the simplified version of the directed graph can also be applied to condition the execution of the directed graph. However, as the conditional execution is being guided by information that has been obtained about the performance of the graph, the degree by which the computations are simplified can be much greater in the case of the conditioned execution than in the case of generating the simplified version. As stated previously, the steps associated with conditional execution in Fig. 2 are drawn along separate paths because in different approaches they will exhibit various temporal relationships to each other. For example, the directed graph could be primed for conditional execution prior to the conditional execution of the directed graph, using the stored execution data. 
In particular, in the approach in which the execution data is stored in the header of packets representing the directed graph, the directed graph would thereby be effectively primed for conditional execution because the priority data would be available for utilization to condition execution in real time as the payload of the packet was pulled for computation during the execution of the directed graph. The priming could include identifying the associated portion of directed graph data, packaging the execution and directed graph data into a data package, and storing the data package at a set location in memory. In another example, the execution of the directed graph will reference a separate data structure as computation is being carried out to determine if and how the associated computation should be conditioned. The separate data structure could be a markup with priorities stored in combination with identifiers of specific locations in the directed graph and the execution of the directed graph could involve obtaining the priorities from the separate data structure using the identifiers as the associated calculation was being carried out.” Fig. 5
[0040] “The execution of the directed graph can be conditioned in numerous ways. Generally, the degree to which the computation is conditioned can be set to vary across the directed graph and can include various gradations that align with the relative priority of that portion of the graph. For example, regions of relatively high priority could be computed just as they would be in the unconditionally executed directed graph, while regions of relatively low priority could be excluded from computation entirely. The various approaches for conditional computation discussed below could be mixed and assigned in various ways to the levels of priority. For example, high, medium, and low priorities could be associated with three entirely separate conditional computation schemes. As another example, the conditional computation scheme could be held constant across the directed graph, but the relative accuracy of the scheme could be modified in accordance with the priorities. For example, a degree of rounding or down-sampling could be set proportional to the priority level with a smooth transition from original value execution, to rounded value execution, to execution conducted independently of the original values. Such approaches could be efficiently applied if the priority value was a smoothly varying numerical value.” [0041] “The actual conditional execution of the directed graph can be conducted in various ways. The conditioning and the forms of conditional computation being separated concepts. Based on the execution data, the fidelity of various computations in the execution of the directed graph can be selectively decreased to different levels. For example, the conditional computation could involve decreasing the number of bits used to represent the inputs or outputs of a given computation. As another example, the data structure used to represent the inputs or outputs of a given computation could be simplified (e.g., from 8-bit floating point to 4- bit fixed point). 
As another example, the conditional computation could involve providing a fixed value in place of executing the computation. In one particular example, this value could be stored in a header of a data structure that would have been involved in the computation. As another example, the actual arithmetic portion of the computation could be simplified such that it discarded a certain number of LSBs from the computation. As another example, the computation could be suppressed altogether without even the need for providing a masked value. In even more specific approaches, replacement values for the output of the computation could be stored downstream in association with later stages of the directed graph.”
Claim 19
means for obtaining an output tensor from the conditional execution of the directed graph
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” 
“Execution of the directed graph will involve the execution of calculations associated with the edges of the directed graph, and the ultimate generation of output tensor Y. Tensor Y is therefore obtained from the directed graph and can be stored in memory as a distinct unit of data once the directed graph has been executed. Tensor Y can be an inference tensor generated by a machine intelligence system. However, the directed graphs executed by the methods of flow chart 200 can include multiple inputs or multiple outputs and can represent other computational systems besides those associated with machine intelligence.” Fig. 2. 206 



Claim 20
means for storing the execution data in memory as stored execution data
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” [0021] “Steps 202 and 203 are illustrated as sequential because the execution data is generally available for storage in memory after the input tensor has been applied and the graph has completed execution.” Fig. 2 202-203.
[0034] “However, the execution data 404 is produced and stored orthogonally to the main data flow of the directed graph. The execution data can be obtained and stored in various ways. The execution data can be obtained during the application of the input tensor to the simplified version of the directed graph by monitoring the values produced internally during the calculations associated with the edges of the directed graph.”
[0035], “The execution data can be utilized to produce a markup of the simplified version of the directed graph which tags the directed graph with different levels of priority such as high, medium, or low. These priority values could then be stored in association with different portions of the directed graph. The different levels of priority can describe how much of a contribution to the output tensor the various portions of the directed graph contributed. The markup can have fixed gradations or can be a heat map with smooth transitions across the graph to indicate the various levels of priority. The priority values for each edge or vertex can be calculated in real time as the directed graph is executing calculations associated with that edge or vertex. For example, the magnitude of a specific computation can be used as a proxy for the priority of that computation, and the execution data can be saved as soon as the computation has been carried out. However, the values can also be updated continuously as the graph continues to carry out the overall computation. Such approaches are beneficial where downstream calculations effectively negate the perceived impact of upstream calculations. As such, the magnitude of downstream calculations can be fed back to impact the stored execution data from prior computations along the same path through the directed graph. The effect of this feedback can be tailored based on how many layers in the directed graph have passed between the value that is being updated and the newly obtained value.” [0037], “The execution data can be stored in association with the portions of the directed graph to which they relate in various ways. For example, a markup could be stored in a distributed set of memory locations, or at a single memory location such that all of the data could be recalled using a single memory address or a contiguous sequence of memory addresses. The data can also be stored as an entirely separate data structure in memory. 
To use the example of 213, the heat map could be stored separately with priority levels and tags identifying specific portions of the graph. Alternatively, the data or markup can be stored directly within the data structures that represent the directed graph and can be obtained along with the data for the directed graph via a single address call to memory. For example, the execution data could be stored in packet headers where the payload of each packet was the data that represented the directed graph itself. To use the example of a directed graph that implements an ANN, the weights or filters of the ANN could be stored along with a value that represented the impact of that weight or filter on the output tensor in response to the pilot input tensor. In a specific example that is in accordance with this class of approaches, a priority value for a weight tensor and the weight tensor itself could be obtained from a memory location using a single memory address.”
Claim 20
means for priming the directed graph for the conditional execution, prior to the conditional execution of the directed graph, using the stored execution data
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” Fig. 2. 204-205
[0039] “For example, the directed graph could be primed for conditional execution prior to the conditional execution of the directed graph, using the stored execution data. In particular, in the approach in which the execution data is stored in the header of packets representing the directed graph, the directed graph would thereby be effectively primed for conditional execution because the priority data would be available for utilization to condition execution in real time as the payload of the packet was pulled for computation during the execution of the directed graph. The priming could include identifying the associated portion of directed graph data, packaging the execution and directed graph data into a data package, and storing the data package at a set location in memory. In another example, the execution of the directed graph will reference a separate data structure as computation is being carried out to determine if and how the associated computation should be conditioned. The separate data structure could be a markup with priorities stored in combination with identifiers of specific locations in the directed graph and the execution of the directed graph could involve obtaining the priorities from the separate data structure using the identifiers as the associated calculation was being carried out.”



Claim 21
means for generating a markup of the directed graph using the collection of execution data
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” 
[0035] “The execution data can be utilized to produce a markup of the simplified version of the directed graph which tags the directed graph with different levels of priority such as high, medium, or low. These priority values could then be stored in association with different portions of the directed graph. The different levels of priority can describe how much of a contribution to the output tensor the various portions of the directed graph contributed. The markup can have fixed gradations or can be a heat map with smooth transitions across the graph to indicate the various levels of priority. The priority values for each edge or vertex can be calculated in real time as the directed graph is executing calculations associated with that edge or vertex. For example, the magnitude of a specific computation can be used as a proxy for the priority of that computation, and the execution data can be saved as soon as the computation has been carried out. However, the values can also be updated continuously as the graph continues to carry out the overall computation.”
[0038] “The execution data can then be used to contribute to a markup of the directed graph as illustrated by the different shading levels in markup 404.” Figures 2, 4.



Claim 22
means for storing the markup in a distributed set of memory locations
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” [0021] “Steps 202 and 203 are illustrated as sequential because the execution data is generally available for storage in memory after the input tensor has been applied and the graph has completed execution.” Fig. 2 202-203.
[0037] “The execution data can be stored in association with the portions of the directed graph to which they relate in various ways. For example, a markup could be stored in a distributed set of memory locations, or at a single memory location such that all of the data could be recalled using a single memory address or a contiguous sequence of memory addresses. The data can also be stored as an entirely separate data structure in memory. To use the example of 213, the heat map could be stored separately with priority levels and tags identifying specific portions of the graph. Alternatively, the data or markup can be stored directly within the data structures that represent the directed graph and can be obtained along with the data for the directed graph via a single address call to memory. For example, the execution data could be stored in packet headers where the payload of each packet was the data that represented the directed graph itself. To use the example of a directed graph that implements an ANN, the weights or filters of the ANN could be stored along with a value that represented the impact of that weight or filter on the output tensor in response to the pilot input tensor.”
22
means for obtaining the priority value and the weight tensor from a memory location in the distributed set of memory locations using a single address
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” [0021] “Steps 202 and 203 are illustrated as sequential because the execution data is generally available for storage in memory after the input tensor has been applied and the graph has completed execution.” Fig. 2 202-203.
[0037] “The execution data can be stored in association with the portions of the directed graph to which they relate in various ways. For example, a markup could be stored in a distributed set of memory locations, or at a single memory location such that all of the data could be recalled using a single memory address or a contiguous sequence of memory addresses. The data can also be stored as an entirely separate data structure in memory. To use the example of 213, the heat map could be stored separately with priority levels and tags identifying specific portions of the graph. Alternatively, the data or markup can be stored directly within the data structures that represent the directed graph and can be obtained along with the data for the directed graph via a single address call to memory. For example, the execution data could be stored in packet headers where the payload of each packet was the data that represented the directed graph itself. To use the example of a directed graph that implements an ANN, the weights or filters of the ANN could be stored along with a value that represented the impact of that weight or filter on the output tensor in response to the pilot input tensor. In a specific example that is in accordance with this class of approaches, a priority value for a weight tensor and the weight tensor itself could be obtained from a memory location using a single memory address.”



23
means for generating a markup of the directed graph using the collection of execution data
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” 
[0035] “The execution data can be utilized to produce a markup of the simplified version of the directed graph which tags the directed graph with different levels of priority such as high, medium, or low. These priority values could then be stored in association with different portions of the directed graph. The different levels of priority can describe how much of a contribution to the output tensor the various portions of the directed graph contributed. The markup can have fixed gradations or can be a heat map with smooth transitions across the graph to indicate the various levels of priority. The priority values for each edge or vertex can be calculated in real time as the directed graph is executing calculations associated with that edge or vertex. For example, the magnitude of a specific computation can be used as a proxy for the priority of that computation, and the execution data can be saved as soon as the computation has been carried out. However, the values can also be updated continuously as the graph continues to carry out the overall computation.”
[0038] “The execution data can then be used to contribute to a markup of the directed graph as illustrated by the different shading levels in markup 404.” Figures 2, 4.
23
means for storing the markup in a distributed set of memory locations
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” [0021] “Steps 202 and 203 are illustrated as sequential because the execution data is generally available for storage in memory after the input tensor has been applied and the graph has completed execution.” Fig. 2 202-203.
[0037] “The execution data can be stored in association with the portions of the directed graph to which they relate in various ways. For example, a markup could be stored in a distributed set of memory locations, or at a single memory location such that all of the data could be recalled using a single memory address or a contiguous sequence of memory addresses. The data can also be stored as an entirely separate data structure in memory. To use the example of 213, the heat map could be stored separately with priority levels and tags identifying specific portions of the graph. Alternatively, the data or markup can be stored directly within the data structures that represent the directed graph and can be obtained along with the data for the directed graph via a single address call to memory. For example, the execution data could be stored in packet headers where the payload of each packet was the data that represented the directed graph itself. To use the example of a directed graph that implements an ANN, the weights or filters of the ANN could be stored along with a value that represented the impact of that weight or filter on the output tensor in response to the pilot input tensor.”
23
means for conditioning an update of the directed graph using the markup
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” 
[0035] “The execution data can be utilized to produce a markup of the simplified version of the directed graph which tags the directed graph with different levels of priority such as high, medium, or low. These priority values could then be stored in association with different portions of the directed graph. The different levels of priority can describe how much of a contribution to the output tensor the various portions of the directed graph contributed. The markup can have fixed gradations or can be a heat map with smooth transitions across the graph to indicate the various levels of priority. The priority values for each edge or vertex can be calculated in real time as the directed graph is executing calculations associated with that edge or vertex. For example, the magnitude of a specific computation can be used as a proxy for the priority of that computation, and the execution data can be saved as soon as the computation has been carried out. However, the values can also be updated continuously as the graph continues to carry out the overall computation. Such approaches are beneficial where downstream calculations effectively negate the perceived impact of upstream calculations. As such, the magnitude of downstream calculations can be fed back to impact the stored execution data from prior computations along the same path through the directed graph. The effect of this feedback can be tailored based on how many layers in the directed graph have passed between the value that is being updated and the newly obtained value.”
[0038] “The execution data can then be used to contribute to a markup of the directed graph as illustrated by the different shading levels in markup 404.” Figures 2, 4.
[0045], “In the specific application of an ANN the conditional computation can be used in both the generation of an inference tensor from the ANN and in training of the ANN. In approaches using back propagation, the updating of the weights during back propagation could be varied based on a known priority of that section of the network. For example, the degree to which weights are updated or modified could be limited by the priority of that portion of the ANN. Weights in highly sensitive and important portions of the neural network could be updated with high precision while weights in low sensitivity portions of the neural network could be kept constant during back propagation.”
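For illustration only, the structure described in the quoted paragraphs [0035], [0037], and [0045] can be sketched as follows. This is a minimal sketch under stated assumptions and is not the applicant's implementation: the numeric thresholds, the function names (`priority_from_magnitude`, `build_markup`, `conditioned_update`), and the use of a dictionary key to stand in for a single memory address are all hypothetical.

```python
# Hypothetical sketch: a priority markup derived from execution data, stored
# alongside each weight, then used to condition a weight update.

def priority_from_magnitude(magnitude, low=0.1, high=1.0):
    """Map a computation's magnitude to a fixed gradation (thresholds assumed)."""
    if magnitude >= high:
        return "high"
    if magnitude >= low:
        return "medium"
    return "low"

def build_markup(weights, execution_data):
    """Store each weight with its priority so that one key returns both."""
    return {
        name: {
            "weight": weights[name],
            "priority": priority_from_magnitude(abs(execution_data[name])),
        }
        for name in weights
    }

def conditioned_update(markup, gradients, lr=0.5):
    """Update high and medium priority weights; keep low priority weights constant."""
    for name, record in markup.items():
        if record["priority"] != "low":
            record["weight"] -= lr * gradients[name]
    return markup

weights = {"edge_a": 0.8, "edge_b": -0.3, "edge_c": 0.05}
# Magnitudes observed while executing the graph (made-up values).
execution_data = {"edge_a": 2.4, "edge_b": -0.5, "edge_c": 0.01}

markup = build_markup(weights, execution_data)
markup = conditioned_update(markup, {"edge_a": 0.2, "edge_b": 0.2, "edge_c": 0.2})
```

A single lookup such as `markup["edge_a"]` returns both the weight and its priority value, which mirrors the "single memory address" example in [0037]; storing the markup as a separate structure would be an equally valid reading of that paragraph.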





Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 25 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention. Claim 25 recites “the directed graph”, as opposed to the “neural network” which is recited in independent claim 24. For the purposes of compact prosecution, the examiner has interpreted “the directed graph” as recited in claim 25 as “the neural network”.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6, 8, and 10-30 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent No. US 9633306 B2 to Liu et al. (hereinafter, “Liu”), in view of “Net2Net: ACCELERATING LEARNING VIA KNOWLEDGE TRANSFER” to Chen et al. (hereinafter, “Chen”), further in view of “Dynamic Deep Neural Networks: Optimizing Accuracy-Efficiency Trade-offs by Selective Execution” to Liu and Deng (hereinafter, “Deng”).

As per claim 1, Liu teaches A computer-implemented method for executing a directed graph, in which each step is conducted by a processor, comprising (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components” [col 16, lines 22-26].):
deriving a simplified version of the directed graph, wherein the directed graph is an original directed graph (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26]. Examiner Note: Liu’s neural network is seen as equivalent to the directed graph of the instant application.);
obtaining a collection of execution data during an execution of the simplified version of the directed graph (Liu, “At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 49-56]. Examiner Note: Liu’s training data from a training input is seen as equivalent to the execution data from execution of the simplified version of the directed graph.)

Liu teaches the application of an input tensor to a directed graph prior to the simplification of the directed graph, but does not explicitly disclose applying an input tensor to the directed graph, wherein the directed graph is the original directed graph and not the simplified version of the directed graph; conditioning the execution of the directed graph, during the application of the live input tensor to the directed graph, using the collection of execution data; and obtaining an output tensor from the conditional execution of the directed graph. 

Chen teaches applying an input tensor to the directed graph, wherein the directed graph is the original directed graph and not the simplified version of the directed graph (Chen, p. 2-3, “Specifically, suppose that a teacher network is represented by a function y = f(x; θ) where x is the input to the network, y is the output of the network, and θ is the parameters of the network.” Examiner Note: Chen discloses applying input tensors to the larger network (see also, “Experiments”, p. 6-8). When Chen is applied to Liu, the resulting system would apply inputs to the original, larger directed graph subsequent to the creation of a simplified version of the graph.);
conditioning the execution of the directed graph [by selecting, during the application of the input tensor to the directed graph, computations for suppression] using the collection of execution data (Chen, p.1 “Specifically, we initialize the student to be a neural network that represents the same function as the teacher, but using a different parameterization. One of these transformations, Net2WiderNet allows replacing a model with an equivalent model that is wider (has more units in each hidden layer). Another of these transformations, Net2DeeperNet allows replacing a model that satisfies some properties with an equivalent, deeper model. After initializing the larger network to contain all of the knowledge previously acquired by the smaller network, the larger network may be trained to improve its performance.” Examiner Note: When Chen is applied to Liu, the resulting system would transfer the knowledge (e.g., use the execution data) of the smaller neural network (i.e., the simplified directed graph), to train (i.e., condition the execution of) the larger network (i.e., the original directed graph).).

Liu and Chen are analogous art because they are both directed to neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Liu’s network simplification with Chen’s knowledge transfer. The combination would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention because he/she would have been motivated to increase the speed of training the networks, which can be accomplished by transferring knowledge from a teacher network to a student network (Chen, p1, “We use Net2Net as a general term describing any process of training a student network significantly faster than would otherwise be possible by leveraging knowledge from a teacher network that was already trained on the same task.”).
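For illustration only, the function-preserving widening that Chen refers to as Net2WiderNet can be sketched as follows. This is a minimal sketch, not Chen's code: the one-hidden-layer network, the ReLU activation, the weight values, and the names `forward` and `widen` are assumptions chosen for brevity. A hidden unit is copied and its outgoing weight is split between the two copies, so the wider (student) network initially computes the same function as the smaller (teacher) network.

```python
# Hypothetical sketch of a function-preserving widening step in the style of
# Net2WiderNet: copy one hidden unit and split its outgoing weight.

def relu(v):
    return v if v > 0.0 else 0.0

def forward(x, w_in, w_out):
    """One hidden layer. w_in[j] holds the incoming weights of hidden unit j;
    w_out[j] is that unit's single outgoing weight."""
    hidden = [relu(sum(wi * xi for wi, xi in zip(w, x))) for w in w_in]
    return sum(h * wo for h, wo in zip(hidden, w_out))

def widen(w_in, w_out, unit):
    """Duplicate hidden unit `unit` and halve the outgoing weight of both copies."""
    new_w_in = [list(w) for w in w_in] + [list(w_in[unit])]
    new_w_out = list(w_out) + [w_out[unit] / 2.0]
    new_w_out[unit] = w_out[unit] / 2.0
    return new_w_in, new_w_out

teacher_in = [[0.5, -1.0], [1.5, 0.25]]
teacher_out = [2.0, -1.0]
student_in, student_out = widen(teacher_in, teacher_out, unit=0)

x = [1.0, 0.25]
teacher_y = forward(x, teacher_in, teacher_out)
student_y = forward(x, student_in, student_out)
```

Here `teacher_y` and `student_y` agree, while the student has three hidden units instead of two and can then be trained further.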
	

The combination of Liu and Chen teaches conditioning the execution of the directed graph using execution data from the simplified directed graph, but does not explicitly teach conditioning the execution of the directed graph, during the application of the live input tensor to the directed graph, using the collection of execution data; and obtaining an output tensor from the conditional execution of the directed graph.

Deng teaches conditioning the execution of the directed graph, by selecting, during the application of the input tensor to the directed graph, computations for suppression using the collection of execution data (Deng, Figure 2. p.1, “This paper introduces Dynamic Deep Neural Networks (D2NN), a new type of feed-forward deep neural network that allows selective execution. That is, given an input, only a subset of neurons are executed, and the particular subset is determined by the network itself and dependent on the particular input.” Examiner Note: Selective execution of nodes in a neural network, based on the input to the neural network, is seen as equivalent to selecting computations for suppression.); and
obtaining an output tensor from the conditional execution of the directed graph (Deng, “As an example application, this D2NN can be used for binary classification of images, where some images can be rapidly classified as negative after only a small amount of computation.”).

Liu, Chen, and Deng are analogous art because they are directed to neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Liu’s network simplification with Chen’s knowledge transfer, and Deng’s selective execution. The combination would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention because he/she would have been motivated to increase computational efficiency, which can be accomplished with selective execution (Deng, p1, “D2NNs provide a way to improve computational efficiency by selective execution, pruning unnecessary computation depending on input.”).
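For illustration only, input-dependent selective execution of the kind Deng describes can be sketched as follows. This is a minimal sketch under stated assumptions: Deng's control decisions are made by the network itself, whereas this sketch substitutes a fixed threshold on a cheap score, and the names `cheap_score`, `expensive_branch`, and `selective_forward` are hypothetical.

```python
# Hypothetical sketch of selective execution: a cheap control score decides
# whether the expensive branch runs at all for a given input.

def cheap_score(x):
    """Inexpensive first-stage computation (assumed: mean intensity)."""
    return sum(x) / len(x)

def expensive_branch(x):
    """Stand-in for the costly part of the network."""
    return sum(v * v for v in x) / len(x)

def selective_forward(x, gate_threshold=0.2, decision_threshold=0.5):
    """Return (label, list of branches that were actually executed)."""
    executed = ["cheap"]
    if cheap_score(x) < gate_threshold:
        # Early exit: the remaining computation is suppressed for this input.
        return "negative", executed
    executed.append("expensive")
    label = "positive" if expensive_branch(x) >= decision_threshold else "negative"
    return label, executed

easy_negative = selective_forward([0.0, 0.1, 0.1])
needs_full_pass = selective_forward([0.9, 0.8, 1.0])
```

The first input is classified as negative after only the cheap computation, while the second input triggers the full computation.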

As per claim 2, Liu teaches The computer-implemented method from claim 1, further comprising: 
applying a pilot input tensor to the simplified version of the directed graph, to conduct the execution of the simplified version of the directed graph (Liu, “At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 49-56]. Examiner Note: Liu’s training input is seen as equivalent to the pilot input tensor.);
wherein the input tensor is a live input tensor (Liu, “A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero” [col 11, lines 41-44]. “FIG. 10 illustrates a method for approximating a trained deep neural network using functional approximation to reduce the number of nodes in each layer according to an embodiment of the present invention. At step 1002, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1” [col 13, lines 23-30]. Examiner Note: Liu’s second set of inputs in the supervised training set are seen as equivalent to a live input tensor.);
the pilot input tensor and the live input tensor are not identical (Liu, “In order to evaluate the effectiveness of this approach, the present inventors have used this approach in left ventricle (LV) apex detection in 2D MR images. The dataset contains 7961 images from 184 patients, from which positive and negative patches of 32×32 pixels were sampled. 75% of the patches were randomly selected for training and the rest were used for testing. Images of the same patient appear multiple times within the same set, but not both. Positive patches were generated by placing the center at the annotated ground truth can cropping the corresponding image patch. Negative patches were sampled far away from the ground truth location of the LV apex. A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero. The size of the layers of the trained deep neural are 1024-1024-300-100-2. The training was initialized with unsupervised pre-training and then refined using supervised back-propagation. Table 1 shows the 2D LV apex classification error of approximations of the deep neural network generated using weight sparsification performed by thresholding with different sparse factors” [col 11, lines 30-51]. Examiner Note: Liu’s use of images where the test and training inputs are not identical sets ensures that the test input and live input are not identical, and thus are seen as equivalent to the pilot and live inputs of the instant application.); and
the pilot input tensor and the live input tensor are stochastically dependent (Liu, “A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero” [col 11, lines 41-44]. “FIG. 10 illustrates a method for approximating a trained deep neural network using functional approximation to reduce the number of nodes in each layer according to an embodiment of the present invention. At step 1002, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1” [col 13, lines 23-30].  Examiner Note: Liu’s setting of random input pixels to zero makes the training input stochastically dependent upon the live input.).
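For illustration only, the corruption step quoted from Liu (a noise fraction of 50%, with that share of input pixels randomly set to zero) can be sketched as follows. This is a minimal sketch, not Liu's code: the function name `corrupt`, the fixed seed, and the pixel values are assumptions.

```python
import random

def corrupt(pixels, noise_fraction=0.5, seed=0):
    """Return a copy of `pixels` with a random `noise_fraction` of entries zeroed."""
    rng = random.Random(seed)
    n_zero = int(round(noise_fraction * len(pixels)))
    zero_idx = set(rng.sample(range(len(pixels)), n_zero))
    return [0.0 if i in zero_idx else p for i, p in enumerate(pixels)]

clean = [0.2, 0.9, 0.4, 0.7, 0.1, 0.6, 0.8, 0.3]
noisy = corrupt(clean)
```

Every entry of `noisy` is either zero or the corresponding entry of `clean`, so the corrupted input is a random function of the clean input rather than an independent sample.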

As per claim 3, Liu teaches The computer-implemented method from claim 1, further comprising: storing the execution data in memory as stored execution data (Liu, “In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 51-56]. “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, lines 22-38].); and
priming the directed graph for the conditional execution, prior to the conditional execution of the directed graph, using the stored execution data (Liu, “For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 37-56].).
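For illustration only, the two steps quoted from Liu (step 704, reducing the number of non-zero weights in a filter, and step 706, refining the remaining weights while the zeroed weights stay at zero) can be sketched as follows. This is a minimal sketch, not Liu's implementation: it assumes a single linear filter, a keep-the-largest-magnitudes rule as the thresholding criterion, one squared-error gradient step in place of full back-propagation, and hypothetical names `sparsify` and `refine_step`.

```python
def sparsify(weights, keep):
    """Keep the `keep` largest-magnitude weights and set the rest to zero.
    Returns (sparse_weights, mask)."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    kept = set(order[:keep])
    mask = [1.0 if i in kept else 0.0 for i in range(len(weights))]
    return [w if m else 0.0 for w, m in zip(weights, mask)], mask

def refine_step(weights, mask, x, target, lr=0.1):
    """One gradient step on squared error for a linear filter. Masked weights
    receive no update and therefore stay at zero."""
    pred = sum(w * xi for w, xi in zip(weights, x))
    err = pred - target
    return [w - lr * err * xi * m for w, xi, m in zip(weights, x, mask)]

filter_weights = [0.9, -0.05, 0.4, 0.01]
sparse, mask = sparsify(filter_weights, keep=2)
refined = refine_step(sparse, mask, x=[1.0, 1.0, 1.0, 1.0], target=1.0)
```

After the refinement step, the two retained weights have moved toward the target while the two zeroed weights remain exactly zero.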

As per claim 4, Liu teaches The computer-implemented method from claim 1, wherein: the directed graph includes a set of vertices and a set of edges interconnecting the set of vertices (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26].);
the directed graph is a neural network (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26].);
the set of edges of the directed graph are calculations involving a set of weights for the neural network, wherein the set of weights include at least one weight tensor (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26].);
at least a subset of the set of vertices are weights for the neural network (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26].);
the conditional execution of the directed graph produces an inference tensor (Liu, “If the trained deep neural network is a discriminative deep neural network, the approximation of the trained deep neural network calculates for each image patch, a probability that the target anatomical landmark is located at the pixel or voxel at which the image patch is centered. The location with the highest probability can then be selected as the detected anatomical landmark location in the medical image. If the trained deep neural network is a deep neural network regressor, the approximation of the trained deep neural network outputs a difference vector for each image patch that provides a displacement from the pixel or voxel at which the image patch is centered to a predicted location of the target anatomical landmark in the medical image. The predicted locations from each of the image patches can then be aggregated to determine the detected anatomical landmark location in the medical image. At step 110, the anatomical object detection result is output.” [col 5, line 42 – col 6, line 4]. Examiner Note: Liu’s object detection result output is seen as equivalent to an inference tensor, and is produced after the approximation and conditioning of Liu’s neural network.); and
the inference tensor is a response of the neural network to the live input tensor (Liu, “At step 108, anatomical object detection is performed in the medical image using the approximation of the trained deep neural network. In a possible implementation, a sliding window approach can be used in which a respective image patch centered at each pixel or voxel is extracted from the medical image. Each image patch is input to the approximation of the trained deep neural network, which operates directly on the pixels or voxels in each patch. If the trained deep neural network is a discriminative deep neural network, the approximation of the trained deep neural network calculates for each image patch, a probability that the target anatomical landmark is located at the pixel or voxel at which the image patch is centered… At step 110, the anatomical object detection result is output. For example, the anatomical object detection result can be output by displaying the medical image on a display device of the computer system with the anatomical object location marked or highlighted in the displayed medical image” [col 5, line 42 – col 6, line 4]. Examiner Note: Liu’s object detection result output is seen as equivalent to an inference tensor, and is produced after the approximation and conditioning of Liu’s neural network. It is necessarily a result of the live input (image to be classified).).
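For illustration only, the sliding window detection quoted from Liu (a patch centered at each pixel is scored, and the location with the highest score is selected) can be sketched as follows. This is a minimal sketch, not Liu's implementation: the placeholder `patch_score` stands in for the approximated network, zero padding at the image border is an assumption, and the image values are made up.

```python
def extract_patch(image, row, col, half):
    """Patch centered at (row, col); out-of-bounds pixels are treated as 0."""
    h, w = len(image), len(image[0])
    return [[image[r][c] if 0 <= r < h and 0 <= c < w else 0.0
             for c in range(col - half, col + half + 1)]
            for r in range(row - half, row + half + 1)]

def patch_score(patch):
    """Placeholder classifier (assumed): mean intensity of the patch."""
    values = [v for line in patch for v in line]
    return sum(values) / len(values)

def detect_landmark(image, half=1, score=patch_score):
    """Score the patch centered at every pixel and return the best location."""
    best_loc, best_score = None, float("-inf")
    for r in range(len(image)):
        for c in range(len(image[0])):
            s = score(extract_patch(image, r, c, half))
            if s > best_score:
                best_loc, best_score = (r, c), s
    return best_loc

image = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.5],
    [0.0, 0.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.5, 1.0, 0.5],
]
location = detect_landmark(image)
```

With this toy image the highest-scoring patch is centered at row 3, column 3, which plays the role of the detected landmark location.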

As per claim 5, Liu teaches The computer-implemented method from claim 4, wherein: an edge in the set of edges is a calculation using a four dimensional tensor (Liu, Figure 2, “As shown in FIG. 2, the AE 200 is a feed-forward neural network with one hidden layer 204. The AE 200 has an input layer L1 202, the hidden layer L2, and an output layer L3 206. If the AE 200 is a fully connected network, each node in the input layer 202 can correspond to a respective voxel or pixel of an image patch. Ignoring the bias term (the nodes labeled as +1 in FIG. 2), the input and output layers 202 and 206, respectively have the same number of nodes. The goal of an AE is to minimize the difference between the input and output vectors. If the hidden layer 204 has a size equal to or larger than the input layer 202, an AE may learn an identify transformation. To prevent such a trivial solution, an AE can be set up with a hidden layer 204 with fewer nodes than the input layer 202.” [col 4, lines 16-42]. Examiner Note: Figure 2 demonstrates that Liu’s network can include a four dimensional calculation, which could be represented as a tensor.).


As per claim 6, Liu teaches The computer-implemented method from claim 4, wherein: the deriving of the simplified version of the directed graph includes down-sampling the directed graph by a sampling factor (Liu, “According to an advantageous embodiment of the present invention, the SparseConnect and ShrinkConnect methods for approximating a trained deep neural network can be combined. The SparseConnect and ShrinkConnect methods exploit different types of redundancy within a trained deep neural network. The methods complement each other and may be combined to achieve an even greater speed up. For example, in a possible implementation, a trained deep neural network can be first be approximated using the ShrinkConnect method to reduce the number of nodes in each layer of the trained deep neural network, followed by using the SparseConnect method (using thresholding or re-weighted L1-norm minimization) to sparsify the weights in the filters connecting each layer in the approximation of the deep neural network resulting from applying the ShrinkConnect method. The present inventors tested this combined method using the thresholding approach for weight sparsification (SparseConnect) in order to approximate the trained deep neural network for LV apex detection in 2D MR images. The original trained deep neural network was simplified by a factor of 3 using the ShrinkConnect method (function approximation) and then further simplified by a factor of 10 using the SparseConnect method (weight sparsification)” [col 15, lines 21-43]. Examiner Note: Liu’s use of weight sparsification and function approximation by certain factors is seen as a form of down sampling.);
the simplified version of the directed graph is thereby a down-sampled version of the directed graph (Liu, “According to an advantageous embodiment of the present invention, the SparseConnect and ShrinkConnect methods for approximating a trained deep neural network can be combined. The SparseConnect and ShrinkConnect methods exploit different types of redundancy within a trained deep neural network. The methods complement each other and may be combined to achieve an even greater speed up. For example, in a possible implementation, a trained deep neural network can be first be approximated using the ShrinkConnect method to reduce the number of nodes in each layer of the trained deep neural network, followed by using the SparseConnect method (using thresholding or re-weighted L1-norm minimization) to sparsify the weights in the filters connecting each layer in the approximation of the deep neural network resulting from applying the ShrinkConnect method.” [col 15, lines 21-43].);
a first complete set of tensors used for executing the simplified version of the directed graph has a rank (Liu, “In order to evaluate the effectiveness of this approach, the present inventors have used this approach in left ventricle (LV) apex detection in 2D MR images. The dataset contains 7961 images from 184 patients, from which positive and negative patches of 32×32 pixels were sampled. 75% of the patches were randomly selected for training and the rest were used for testing. Images of the same patient appear multiple times within the same set, but not both. Positive patches were generated by placing the center at the annotated ground truth can cropping the corresponding image patch. Negative patches were sampled far away from the ground truth location of the LV apex. A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero. The size of the layers of the trained deep neural are 1024-1024-300-100-2. The training was initialized with unsupervised pre-training and then refined using supervised back-propagation. Table 1 shows the 2D LV apex classification error of approximations of the deep neural network generated using weight sparsification performed by thresholding with different sparse factors” [col 11, lines 30-51]. Examiner Note: Liu’s test input images have a dimensionality equal to the dimensionality of the live inputs.); and
a second complete set of tensors used for executing the directed graph has the rank (Liu, “In order to evaluate the effectiveness of this approach, the present inventors have used this approach in left ventricle (LV) apex detection in 2D MR images. The dataset contains 7961 images from 184 patients, from which positive and negative patches of 32×32 pixels were sampled. 75% of the patches were randomly selected for training and the rest were used for testing. Images of the same patient appear multiple times within the same set, but not both. Positive patches were generated by placing the center at the annotated ground truth can cropping the corresponding image patch. Negative patches were sampled far away from the ground truth location of the LV apex. A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero. The size of the layers of the trained deep neural are 1024-1024-300-100-2. The training was initialized with unsupervised pre-training and then refined using supervised back-propagation. Table 1 shows the 2D LV apex classification error of approximations of the deep neural network generated using weight sparsification performed by thresholding with different sparse factors” [col 11, lines 30-51]. Examiner Note: Liu’s test input images have a dimensionality equal to the dimensionality of the live inputs.).


As per claim 8, Liu teaches The computer-implemented method from claim 4, wherein: the deriving of the simplified version of the directed graph includes replacing a set of original values of the set of weights with a set of replacement values (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26].); and
the simplified version of the directed graph has a same number of layers as the directed graph (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26]. Examiner Note: Several of Liu’s simplification methods do not reduce the layers of the network.).
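For illustration only, the weight sparsification by thresholding that Liu describes can be sketched as below. This is a minimal sketch under stated assumptions, not Liu's implementation: the function names and the `keep_fraction` parameter are hypothetical, each filter is modeled as a flat list of weights, and a network is modeled as a list of layers that each hold a list of filters.

```python
from typing import List

Filter = List[float]
Layer = List[Filter]


def sparsify_filter(weights: Filter, keep_fraction: float) -> Filter:
    """Keep the largest-magnitude weights of one filter and replace the rest with zero."""
    n_keep = max(1, round(len(weights) * keep_fraction))
    # Indices ordered from largest to smallest magnitude.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def sparsify_network(layers: List[Layer], keep_fraction: float) -> List[Layer]:
    """Apply the thresholding to every filter; the number of layers is left unchanged."""
    return [[sparsify_filter(f, keep_fraction) for f in layer] for layer in layers]
```

In the sketch, original weight values are replaced with zeros while the list of layers keeps its length, which mirrors the examiner's observation that several of Liu's simplification methods do not reduce the layers of the network.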

As per claim 10, Liu teaches The computer-implemented method from claim 4, wherein: the collection of execution data includes a set of execution data values (Liu, “In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 51-56].); 
the set of execution data values and the set of vertices have uniquely corresponding elements (Liu, “In order to evaluate the effectiveness of this approach, the present inventors have used this approach in left ventricle (LV) apex detection in 2D MR images. The dataset contains 7961 images from 184 patients, from which positive and negative patches of 32×32 pixels were sampled. 75% of the patches were randomly selected for training and the rest were used for testing. Images of the same patient appear multiple times within the same set, but not both. Positive patches were generated by placing the center at the annotated ground truth can cropping the corresponding image patch. Negative patches were sampled far away from the ground truth location of the LV apex. A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero. The size of the layers of the trained deep neural are 1024-1024-300-100-2. The training was initialized with unsupervised pre-training and then refined using supervised back-propagation. Table 1 shows the 2D LV apex classification error of approximations of the deep neural network generated using weight sparsification performed by thresholding with different sparse factors” [col 11, lines 30-51]. Examiner Note: Liu’s test input images have a dimensionality equal to the dimensionality of the live inputs.);
each uniquely corresponding vertex in the set of vertices produces a contribution to the inference tensor in response to the pilot input tensor (Liu, “In order to evaluate the effectiveness of this approach, the present inventors have used this approach in left ventricle (LV) apex detection in 2D MR images. The dataset contains 7961 images from 184 patients, from which positive and negative patches of 32×32 pixels were sampled. 75% of the patches were randomly selected for training and the rest were used for testing. Images of the same patient appear multiple times within the same set, but not both. Positive patches were generated by placing the center at the annotated ground truth can cropping the corresponding image patch. Negative patches were sampled far away from the ground truth location of the LV apex. A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero. The size of the layers of the trained deep neural are 1024-1024-300-100-2. The training was initialized with unsupervised pre-training and then refined using supervised back-propagation. Table 1 shows the 2D LV apex classification error of approximations of the deep neural network generated using weight sparsification performed by thresholding with different sparse factors” [col 11, lines 30-51]. Examiner Note: Liu’s test input images have a dimensionality equal to the dimensionality of the live inputs.); and
each execution data value in the set of execution data values is proportional in magnitude to the contribution to the inference tensor of each uniquely corresponding vertex in the set of vertices (Liu, “The training of a deep neural network, such as a stacked denoising auto-encode can be performed based on stochastic gradient descent of a cost function measured as the Euclidean distance between predicted outcomes and the observations in the training data. In an ideal world, each node in the network should extract different pieces of information from the input image data so that the combination of nodes yields an accurate and robust prediction for the landmark location. However, there is no explicit constraint to prevent different nodes from learning the same thing. Moreover, due to the highly complex and non-convex nature of the optimization procedure used to train the deep neural network, the trained deep neural network will likely contain significant redundancy” [col 9, lines 52-64]. “In order to evaluate the effectiveness of this approach, the present inventors have used this approach in left ventricle (LV) apex detection in 2D MR images. The dataset contains 7961 images from 184 patients, from which positive and negative patches of 32×32 pixels were sampled. 75% of the patches were randomly selected for training and the rest were used for testing. Images of the same patient appear multiple times within the same set, but not both. Positive patches were generated by placing the center at the annotated ground truth can cropping the corresponding image patch. Negative patches were sampled far away from the ground truth location of the LV apex. A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero. The size of the layers of the trained deep neural are 1024-1024-300-100-2. The training was initialized with unsupervised pre-training and then refined using supervised back-propagation. Table 1 shows the 2D LV apex classification error of approximations of the deep neural network generated using weight sparsification performed by thresholding with different sparse factors” [col 11, lines 30-51]. Examiner Note: Liu’s utilization of gradient descent in the training process results in the proportional contribution of each vertex.).

As per claim 11, Liu teaches The computer-implemented method from claim 4, further comprising: storing the execution data in a distributed set of memory locations (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, lines 22-38]. Examiner Note: The examiner recognizes random access memory, a known form of distributed memory, as a possible embodiment of Liu’s memory unit.); and
obtaining, from a memory location in the distributed set of memory locations using a single address, both: (i) a subset of execution data from the execution data; and (ii) a weight tensor from the set of weights (Liu, “For illustrative purposes, the SparseConnect approximation methods and the ShrinkConnect approximation methods (described below) are described herein as being used in combination with a stacked denoising auto-encoder (DAE) deep neural network. However it is to be understood that these methods can be similarly applied to any other trained deep neural network. Let W denote the weight matrix and h denote the output at each layer, and the input-output of an auto-encoder can be expressed as:
h^(l) = ƒ(W^(l) x + b^(l))  (6)
where ƒ is a non-linear rectification function like sigmoid function.” [col 9, lines 39-64]. “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 34-56]); and
wherein the conditioning of the execution of the directed graph is conducted in real time using the execution data and the set of weights (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 34-56].).
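For illustration only, the layer computation in equation (6) of Liu, quoted above, can be sketched as follows. This is a minimal sketch: the function names are hypothetical, the sigmoid is used because the quoted passage names it as one possible choice of ƒ, and W is modeled as a list of rows with one row per output node.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Logistic function, one possible choice for the non-linear function in equation (6)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(W: List[List[float]], x: List[float], b: List[float]) -> List[float]:
    """Compute h = f(W x + b) for one layer, with f taken to be the sigmoid."""
    return [
        sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]
```

A weight that has been set to zero contributes nothing to the sum for its output node, which is why a sparse W reduces the number of multiplications needed at inference time.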
As per claim 12, Liu teaches The computer-implemented method from claim 4, further comprising: generating a markup of the directed graph using the collection of execution data (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. …In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights.” [col 10, line 34 – col 11, line 29]. Examiner Note: Liu’s weight sparsification is seen as equivalent to the markup of the instant application.);
storing the markup in a distributed set of memory locations (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, lines 22-38]. Examiner Note: The examiner recognizes random access memory, a known form of distributed memory, as a possible embodiment of Liu’s memory unit.); and
conditioning an update of the set of weights using the markup (Liu, “In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of step 704 can reduce a smaller number (e.g., 10%) of non-zero weights for each filter (using thresholding or L1-norm minimization) and each iteration of 706 can refine (possible multiple times) the remaining non-zero weights. Steps 704 and 706 can be iterated until a stopping condition is reached. For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used. It is also possible that these steps can be iterated until a target percentage of weights in each filter are set to zero.  In a possible embodiment, thresholding can be used to sparsify the weights of the network. 
In particular, a certain percentage of weights that have the largest magnitudes in each filter can be retained with the rest of the weights set to zero. In a possible implementation large percentage (e.g., 90% or 95%) of the weights can be set to zero each filter, and then a number of iterations (e.g., 30) of supervised back-propagation can be performed to refine the remaining non-zero weights. In another possible implementation, a smaller percentage (e.g., 10%) of weights can be set to zero, followed by supervised back-propagation to refine the remaining weights, and these steps can be iterated until a target percentage of weights in each filter are set to zero or until an overall accuracy of the approximated deep neural network decreases by a certain amount” [col 10, line 34 – col 11, line 29]. Examiner Note: Liu’s weight sparsification is seen as equivalent to the markup of the instant application).
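For illustration only, the refinement at step 706 of Liu, in which the remaining non-zero weights are updated while the weights set to zero are constrained to stay at zero, can be sketched as a masked update. This is a minimal sketch and not Liu's implementation: the names `refine_step`, `mask`, `grads`, and `learning_rate` are hypothetical, and a plain gradient step stands in for supervised back-propagation.

```python
from typing import List


def refine_step(weights: List[float],
                mask: List[bool],
                grads: List[float],
                learning_rate: float) -> List[float]:
    """Take one gradient step on retained weights; pruned positions (mask False) stay at zero."""
    return [
        w - learning_rate * g if keep else 0.0
        for w, keep, g in zip(weights, mask, grads)
    ]
```

Here `mask` records which weights survived sparsification, so the update is conditioned on that record rather than on the current weight values. A retained weight that happens to pass through zero during refinement therefore remains trainable.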
As per claim 13, Liu teaches The computer-implemented method from claim 4, further comprising: generating a markup of the directed graph using the collection of execution data (Liu, “…For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights.” [col 10, line 34 – col 11, line 29]. Examiner Note: Liu’s weight sparsification is seen as equivalent to the markup of the instant application.);
wherein the markup identifies a priority value for a weight tensor (Liu, “…In this implementation, each iteration of step 704 can reduce a smaller number (e.g., 10%) of non-zero weights for each filter (using thresholding or L1-norm minimization) and each iteration of 706 can refine (possible multiple times) the remaining non-zero weights. Steps 704 and 706 can be iterated until a stopping condition is reached. For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used. It is also possible that these steps can be iterated until a target percentage of weights in each filter are set to zero.  In a possible embodiment, thresholding can be used to sparsify the weights of the network. In particular, a certain percentage of weights that have the largest magnitudes in each filter can be retained with the rest of the weights set to zero. In a possible implementation large percentage (e.g., 90% or 95%) of the weights can be set to zero each filter, and then a number of iterations (e.g., 30) of supervised back-propagation can be performed to refine the remaining non-zero weights. In another possible implementation, a smaller percentage (e.g., 10%) of weights can be set to zero, followed by supervised back-propagation to refine the remaining weights, and these steps can be iterated until a target percentage of weights in each filter are set to zero or until an overall accuracy of the approximated deep neural network decreases by a certain amount” [col 10, line 34 – col 11, line 29].); and
wherein conditioning of the execution of the directed graph uses the markup (Liu, “… In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of step 704 can reduce a smaller number (e.g., 10%) of non-zero weights for each filter (using thresholding or L1-norm minimization) and each iteration of 706 can refine (possible multiple times) the remaining non-zero weights. Steps 704 and 706 can be iterated until a stopping condition is reached. For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used. It is also possible that these steps can be iterated until a target percentage of weights in each filter are set to zero.  In a possible embodiment, thresholding can be used to sparsify the weights of the network. In particular, a certain percentage of weights that have the largest magnitudes in each filter can be retained with the rest of the weights set to zero. In a possible implementation large percentage (e.g., 90% or 95%) of the weights can be set to zero each filter, and then a number of iterations (e.g., 30) of supervised back-propagation can be performed to refine the remaining non-zero weights. 
In another possible implementation, a smaller percentage (e.g., 10%) of weights can be set to zero, followed by supervised back-propagation to refine the remaining weights, and these steps can be iterated until a target percentage of weights in each filter are set to zero or until an overall accuracy of the approximated deep neural network decreases by a certain amount” [col 10, line 34 – col 11, line 29].).
As per claim 14, Liu teaches The computer-implemented method from claim 13, further comprising: storing the markup in a distributed set of memory locations (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, lines 22-38]. Examiner Note: The examiner recognizes random access memory, a known form of distributed memory, as a possible embodiment of Liu’s memory unit.); and
obtaining the priority value and the weight tensor from a memory location in the distributed set of memory locations using a single address (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 
7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of step 704 can reduce a smaller number (e.g., 10%) of non-zero weights for each filter (using thresholding or L1-norm minimization) and each iteration of 706 can refine (possible multiple times) the remaining non-zero weights. Steps 704 and 706 can be iterated until a stopping condition is reached. For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used. It is also possible that these steps can be iterated until a target percentage of weights in each filter are set to zero.  In a possible embodiment, thresholding can be used to sparsify the weights of the network. In particular, a certain percentage of weights that have the largest magnitudes in each filter can be retained with the rest of the weights set to zero. In a possible implementation large percentage (e.g., 90% or 95%) of the weights can be set to zero each filter, and then a number of iterations (e.g., 30) of supervised back-propagation can be performed to refine the remaining non-zero weights. 
In another possible implementation, a smaller percentage (e.g., 10%) of weights can be set to zero, followed by supervised back-propagation to refine the remaining weights, and these steps can be iterated until a target percentage of weights in each filter are set to zero or until an overall accuracy of the approximated deep neural network decreases by a certain amount” [col 10, line 34 – col 11, line 29].).
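Illustrative note: the thresholding variant of Liu's step 704 and the constrained refinement of step 706, as quoted above, can be sketched as follows. This is a minimal sketch for the reader's orientation and is not Liu's implementation. The function names `sparsify_filter` and `masked_update`, the learning rate, and the example filter values are hypothetical.

```python
def sparsify_filter(weights, keep_fraction):
    """Keep the largest-magnitude weights of one filter and zero the rest.

    Returns the sparse weights and a 0/1 mask marking the retained positions.
    """
    n_keep = max(1, round(len(weights) * keep_fraction))
    # Indices ordered by descending magnitude (hypothetical tie handling: stable sort).
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:n_keep])
    mask = [1 if i in keep else 0 for i in range(len(weights))]
    sparse = [w if m else 0.0 for w, m in zip(weights, mask)]
    return sparse, mask


def masked_update(weights, grads, mask, lr=0.1):
    """One gradient step in which weights set to zero are constrained to stay at zero."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]


# Hypothetical filter with ten weights; retain 30% and zero the remaining 70%.
filt = [0.9, -0.05, 0.02, -1.2, 0.3, 0.01, -0.4, 0.07, 0.0, 0.6]
sparse, mask = sparsify_filter(filt, keep_fraction=0.3)

# A refinement step with an arbitrary gradient leaves the zeroed positions at zero.
refined = masked_update(sparse, [1.0] * len(sparse), mask)
```

In this sketch the mask plays the role of the record of which weights survive, and the refinement step only ever changes the retained weights.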
As per claim 15, Liu teaches The computer-implemented method from claim 13, further comprising: storing the markup at a single memory location (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, lines 22-38].);
wherein the conditioning of the execution of the directed graph further comprises: obtaining the markup from the single memory location (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 
7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network.” [col 10, line 34 – col 11, line 29]. Examiner Note: Liu’s weight sparsification is seen as equivalent to the markup of the instant application);
obtaining a first subset of the set of weights from memory (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. 
In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights.” [col 10, line 34 – col 11, line 29]. Examiner Note: Liu’s weight sparsification is seen as equivalent to the markup of the instant application); and
wherein the first subset is selected using the markup (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. 
In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of step 704 can reduce a smaller number (e.g., 10%) of non-zero weights for each filter (using thresholding or L1-norm minimization) and each iteration of 706 can refine (possible multiple times) the remaining non-zero weights. Steps 704 and 706 can be iterated until a stopping condition is reached.” [col 10, line 34 – col 11, line 29]. Examiner Note: Liu’s weight sparsification is seen as equivalent to the markup of the instant application).
As per claim 16, Liu teaches The computer-implemented method from claim 13, wherein the conditioning of the execution of the directed graph further comprises: reducing an accuracy of a computation using the weight tensor based on the priority value (Liu, “The method of FIG. 3 is performed for each hidden layer (or each layer) in the trained deep neural network. As described above, the method of FIG. 3 can be performed for each hidden layer during training prior to training the subsequent layer of the deep neural network. In a possible implementation, the Haar wavelet approximation for each hidden layer can be performed during training of the deep neural network using iterative approximation and training steps. FIG. 6 illustrates iteratively training the deep neural network while approximating the weight matrices using wavelet approximation according to an embodiment of the present invention. As shown in FIG. 6, at step 602 neural network training is performed to train the weights matrices of the neural network, and at step 604, Haar wavelet approximation is performed to reconstruct the weights using 1D Haar wavelet bases and wavelet coefficients and a number of wavelet coefficients are set to zero. Steps 602 and 604 are then iterated. In each round of iteration, the wavelet coefficients that are set to zero are kept at zero, while the remaining coefficients are adjusted by the neural network training algorithm, such as backpropagation. The iterations can be repeated until a number of wavelet coefficients remaining converges, for a predetermined number of iterations, or until a stopping condition associated with a decrease in accuracy of the approximation of the deep neural network is reached. In an exemplary implementation, the steps of FIG. 6 can be iterated for each hidden layer during the training of the hidden layer. 
In another embodiment, each iteration of step 604 can be performed for each hidden layer of a trained deep neural network and each iteration of step 602 can re-train the whole deep neural network” [col 8 line 43 – col 9 line 6].).
As per claim 17, Liu teaches The computer-implemented method from claim 13, wherein the conditioning of the execution of the directed graph further comprises: obtaining a first subset of weights from the set of weights from memory (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 
7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of step 704 can reduce a smaller number (e.g., 10%) of non-zero weights for each filter (using thresholding or L1-norm minimization) and each iteration of 706 can refine (possible multiple times) the remaining non-zero weights. Steps 704 and 706 can be iterated until a stopping condition is reached. For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used. It is also possible that these steps can be iterated until a target percentage of weights in each filter are set to zero.  In a possible embodiment, thresholding can be used to sparsify the weights of the network. In particular, a certain percentage of weights that have the largest magnitudes in each filter can be retained with the rest of the weights set to zero. In a possible implementation large percentage (e.g., 90% or 95%) of the weights can be set to zero each filter, and then a number of iterations (e.g., 30) of supervised back-propagation can be performed to refine the remaining non-zero weights. 
In another possible implementation, a smaller percentage (e.g., 10%) of weights can be set to zero, followed by supervised back-propagation to refine the remaining weights, and these steps can be iterated until a target percentage of weights in each filter are set to zero or until an overall accuracy of the approximated deep neural network decreases by a certain amount” [col 10, line 34 – col 11, line 29]. Examiner Note: Liu’s weight sparsification is seen as equivalent to the markup of the instant application);
replacing a set of original values of a second subset of the set of weights with a set of replacement values (Liu, “FIGS. 4 and 5 illustrate examples of approximating weight matrices for nodes of a hidden layer of a trained deep neural network using the method of FIG. 3. As shown in FIG. 4, image 402 is a visualization of an original trained weight matrix associated with a hidden layer node, and image 404 is a visualization of an approximation of the weight matrix shown in image 402 using Haar wavelet reconstruction with half of the wavelet coefficients set to zero. As shown in FIG. 4, image 502 is a visualization of an original trained weight matrix associated with another hidden layer node, and image 504 is a visualization of an approximation of the weight matrix shown in image 502 using Haar wavelet reconstruction with half of the wavelet coefficients set to zero. Once all the weight matrices of the hidden layer are reconstructed using the wavelet coefficients and the 1D wavelet bases and shrinkage is performed on the wavelet coefficients, the wavelet coefficients and 1D wavelet bases can be used on the input image patches in place of the weight matrices in order to approximate the Frobenius inner product…” [col 7 lines 46-66]); and
wherein the first subset of weights is selected using the markup (Liu, “In order to perform the anatomical landmark detection (step 108 of FIG. 1), the sliding window approach can be used where a plurality of image patches P are examined while sliding over the whole image or volume V. The computation of ΦᵀPΦ for each image patch for each node of the hidden layer can be sped up using integral imaging techniques when Haar wavelets are used for the wavelet bases. An integral image the same size as the original image is stored in a look-up table and the Haar wavelet bases determine which items (pixels) in the look-up table will be looked up. For example, the 4×4 Haar wavelet bases Φ₄ shown in Equation (3) can be used, but the present invention is not limited thereto. In this 4×4 case, matrix multiplication PΦ amounts to four look-up operations for the multiplication with the first column of Φ₄, four table look-ups and a minus operation for the second column, and two table look-ups and a minus operations for each of the third and fourth columns. This is faster than direct matrix multiplication. The same speed up can be obtained for the multiplication with Φᵀ. The same analysis described herein can be similarly applied to larger Haar wavelet bases as well. Once Z=ΦᵀPΦ is obtained, the Frobenius inner product of Y and Z may seem as computationally expensive as the original goal of computing P:W=ΣₘΣₙP(m,n)W(m,n). However, the wavelet coefficients Y are sparse due to the shrinkage applied to the wavelet coefficients in step 306, which results in less computations. Since the wavelet coefficients are computed offline from the neural network weight matrices rather than during detection, the shrinkage operation will not adversely affect detection speed” [col 8, lines 13-42].).

As per claim 18, Liu teaches The computer-implemented method from claim 17, wherein: the deriving of the simplified version of the directed graph includes replacing the set of original values of the set of weights with the set of replacement values (Liu, “FIGS. 4 and 5 illustrate examples of approximating weight matrices for nodes of a hidden layer of a trained deep neural network using the method of FIG. 3. As shown in FIG. 4, image 402 is a visualization of an original trained weight matrix associated with a hidden layer node, and image 404 is a visualization of an approximation of the weight matrix shown in image 402 using Haar wavelet reconstruction with half of the wavelet coefficients set to zero. As shown in FIG. 4, image 502 is a visualization of an original trained weight matrix associated with another hidden layer node, and image 504 is a visualization of an approximation of the weight matrix shown in image 502 using Haar wavelet reconstruction with half of the wavelet coefficients set to zero. Once all the weight matrices of the hidden layer are reconstructed using the wavelet coefficients and the 1D wavelet bases and shrinkage is performed on the wavelet coefficients, the wavelet coefficients and 1D wavelet bases can be used on the input image patches in place of the weight matrices in order to approximate the Frobenius inner product P:W=ΣₘΣₙP(m,n)W(m,n), as follows: [Equation 4]. Accordingly, the Frobenius inner product P:W=ΣₘΣₙP(m,n)W(m,n) is approximated as the inner product of Y and ΦᵀPΦ” [col 7 line 46 – col 8 line 12].).
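
Illustrative note: the replacement of original weight values with a wavelet-domain approximation, as described in the passage quoted above, can be sketched as follows. This is a minimal sketch under stated assumptions and is not Liu's implementation. It assumes an orthonormal 4×4 Haar basis `PHI` (Liu's Equation (3) is not reproduced in this action and may be scaled differently), and the matrices `W` and `P`, the helper names, and the choice to keep 8 of 16 coefficients are hypothetical.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius(a, b):
    """Frobenius inner product: sum over m, n of a[m][n] * b[m][n]."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def shrink(y, n_keep):
    """Keep the n_keep largest-magnitude coefficients and set the rest to zero."""
    flat = [(abs(v), i, j) for i, row in enumerate(y) for j, v in enumerate(row)]
    keep = {(i, j) for _, i, j in sorted(flat, reverse=True)[:n_keep]}
    return [[v if (i, j) in keep else 0.0 for j, v in enumerate(row)]
            for i, row in enumerate(y)]


s = 1 / math.sqrt(2)
# Assumed orthonormal 4x4 Haar basis; each column is one basis vector.
PHI = [[0.5, 0.5, s, 0.0],
       [0.5, 0.5, -s, 0.0],
       [0.5, -0.5, 0.0, s],
       [0.5, -0.5, 0.0, -s]]
PHI_T = transpose(PHI)

# Hypothetical trained weight matrix W and image patch P.
W = [[0.9, 0.8, 0.1, 0.0],
     [0.7, 0.9, 0.2, 0.1],
     [0.1, 0.2, -0.5, -0.6],
     [0.0, 0.1, -0.4, -0.7]]
P = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 3.0, 1.0],
     [2.0, 1.0, 1.0, 0.0],
     [1.0, 0.0, 2.0, 2.0]]

Y = matmul(matmul(PHI_T, W), PHI)   # wavelet coefficients of the weights
Z = matmul(matmul(PHI_T, P), PHI)   # patch expressed in the same basis

exact = frobenius(P, W)             # original computation P:W
via_wavelets = frobenius(Y, Z)      # same value before any shrinkage

Y_sparse = shrink(Y, n_keep=8)      # half of the coefficients set to zero
W_approx = matmul(matmul(PHI, Y_sparse), PHI_T)  # replacement weight values
approx = frobenius(Y_sparse, Z)     # approximated P:W
```

With an orthonormal basis, `via_wavelets` equals `exact` up to rounding, and `approx` equals the inner product of `P` with the replacement weights `W_approx`, so the sparse coefficients stand in for the original weight values.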

Claim 19 is a means-for system claim corresponding to method claim 1. Liu teaches A system for executing a directed graph comprising (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components” [col 16, lines 22-26].):
a means for deriving a simplified version of the directed graph, wherein the directed graph is an original directed graph (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26]. Examiner Note: Liu’s neural network is seen as equivalent to the directed graph of the instant application.);
a means for obtaining a collection of execution data during the application of the pilot input tensor to the simplified version of the directed graph (Liu, “At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 49-56]. Examiner Note: Liu’s training data from a training input is seen as equivalent to the execution data of the pilot input tensor, and the pilot input tensor is seen as equivalent to the training input.)

Liu fails to disclose conditioning the execution of the directed graph, during the application of the live input tensor to the directed graph, using the collection of execution data; and obtaining an output tensor from the conditional execution of the directed graph. 

Chen teaches a means for applying an input tensor to the directed graph, wherein the directed graph is the original directed graph and not the simplified version of the directed graph (Chen, p. 2-3, “Specifically, suppose that a teacher network is represented by a function y = f(x; θ) where x is the input to the network, y is the output of the network, and θ is the parameters of the network”. Examiner Note: Chen discloses applying input tensors to the larger network (see also, “Experiments”, p. 6-8). When Chen is applied to Liu, the resulting system would apply inputs to the original, larger directed graph subsequent to the creation of a simplified version of the graph. The “means for” structure is disclosed by Liu, as above.);
conditioning the execution of the directed graph [by selecting, during the application of the input tensor to the directed graph, computations for suppression] using the collection of execution data (Chen, p.1 “Specifically, we initialize the student to be a neural network that represents the same function as the teacher, but using a different parameterization. One of these transformations, Net2WiderNet allows replacing a model with an equivalent model that is wider (has more units in each hidden layer). Another of these transformations, Net2DeeperNet allows replacing a model that satisfies some properties with an equivalent, deeper model. After initializing the larger network to contain all of the knowledge previously acquired by the smaller network, the larger network may be trained to improve its performance.” Examiner Note: When Chen is applied to Liu, the resulting system would transfer the knowledge (e.g., use the execution data) of the smaller neural network (i.e., the simplified directed graph), to train (i.e., condition the execution of) the larger network (i.e., the original directed graph).).

Liu and Chen are analogous art because both are directed to neural networks. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Liu’s network simplification with Chen’s knowledge transfer. One of ordinary skill in the art would have been motivated to make this combination in order to increase the speed of training the networks, which can be accomplished by transferring knowledge from a teacher network to a student network (Chen, p. 1, “We use Net2Net as a general term describing any process of training a student network significantly faster than would otherwise be possible by leveraging knowledge from a teacher network that was already trained on the same task.”).
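
Illustrative note: the function-preserving widening that Chen describes for Net2WiderNet can be sketched as follows. This is a minimal sketch and is not Chen's code. It assumes a two-layer ReLU network stored as nested lists and duplicates a caller-chosen hidden unit, whereas Chen selects the units to replicate at random. The names `forward` and `widen` and all weight values are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def forward(w1, w2, x):
    """Two-layer network: y = w2 . relu(w1 . x); one row of weights per unit."""
    hidden = relu([sum(w * xi for w, xi in zip(row, x)) for row in w1])
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def widen(w1, w2, unit):
    """Duplicate hidden `unit` and split its outgoing weights in half.

    The copy receives the same incoming weights, so both units produce the
    same activation; halving the outgoing weights keeps the output unchanged.
    """
    new_w1 = [list(row) for row in w1] + [list(w1[unit])]
    new_w2 = []
    for row in w2:
        half = row[unit] / 2.0
        new_row = list(row)
        new_row[unit] = half
        new_row.append(half)
        new_w2.append(new_row)
    return new_w1, new_w2


# Hypothetical smaller network: 2 inputs, 2 hidden units, 1 output.
w1 = [[1.0, -2.0], [0.5, 0.5]]
w2 = [[1.0, 3.0]]
x = [2.0, 0.5]

y_small = forward(w1, w2, x)
wide_w1, wide_w2 = widen(w1, w2, unit=1)
y_wide = forward(wide_w1, wide_w2, x)
```

The wider network starts from the same input-output function as the smaller one, which is the sense in which the knowledge of the smaller network is carried over before further training.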
	

The combination of Liu and Chen teaches conditioning the execution of the directed graph using execution data from the simplified directed graph, but does not explicitly teach conditioning the execution of the directed graph, during the application of the live input tensor to the directed graph, using the collection of execution data; and obtaining an output tensor from the conditional execution of the directed graph.

Deng teaches conditioning the execution of the directed graph, by selecting, during the application of the input tensor to the directed graph, computations for suppression using the collection of execution data (Deng, Figure 2. p.1, “This paper introduces Dynamic Deep Neural Networks (D2NN), a new type of feed-forward deep neural network that allows selective execution. That is, given an input, only a subset of neurons are executed, and the particular subset is determined by the network itself and dependent on the particular input.” Examiner Note: Selective execution of nodes in a neural network, based on the input to the neural network, is seen as equivalent to selecting computations for suppression.); and
obtaining an output tensor from the conditional execution of the directed graph (Deng, “As an example application, this D2NN can be used for binary classification of images, where some images can be rapidly classified as negative after only a small amount of computation.”).

Liu, Chen, and Deng are analogous art because all three are directed to neural networks. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine Liu’s network simplification with Chen’s knowledge transfer and Deng’s selective execution. One of ordinary skill in the art would have been motivated to make this combination in order to increase computational efficiency, which can be accomplished with selective execution (Deng, p. 1, “D2NNs provide a way to improve computational efficiency by selective execution, pruning unnecessary computation depending on input.”).
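
Illustrative note: the input-dependent selective execution that Deng describes, in which some inputs are classified as negative after only a small amount of computation, can be sketched as follows. This is a minimal sketch and is not Deng's D2NN, which learns its control decisions. Here a fixed threshold stands in for the learned controller, and the stage functions, the threshold values, and the inputs are hypothetical.

```python
def cheap_stage(x):
    """Inexpensive first computation: mean of the input values."""
    return sum(x) / len(x)


def expensive_stage(x):
    """Costlier computation that is suppressed for easy negatives."""
    return sum(v * v for v in x) / len(x)


def classify(x, gate_threshold=0.2):
    """Return (label, executed_stages); the gate decides per input what runs."""
    executed = ["cheap_stage"]
    score = cheap_stage(x)
    if score < gate_threshold:
        # Early negative: the expensive stage is never executed for this input.
        return 0, executed
    executed.append("expensive_stage")
    label = 1 if expensive_stage(x) > 0.5 else 0
    return label, executed


easy_negative = classify([0.0, 0.1, 0.1])
harder_input = classify([0.9, 0.8, 1.0])
```

For the first input only the cheap stage runs, while the second input triggers both stages, so the set of executed computations depends on the input itself.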

As per claim 20, Liu teaches The system from claim 19, further comprising: a means for storing the execution data in memory as stored execution data (Liu, “In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 51-56]. “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, lines 22-38].); and
a means for priming the directed graph for the conditional execution, prior to the conditional execution of the directed graph, using the stored execution data (Liu, “For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 37-56].).

As per claim 21, Liu teaches The system from claim 19, further comprising: a means for generating a markup of the directed graph using the collection of execution data (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 
7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of step 704 can reduce a smaller number (e.g., 10%) of non-zero weights for each filter (using thresholding or L1-norm minimization) and each iteration of 706 can refine (possible multiple times) the remaining non-zero weights. Steps 704 and 706 can be iterated until a stopping condition is reached. For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used. It is also possible that these steps can be iterated until a target percentage of weights in each filter are set to zero.  In a possible embodiment, thresholding can be used to sparsify the weights of the network. In particular, a certain percentage of weights that have the largest magnitudes in each filter can be retained with the rest of the weights set to zero. In a possible implementation large percentage (e.g., 90% or 95%) of the weights can be set to zero each filter, and then a number of iterations (e.g., 30) of supervised back-propagation can be performed to refine the remaining non-zero weights. 
In another possible implementation, a smaller percentage (e.g., 10%) of weights can be set to zero, followed by supervised back-propagation to refine the remaining weights, and these steps can be iterated until a target percentage of weights in each filter are set to zero or until an overall accuracy of the approximated deep neural network decreases by a certain amount” [col 10, line 34 – col 11, line 29]. Examiner Note: Liu’s weight sparsification is seen as equivalent to the markup of the instant application);
wherein the markup identifies a priority value for a weight tensor (Liu, “In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of step 704 can reduce a smaller number (e.g., 10%) of non-zero weights for each filter (using thresholding or L1-norm minimization) and each iteration of 706 can refine (possible multiple times) the remaining non-zero weights. Steps 704 and 706 can be iterated until a stopping condition is reached. For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used. It is also possible that these steps can be iterated until a target percentage of weights in each filter are set to zero.  In a possible embodiment, thresholding can be used to sparsify the weights of the network. 
In particular, a certain percentage of weights that have the largest magnitudes in each filter can be retained with the rest of the weights set to zero. In a possible implementation large percentage (e.g., 90% or 95%) of the weights can be set to zero each filter, and then a number of iterations (e.g., 30) of supervised back-propagation can be performed to refine the remaining non-zero weights. In another possible implementation, a smaller percentage (e.g., 10%) of weights can be set to zero, followed by supervised back-propagation to refine the remaining weights, and these steps can be iterated until a target percentage of weights in each filter are set to zero or until an overall accuracy of the approximated deep neural network decreases by a certain amount” [col 10, line 34 – col 11, line 29].); and
wherein conditioning of the execution of the directed graph uses the markup (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 
7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of step 704 can reduce a smaller number (e.g., 10%) of non-zero weights for each filter (using thresholding or L1-norm minimization) and each iteration of 706 can refine (possible multiple times) the remaining non-zero weights.” [col 10, line 34 – col 11, line 29].).
As per claim 22, Liu teaches The system from claim 21, further comprising: a means for storing the markup in a distributed set of memory locations (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, lines 22-38]. Examiner Note: The examiner recognizes random access memory, a known form of distributed memory, as a possible embodiment of Liu’s memory unit.); and
a means for obtaining the priority value and the weight tensor from a memory location in the distributed set of memory locations using a single address (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 
7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of step 704 can reduce a smaller number (e.g., 10%) of non-zero weights for each filter (using thresholding or L1-norm minimization) and each iteration of 706 can refine (possible multiple times) the remaining non-zero weights. Steps 704 and 706 can be iterated until a stopping condition is reached. For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used. It is also possible that these steps can be iterated until a target percentage of weights in each filter are set to zero.  ” [col 10, line 34 – col 11, line 29].).

As per claim 23, Liu teaches The system from claim 19, further comprising: a means for generating a markup of the directed graph using the collection of execution data (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 
7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of step 704 can reduce a smaller number (e.g., 10%) of non-zero weights for each filter (using thresholding or L1-norm minimization) and each iteration of 706 can refine (possible multiple times) the remaining non-zero weights. Steps 704 and 706 can be iterated until a stopping condition is reached. For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used. It is also possible that these steps can be iterated until a target percentage of weights in each filter are set to zero.” [col 10, line 34 – col 11, line 29]. Examiner Note: Liu’s weight sparsification is seen as equivalent to the markup of the instant application); 
a means for storing the markup in a distributed set of memory locations (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, lines 22-38]. Examiner Note: The examiner recognizes random access memory, a known form of distributed memory, as a possible embodiment of Liu’s memory unit.); and
a means for conditioning an update of the set of the directed graph using the markup (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 
7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of step 704 can reduce a smaller number (e.g., 10%) of non-zero weights for each filter (using thresholding or L1-norm minimization) and each iteration of 706 can refine (possible multiple times) the remaining non-zero weights. Steps 704 and 706 can be iterated until a stopping condition is reached. For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used. It is also possible that these steps can be iterated until a target percentage of weights in each filter are set to zero.  In a possible embodiment, thresholding can be used to sparsify the weights of the network. In particular, a certain percentage of weights that have the largest magnitudes in each filter can be retained with the rest of the weights set to zero. In a possible implementation large percentage (e.g., 90% or 95%) of the weights can be set to zero each filter, and then a number of iterations (e.g., 30) of supervised back-propagation can be performed to refine the remaining non-zero weights. 
In another possible implementation, a smaller percentage (e.g., 10%) of weights can be set to zero, followed by supervised back-propagation to refine the remaining weights, and these steps can be iterated until a target percentage of weights in each filter are set to zero or until an overall accuracy of the approximated deep neural network decreases by a certain amount” [col 10, line 34 – col 11, line 29]. Examiner Note: Liu’s weight sparsification is seen as equivalent to the markup of the instant application).
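
For illustration only, the thresholding and constrained re-training steps quoted from Liu above (steps 704 and 706) can be sketched as follows. This is a minimal sketch under stated assumptions, not Liu's implementation: the function names, the flat-list representation of a filter, the keep fraction, and the learning rate are all hypothetical.

```python
def sparsify_by_threshold(weights, keep_fraction):
    """Keep the largest-magnitude weights of one filter and zero the rest.

    Returns (sparse_weights, mask), where mask[i] is True for a retained weight.
    The flat-list filter and keep_fraction parameter are illustrative assumptions.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(len(weights) * keep_fraction))
    # Order indices by descending magnitude; ties are broken by position.
    order = sorted(range(len(weights)), key=lambda i: (-abs(weights[i]), i))
    kept = set(order[:n_keep])
    mask = [i in kept for i in range(len(weights))]
    sparse = [w if m else 0.0 for w, m in zip(weights, mask)]
    return sparse, mask


def masked_update(weights, grads, mask, lr=0.1):
    """Apply one gradient step to the retained weights only.

    Weights that were set to zero are constrained to stay at zero.
    """
    return [w - lr * g if m else 0.0 for w, g, m in zip(weights, grads, mask)]


# Hypothetical filter with ten weights; retain the top 30% by magnitude.
filter_weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.1, -0.03, 0.6, 0.0]
sparse_weights, keep_mask = sparsify_by_threshold(filter_weights, 0.3)
```

Calling `masked_update(sparse_weights, grads, keep_mask)` repeatedly corresponds to repeating the refinement step, while calling `sparsify_by_threshold` again with a smaller keep fraction corresponds to the iterative variant described in the quoted passage.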

As per claim 24, Liu teaches A computer-implemented method for generating an inference from a neural network, in which each step is conducted by a processor, comprising (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components” [col 16, lines 22-26].):
deriving a simplified version of the neural network, wherein the neural network is an original neural network (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26]. Examiner Note: Liu’s neural network is seen as equivalent to the directed graph of the instant application.);
obtaining a collection of execution data during an execution of the neural network (Liu, “At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 49-56]. Examiner Note: Liu’s training data from a training input is seen as equivalent to the execution data of the pilot input tensor, and the pilot input tensor is seen as equivalent to the training input.).

Liu teaches the application of an input tensor to a directed graph prior to the simplification of the directed graph, but does not explicitly disclose applying an input tensor to the neural network, wherein the neural network is the original neural network and not the simplified version of the neural network; conditioning the execution of the neural network, during the application of the live input tensor to the neural network, using the collection of execution data; and obtaining an output tensor from the conditional execution of the neural network. 

Chen teaches applying an input to the neural network, wherein the neural network is the original neural network and not the simplified version of the neural network (p.2-3, “Specifically, suppose that a teacher network is represented by a function y = f(x; θ) where x is the input to the network, y is the output of the network, and θ is the parameters of the network” Examiner Note: Chen discloses applying input tensors to the larger network (see also, “Experiments”, p. 6-8). When Chen is applied to Liu, the resulting system would apply inputs to the original, larger directed graph subsequent to the creation of a simplified version of the graph.);
conditioning the execution of the neural network [by selecting, during the application of the input tensor to the neural network, computations for suppression] using the collection of execution data (Chen, p.1 “Specifically, we initialize the student to be a neural network that represents the same function as the teacher, but using a different parameterization. One of these transformations, Net2WiderNet allows replacing a model with an equivalent model that is wider (has more units in each hidden layer). Another of these transformations, Net2DeeperNet allows replacing a model that satisfies some properties with an equivalent, deeper model. After initializing the larger network to contain all of the knowledge previously acquired by the smaller network, the larger network may be trained to improve its performance.” Examiner Note: When Chen is applied to Liu, the resulting system would transfer the knowledge (e.g., use the execution data) of the smaller neural network (i.e., the simplified directed graph), to train (i.e., condition the execution of) the larger network (i.e., the original directed graph).).

Liu and Chen are analogous art because they are both directed to neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Liu’s network simplification with Chen’s knowledge transfer. The combination would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention because he/she would have been motivated to increase the speed of training the networks, which can be accomplished by transferring knowledge from a teacher network to a student network (Chen, p.1, “We use Net2Net as a general term describing any process of training a student network significantly faster than would otherwise be possible by leveraging knowledge from a teacher network that was already trained on the same task.”).
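
For illustration only, the function-preserving widening that the quoted Chen passage attributes to Net2WiderNet can be sketched for a small two-layer network as follows. This is a simplified sketch under stated assumptions, not Chen's implementation: the function names, the list-of-lists weight layout, the ReLU hidden layer, and the choice to duplicate a single named unit are all hypothetical.

```python
def forward(x, w1, w2):
    """Two-layer network: input -> ReLU hidden layer -> linear output.

    w1[j] holds the incoming weights of hidden unit j.
    w2[k][j] is the weight from hidden unit j to output k.
    """
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def widen(w1, w2, unit):
    """Add one hidden unit by duplicating `unit` without changing the function.

    The copy receives the same incoming weights; the outgoing weight of the
    original unit is split evenly between the original and the copy.
    """
    new_w1 = [list(row) for row in w1] + [list(w1[unit])]
    new_w2 = []
    for row in w2:
        half = row[unit] / 2.0
        new_row = list(row)
        new_row[unit] = half
        new_row.append(half)
        new_w2.append(new_row)
    return new_w1, new_w2


# Hypothetical teacher network with two hidden units and one output.
teacher_w1 = [[1.0, -1.0], [0.5, 0.5]]
teacher_w2 = [[2.0, -1.0]]
student_w1, student_w2 = widen(teacher_w1, teacher_w2, unit=0)
```

The widened student network starts from the same input-output mapping as the teacher, which is the property the quoted passage describes as initializing the larger network to contain the knowledge of the smaller one before further training.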
	

The combination of Liu and Chen teaches conditioning the execution of the directed graph using execution data from the simplified directed graph, but does not explicitly teach conditioning the execution of the neural network, during the application of the live input tensor to the neural network, using the collection of execution data; and obtaining an output tensor from the conditional execution of the neural network.

Deng teaches conditioning the execution of the neural network, by selecting, during the application of the input tensor to the directed graph, computations for suppression using the collection of execution data (Deng, Figure 2. p.1, “This paper introduces Dynamic Deep Neural Networks (D2NN), a new type of feed-forward deep neural network that allows selective execution. That is, given an input, only a subset of neurons are executed, and the particular subset is determined by the network itself and dependent on the particular input.” Examiner Note: Selective execution of nodes in a neural network, based on the input to the neural network, is seen as equivalent to selecting computations for suppression.); and
obtaining an output tensor from the conditional execution of the neural network (Deng, “As an example application, this D2NN can be used for binary classification of images, where some images can be rapidly classified as negative after only a small amount of computation.”).

Liu, Chen, and Deng are analogous art because they are directed to neural networks. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Liu’s network simplification with Chen’s knowledge transfer, and Deng’s selective execution. The combination would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention because he/she would have been motivated to increase computational efficiency, which can be accomplished with selective execution (Deng, p.1, “D2NNs provide a way to improve computational efficiency by selective execution, pruning unnecessary computation depending on input.”).
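
For illustration only, the input-dependent selective execution described in the quoted Deng passages can be sketched as follows. This is a minimal sketch under stated assumptions, not Deng's D2NN architecture: the two scoring stages, the hand-set gate threshold, and all function names are hypothetical, and the gate here is a fixed rule rather than a learned control node.

```python
def cheap_score(x):
    # Hypothetical low-cost stage: mean of the input values.
    return sum(x) / len(x)


def expensive_score(x):
    # Hypothetical high-cost stage: mean of the squared input values.
    return sum(v * v for v in x) / len(x)


def selective_forward(x, gate_threshold=0.1, decision_threshold=0.25):
    """Classify x as 0 or 1, running the expensive stage only when needed.

    Returns (label, stages_run). When the cheap score falls below
    gate_threshold, the input is classified as negative immediately and the
    expensive computation is suppressed.
    """
    stages_run = ["cheap"]
    if cheap_score(x) < gate_threshold:
        return 0, stages_run
    stages_run.append("expensive")
    label = 1 if expensive_score(x) > decision_threshold else 0
    return label, stages_run


# A near-zero input exits early; a stronger input runs both stages.
early_label, early_stages = selective_forward([0.0, 0.05, 0.0])
full_label, full_stages = selective_forward([0.9, 0.8, 0.7])
```

Which stages run depends on the particular input, so the set of suppressed computations is selected while the input is being applied rather than fixed in advance.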

As per claim 25, Liu teaches The computer-implemented method from claim 24, wherein: 
applying a pilot input tensor to the simplified version of the directed graph, to conduct the execution of the simplified version of the directed graph (Liu, “At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 49-56]. Examiner Note: Liu’s training input is seen as equivalent to the pilot input tensor.);
wherein the input tensor is a live input tensor (Liu, “A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero” [col 11, lines 41-44]. “FIG. 10 illustrates a method for approximating a trained deep neural network using functional approximation to reduce the number of nodes in each layer according to an embodiment of the present invention. At step 1002, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1” [col 13, lines 23-30]. Examiner Note: Liu’s second set of inputs in the supervised training set are seen as equivalent to a live input tensor.);
the first input and the second input are not identical (Liu, “In order to evaluate the effectiveness of this approach, the present inventors have used this approach in left ventricle (LV) apex detection in 2D MR images. The dataset contains 7961 images from 184 patients, from which positive and negative patches of 32×32 pixels were sampled. 75% of the patches were randomly selected for training and the rest were used for testing. Images of the same patient appear multiple times within the same set, but not both. Positive patches were generated by placing the center at the annotated ground truth and cropping the corresponding image patch. Negative patches were sampled far away from the ground truth location of the LV apex. A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero. The size of the layers of the trained deep neural [network] are 1024-1024-300-100-2. The training was initialized with unsupervised pre-training and then refined using supervised back-propagation. Table 1 shows the 2D LV apex classification error of approximations of the deep neural network generated using weight sparsification performed by thresholding with different sparse factors” [col 11, lines 30-51]. Examiner Note: Liu’s use of images where the test and training inputs are not identical sets ensures that the test input and live input are not identical.); and
the first input and the second input are stochastically dependent (Liu, “A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero” [col 11, lines 41-44]. “FIG. 10 illustrates a method for approximating a trained deep neural network using functional approximation to reduce the number of nodes in each layer according to an embodiment of the present invention. At step 1002, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1” [col 13, lines 23-30].  Examiner Note: Liu’s setting of random input pixels to zero makes the training input stochastically dependent upon the live input.).

As per claim 26, Liu teaches The computer-implemented method from claim 24, further comprising: storing the execution data in memory as stored execution data (Liu, “In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 51-56]. “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, lines 22-38].); and
priming the neural network for the conditional computation, prior to the conditional computation of the neural network, using the stored execution data (Liu, “For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 37-56].).

As per claim 27, Liu teaches The computer-implemented method from claim 24, wherein: the deriving of the simplified version of the neural network includes down-sampling the directed graph by a sampling factor (Liu, “According to an advantageous embodiment of the present invention, the SparseConnect and ShrinkConnect methods for approximating a trained deep neural network can be combined. The SparseConnect and ShrinkConnect methods exploit different types of redundancy within a trained deep neural network. The methods complement each other and may be combined to achieve an even greater speed up. For example, in a possible implementation, a trained deep neural network can be first be approximated using the ShrinkConnect method to reduce the number of nodes in each layer of the trained deep neural network, followed by using the SparseConnect method (using thresholding or re-weighted L1-norm minimization) to sparsify the weights in the filters connecting each layer in the approximation of the deep neural network resulting from applying the ShrinkConnect method. The present inventors tested this combined method using the thresholding approach for weight sparsification (SparseConnect) in order to approximate the trained deep neural network for LV apex detection in 2D MR images. The original trained deep neural network was simplified by a factor of 3 using the ShrinkConnect method (function approximation) and then further simplified by a factor of 10 using the SparseConnect method (weight sparsification)” [col 15, lines 21-43]. Examiner Note: Liu’s use of weight sparsification and function approximation by certain factors is seen as a form of down sampling.).

As per claim 28, Liu teaches The computer-implemented method from claim 4, wherein: the deriving of the simplified version of the directed graph includes replacing a set of original values of the set of weights with a set of replacement values (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26].).


As per claim 29, Liu teaches The computer-implemented method from claim 24, further comprising: generating a markup of the neural network using the collection of execution data (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of step 704 can reduce a smaller number (e.g., 10%) of non-zero weights for each filter (using thresholding or L1-norm minimization) and each iteration of 706 can refine (possible multiple times) the remaining non-zero weights. Steps 704 and 706 can be iterated until a stopping condition is reached. For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used. It is also possible that these steps can be iterated until a target percentage of weights in each filter are set to zero. In a possible embodiment, thresholding can be used to sparsify the weights of the network. In particular, a certain percentage of weights that have the largest magnitudes in each filter can be retained with the rest of the weights set to zero. In a possible implementation large percentage (e.g., 90% or 95%) of the weights can be set to zero each filter, and then a number of iterations (e.g., 30) of supervised back-propagation can be performed to refine the remaining non-zero weights. In another possible implementation, a smaller percentage (e.g., 10%) of weights can be set to zero, followed by supervised back-propagation to refine the remaining weights, and these steps can be iterated until a target percentage of weights in each filter are set to zero or until an overall accuracy of the approximated deep neural network decreases by a certain amount” [col 10, line 34 – col 11, line 29]. Examiner Note: Liu’s weight sparsification is seen as equivalent to the markup of the instant application);
wherein the markup identifies a priority value for a weight in the neural network (Liu, [col 10, line 34 – col 11, line 29], quoted in full above for the generating limitation.); and
wherein the conditional computation uses the markup (Liu, [col 10, line 34 – col 11, line 29], quoted in full above for the generating limitation.).
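For illustration only, the thresholding described in the cited passage of Liu (retaining the largest-magnitude weights in a filter, setting the rest to zero, and refining the remaining weights while the zeroed weights are constrained to stay at zero) can be sketched as follows. The function names `sparsify_filter` and `masked_update`, the `keep_fraction` and `lr` parameters, and the plain gradient step standing in for supervised back-propagation are assumptions made for this sketch and do not appear in Liu.

```python
def sparsify_filter(weights, keep_fraction):
    """Keep the largest-magnitude weights and set the rest to zero.

    Returns the sparse weights and a boolean mask of retained positions.
    """
    n_keep = max(1, round(keep_fraction * len(weights)))
    # Indices ordered by decreasing absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:n_keep])
    mask = [i in keep for i in range(len(weights))]
    sparse = [w if m else 0.0 for w, m in zip(weights, mask)]
    return sparse, mask


def masked_update(weights, grads, mask, lr=0.1):
    """One gradient step that refines retained weights and leaves zeroed weights at zero."""
    return [w - lr * g if m else 0.0 for w, g, m in zip(weights, grads, mask)]


# Hypothetical filter with ten coefficients; keep the top 20% by magnitude.
filter_weights = [0.05, -0.80, 0.10, 0.60, -0.02, 0.30, -0.01, 0.04, 0.90, -0.07]
sparse, mask = sparsify_filter(filter_weights, keep_fraction=0.2)
# Hypothetical gradients, used only to show that zeroed positions do not move.
refined = masked_update(sparse, [1.0] * len(sparse), mask)
```

In this sketch the mask records which weights survive thresholding, so any later refinement step can be restricted to those positions.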

As per claim 30, Liu teaches The computer-implemented method from claim 29, wherein: the conditional computation of the neural network includes reducing an accuracy of a computation using the weight based on the priority value (Liu, “For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used” [col 11, lines 7-12].).

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent No. US 9633306 B2 to Liu et al. (hereinafter, “Liu”), in view of U.S. Pub. No. US 20160162782 A1 to Park et al. (hereinafter, “Park”), and further in view of “Evaluation of Interpolation Effects on Upsampling and Accuracy of Cost Functions-Based Optimized Automatic Image Registration” to Mahmoudzadeh and Kashou (hereinafter, “Mahmoudzadeh”).


As per claim 7, Liu discloses down-sampling of the directed graph, but does not disclose down-sampling of the directed graph utilizing polynomial interpolation.

Mahmoudzadeh teaches The computer-implemented method from claim 6, wherein: the down-sampling of the directed graph utilizes polynomial interpolation (Mahmoudzadeh, 1.1.4, “B-spline interpolation uses weighted voxel values in a wider neighborhood compared to trilinear interpolation, but both the B-spline and trilinear kernels are symmetrical and separable. The place of the neighboring points as control points relates to B-spline interpolation and combines the intensity values at these places using a set of polynomial basis according to (5) [16].
Equation (5) shows k-order B-spline with n + 1 control points ($P_1, P_2, \ldots, P_n$),
$$P(t) = \sum_{i=1}^{n+1} N_{i,k} P_i, \qquad t_{\min} \le t < t_{\max}. \tag{5}$$
In (5), $N_{i,k}$ are the polynomial functions of order $k$ (degree $k - 1$), and $n$ is the number of control points; $k$ must be at least 2 (linear) and less than $n + 1$.
$P(t)$ is validly defined for $t_{\min} \le t < t_{\max}$ where $t_{\min} = t_k$ and $t_{\max} = t_{n+2}$. A knot vector ($t_1, t_2, \ldots, t_{k+(n+1)}$) must be determined. This specifies the values of $t$ at which the pieces of curve join, like knots joining bits of string. It is important to note that the degree of the weighting polynomial (the order of the curve) is not dependent on the number of control points, $n$ [17]. The weighting polynomial can be recursively defined by the following equation [18]”. Examiner Note: When combined into Liu and Park’s neural network approximation system, Mahmoudzadeh’s polynomial interpolation method would result in the down-sampling of a directed graph utilizing polynomial interpolation.).
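For illustration only, the recursive definition of the weighting polynomial referred to at the end of the quoted passage is commonly given by the Cox-de Boor recursion, which can be sketched as follows. The recursion shown is the standard textbook form and is an assumption here, since the equation itself is not reproduced in the quoted passage; the function names `bspline_basis` and `bspline_point`, the 0-based indexing, and the uniform knot vector are likewise assumptions made for this sketch.

```python
def bspline_basis(i, k, t, knots):
    """Order-k (degree k - 1) B-spline basis function for 0-based index i.

    Standard Cox-de Boor recursion; terms with a zero denominator are treated as zero.
    """
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    d1 = knots[i + k - 1] - knots[i]
    if d1 > 0:
        left = (t - knots[i]) / d1 * bspline_basis(i, k - 1, t, knots)
    d2 = knots[i + k] - knots[i + 1]
    if d2 > 0:
        right = (knots[i + k] - t) / d2 * bspline_basis(i + 1, k - 1, t, knots)
    return left + right


def bspline_point(t, control, k, knots):
    """Weighted sum of control values, mirroring the form of Equation (5)."""
    return sum(bspline_basis(i, k, t, knots) * p for i, p in enumerate(control))


# Hypothetical example: four control values, order 3 (quadratic), uniform knots.
# The knot vector has len(control) + k entries, matching the count k + (n + 1).
control = [0.0, 1.0, 2.0, 3.0]
knots = [0, 1, 2, 3, 4, 5, 6]
value = bspline_point(2.5, control, 3, knots)
```

Within the valid parameter range the basis functions sum to one, so the interpolated value is a weighted average of neighboring control values.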

Liu, Park, and Mahmoudzadeh are analogous art because they are directed towards enhanced data processing methods. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Liu and Park’s neural network approximation method with Mahmoudzadeh’s polynomial interpolation method. The modification would have been obvious to one of ordinary skill in the art because they would have been motivated to reduce computational demand of the resulting neural network while maintaining acceptable accuracy, which can be accomplished through a decrease of resolution via polynomial interpolation (Mahmoudzadeh, abstract).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent No. US 9633306 B2 to Liu et al. (hereinafter, “Liu”), in view of U.S. Pub. No. US 20160162782 A1 to Park et al. (hereinafter, “Park”), and further in view of U.S. Pub. No. US 20180046894 A1 to Yao et al. (hereinafter, “Yao”).

As per claim 9, Liu and Park disclose the method of claim 8, but fail to teach wherein the replacing comprises one of: reducing a number of bits used to represent the set of original values to obtain the set of replacement values; and calculating the set of replacement values using a set of exponents of the set of original values.

Yao teaches wherein the replacing comprises one of: reducing a number of bits used to represent the set of original values to obtain the set of replacement values (Yao, “Using short fixed-point numbers instead of long floating-point numbers is efficient for implementations on the FPGA platform and can significantly reduce memory footprint and bandwidth requirements. A shorter bit width is always wanted, but it may lead to a severe accuracy loss. Though fixed-point numbers have been widely used in ANN accelerator designs, there is no comprehensive investigation on different quantization strategies and the tradeoff between the bit length of fixed-point numbers and the accuracy” [0096].); and
calculating the set of replacement values using a set of exponents of the set of original values (Yao, “For a fixed-point number, its value can be expressed as (9), where bw is the bit width of the number and f_l is the fractional length which can be negative.” [0098-0099], Equation 9).
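For illustration only, a fixed-point representation with a bit width bw and a fractional length that can be negative, as described in the cited passage of Yao, can be sketched as follows. Equation 9 of Yao is not reproduced in this action, so the sketch assumes the common convention that the represented value equals a signed integer code multiplied by 2 raised to the negative fractional length; the function names `to_fixed_point` and `from_fixed_point`, the two's-complement range, and the saturation behavior are assumptions made for this sketch.

```python
def to_fixed_point(x, bw, fl):
    """Quantize x to a signed bw-bit integer code with fractional length fl.

    Assumed convention: value is approximately code * 2**(-fl).
    Out-of-range values saturate to the nearest representable code.
    """
    lo, hi = -(1 << (bw - 1)), (1 << (bw - 1)) - 1
    code = round(x * 2.0 ** fl)
    return max(lo, min(hi, code))


def from_fixed_point(code, fl):
    """Recover the real value represented by an integer code."""
    return code * 2.0 ** (-fl)


# Hypothetical weight stored with 8 bits, 6 of them fractional.
code = to_fixed_point(0.7, bw=8, fl=6)
approx = from_fixed_point(code, fl=6)

# A negative fractional length trades resolution for range.
coarse = from_fixed_point(to_fixed_point(1000.0, bw=8, fl=-3), fl=-3)
```

In this convention the replacement value is determined by a power-of-two exponent together with a reduced-width integer, which is why a shorter bit width lowers storage at the cost of precision.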

Liu, Park, and Yao are analogous art because they relate to neural network efficiency improvements. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Liu and Park’s neural network approximation method with Yao’s neural network compression. The modification would have been obvious to one of ordinary skill in the art because they would have been motivated to reduce the computational demand of neural networks, which can be accomplished by Yao’s quantization (Yao, abstract, [0002]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: US Pub. No. 20150106310 A1 to Birdwell and Schuman, and US Pub. No. 20180137406 A1 to Howard et al.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL G SMITH whose telephone number is (571)272-9730. The examiner can normally be reached M-F 8:00-17:00 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
Respectfully Submitted,



/P.G.S./Examiner, Art Unit 2126
/NICHOLAS KLICOS/Primary Examiner, Art Unit 2145