Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-30 are pending in the present application.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:

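(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.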
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. 

Claim 19
means for deriving a simplified version of the directed graph
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” Fig. 2. 201
[0019] “The simplified version of the directed graph may be a down-sampled version of the directed graph. The down-sampling can involve reducing the resolution of the individual elements associated with the edges and vertices of the directed graph. For example, with specific reference to an ANN with convolutional and fully connected layers, the weight and filter values could be rounded off to reduce the number of bits required to represent each value. The simplification can be conducted at the graph, sector, layer, or element level.”
means for applying a pilot input tensor to the simplified version of the directed graph
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.”
[0018] “The application of an input to the directed graph can be conceptualized as the provisioning of values to the origin vertices of the graph. For example, with reference to Fig. 1, applying input tensor X to directed graph 100 involves obtaining the values of the elements of tensor X from memory and making them available to the hardware that will conduct the calculations associated with the first set of edges of directed graph 100.”

means for obtaining a collection of execution data during the application of the pilot input tensor to the simplified version of the directed graph
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” Fig. 2. 203
[0022] “Data flow diagram 210 represents the pilot input tensor X being applied to the simplified version of the directed graph 212 to produce execution data 213. The execution data 213 is represented as a markup of the simplified version of the directed graph wherein highlighted portions are 
[0038] “Fig. 4 provides a conceptual data flow diagram for how the execution data and markup can be generated during the execution of the directed graph. As illustrated, different edges of the directed graph will be associated with different calculations 405 and 406. The two illustrated calculations are two matrix multiplications that could represent the multiplication of a set of weights with an input from a prior layer for purposes of generating a data element for the next layer in an artificially neural network. In the basic example illustrated in Fig. 4, the output of these calculations are compared to a threshold value Z. If the threshold is exceeded, the calculation is considered of high priority. If the threshold is not exceeded, the calculation is considered of low priority. In this example, the execution data is the determination made by this calculation. The execution data can then be used to contribute to a markup of the directed graph as illustrated by the different shading levels in markup 404.”

means for applying a live input tensor to the directed graph
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” Fig. 2. 205
[0018] “The application of an input to the directed graph can be conceptualized as the provisioning of values to the origin vertices of the graph. For example, with reference to Fig. 1, applying input tensor X to directed graph 100 involves obtaining the values of the elements of tensor X from memory and making them available to the hardware that will conduct the calculations associated with the first set of edges of directed graph 100.”
[0032] “Once the simplified version of the directed graph is obtained, a pilot tensor is applied to the simplified version as described above with reference to step 202. The pilot tensor and simplified version of the directed graph are used to obtain relevant information regarding how the actual directed graph 

means for conditioning the execution of the directed graph, during the application of the live input tensor to the directed graph, using the collection of execution data
[0040] “The execution of the directed graph can be conditioned in numerous ways. Generally, the degree to which the computation is conditioned can be set to vary across the directed graph and can include various gradations that align with the relative priority of that portion of the graph. For example, regions of relatively high priority could be computed just as they would be in the unconditionally executed directed graph, while regions of relatively low priority could be excluded from computation entirely. The various approaches for conditional computation discussed below could be mixed and assigned in various ways to the levels of priority. For example, high, medium, and low priorities could be associated with three entirely separate conditional computation schemes. As another example, the conditional computation scheme could be held constant across the directed graph, but the relative accuracy of the scheme could be modified in accordance with the priorities. For example, a degree of rounding or down-sampling could be set 

means for obtaining an output tensor from the conditional execution of the directed graph
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” 
“Execution of the directed graph will involve the execution of calculations associated with the edges of the directed graph, and the ultimate generation of output tensor Y. Tensor Y is therefore obtained from the directed graph and can be stored in memory as a distinct unit of data once the directed graph has been executed. Tensor Y can be an inference tensor generated by a machine intelligence system. However, the directed graphs executed by the methods of flow chart 200 can include multiple inputs or multiple outputs and can represent other 



Claim 20
means for storing the execution data in memory as stored execution data
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” [0021] “Steps 202 and 203 are illustrated as sequential because the execution data is generally available for storage in memory after the input tensor has been applied and the graph has completed execution.” Fig. 2 202-203.
[0034] “However, the execution data 404 is produced and stored orthogonally to the main data flow of the directed graph. The execution data can be obtained and stored in various ways. The execution data can be obtained during the application of the input tensor to the simplified version of the directed graph by monitoring the values produced internally during the calculations associated with the edges of the directed graph.”

means for priming the directed graph for the conditional execution, prior to the conditional execution of the directed graph, using the stored execution data
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” Fig. 2. 204-205
[0039] “For example, the directed graph could be primed for conditional execution prior to the conditional execution of the directed graph, using the stored execution data. In particular, in the approach in which the execution data is stored in the header of packets representing the directed graph, the directed graph would thereby be effectively primed for conditional execution because the priority data would be available for utilization to condition execution in real time as the payload of the packet was pulled for computation during the execution of the directed graph. The priming could include identifying the associated portion of directed graph data, packaging the execution and directed graph data into a data package, and storing the data package at a set location in memory. In another example, the execution of the directed graph will 



Claim 21
means for generating a markup of the directed graph using the collection of execution data
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” 
[0035] “The execution data can be utilized to produce a markup of the simplified version of the directed graph which tags the directed graph with different levels of priority such as high, medium, or low. These priority values could then be stored in association with different portions of the directed 
[0038] “The execution data can then be used to contribute to a markup of the directed graph as illustrated by the different shading levels in markup 404.” Figures 2, 4.



Claim 22
means for storing the markup in a distributed set of memory locations
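[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.”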
[0037] “The execution data can be stored in association with the portions of the directed graph to which they relate in various ways. For example, a markup could be stored in a distributed set of memory locations, or at a single memory location such that all of the data could be recalled using a single memory address or a contiguous sequence of memory addresses. The data can also be stored as an entirely separate data structure in memory. To use the example of 213, the heat map could be stored separately with priority levels and tags identifying specific portions of the graph. Alternatively, the data or markup can be stored directly within the data structures that represent the directed graph and can be obtained along with the data for the directed 

means for obtaining the priority value and the weight tensor from a memory location in the distributed set of memory locations using a single address
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” [0021] “Steps 202 and 203 are illustrated as sequential because the execution data is generally available for storage in memory after the input tensor has been applied and the graph has completed execution.” Fig. 2 202-203.
[0037] “The execution data can be stored in association with the portions of the directed graph to which they relate in various ways. For example, a markup could be stored in a distributed set of 



Claim 23
means for generating a markup of the directed graph using the collection of execution data
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” 
[0035] “The execution data can be utilized to produce a markup of the simplified version of the directed graph which tags the directed graph with different levels of priority such as high, medium, or low. These priority values could then be stored in association with different portions of the directed graph. The different levels of priority can describe how much of a contribution to the output tensor the various portions of the directed graph contributed. The markup can have fixed gradations or can be a heat map with smooth transitions across the graph to indicate the various levels of priority. The priority values for each edge or vertex can be calculated in real time as the directed graph is executing 
[0038] “The execution data can then be used to contribute to a markup of the directed graph as illustrated by the different shading levels in markup 404.” Figures 2, 4.

means for storing the markup in a distributed set of memory locations
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” [0021] “Steps 202 and 203 are illustrated as sequential because the execution data is generally available for storage in memory after the input tensor has been applied and the graph has completed execution.” Fig. 2 202-203.


means for conditioning an update of the directed graph using the markup
[0018] “The steps of flow chart 200 can be explained with reference to conceptual data flow diagram 210. Each of the steps can be conducted by a processor operating in combination with a memory for storing the related data structures and the instructions necessary to carry out the steps.” 
[0035] “The execution data can be utilized to produce a markup of the simplified version of the directed graph which tags the directed graph with different levels of priority such as high, medium, or low. These priority values could then be stored in association with different portions of the directed graph. The different levels of priority can describe how much of a contribution to the output tensor the various portions of the directed graph contributed. The markup can have fixed gradations or can be a heat map with smooth transitions across the graph to indicate the various levels of priority. The priority values for each edge or vertex can be calculated in real time as the directed graph is executing calculations associated with that edge or vertex. For 
[0038] “The execution data can then be used to contribute to a markup of the directed graph as illustrated by the different shading levels in markup 404.” Figures 2, 4.
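
For illustration only, the flow described in the quoted passages ([0019], [0038], [0040]) can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation and not a construction of any claim: the single-layer "graph", the function names, the rounding precision, the threshold value, and the choice to emit 0.0 for skipped calculations are all hypothetical.

```python
# Illustrative sketch only. All names, values, and the one-layer "graph"
# are assumptions made for this example; none come from the specification.

def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def simplify(weights, digits=1):
    """Derive a simplified layer by rounding each weight (cf. [0019])."""
    return [[round(w, digits) for w in row] for row in weights]

def collect_execution_data(simplified, pilot_input, threshold):
    """Pilot run: tag each calculation as high priority (True) when its
    output exceeds threshold Z, otherwise low priority (cf. [0038])."""
    return [value > threshold for value in matvec(simplified, pilot_input)]

def conditional_execute(weights, live_input, markup):
    """Live run on the unsimplified layer: compute high-priority rows and
    skip low-priority rows, emitting 0.0 for them (one option in [0040])."""
    return [
        sum(w * xi for w, xi in zip(row, live_input)) if keep else 0.0
        for row, keep in zip(weights, markup)
    ]

weights = [[0.52, -0.13], [0.04, 0.01], [-0.91, 0.33]]
pilot = [1.0, 1.0]   # pilot input tensor (assumed values)
live = [1.0, 2.0]    # live input tensor (assumed values)

markup = collect_execution_data(simplify(weights), pilot, threshold=0.2)
output = conditional_execute(weights, live, markup)
```

In this sketch the markup is produced once from the pilot run on the rounded weights and then reused to decide which calculations are performed at full precision on the live input.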





Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6, 8, and 10-30 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent No. US 9633306 B2 to Liu et al. (hereinafter, “Liu”), in view of U.S. Pub. No. US 20160162782 A1 to Park et al. (hereinafter, “Park”).

As per claim 1, Liu teaches A computer-implemented method for executing a directed graph, in which each step is conducted by a processor, comprising (Liu, “The above-described :
deriving a simplified version of the directed graph (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26]. Examiner Note: Liu’s neural network is seen as equivalent to the directed graph of the instant application.);
applying a pilot input tensor to the simplified version of the directed graph (Liu, “At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 49-56]. “For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used” [col 11, lines 7-12]. Examiner Note: Liu’s training data from a training input is seen as equivalent to the execution data of the pilot input tensor, and the pilot input tensor is seen as equivalent to the training input.);
obtaining a collection of execution data during the application of the pilot input tensor to the simplified version of the directed graph (Liu, “At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 49-56]. Examiner Note: Liu’s training data from a training input is seen as equivalent to the execution data of the pilot input tensor, and the pilot input tensor is seen as equivalent to the training input.);
applying a live input tensor to the directed graph (Liu, “A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero” [col 11, lines 41-44]. “FIG. 10 illustrates a method for approximating a trained deep neural network using functional approximation to reduce the number of nodes in each layer according to an embodiment of the present invention. At step 1002, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1” [col 13, lines 23-30]. Examiner Note: Liu’s second set of inputs in the supervised training set are seen as equivalent to a live input tensor.).

Liu fails to disclose conditioning the execution of the directed graph, during the application of the live input tensor to the directed graph, using the collection of execution data; and obtaining an output tensor from the conditional execution of the directed graph. 

Park teaches conditioning the execution of the directed graph, during the application of the live input tensor to the directed graph, using the collection of execution data (Park, “The training processor 240 changes the structure of the CNN based on an approximation result from the approximation processor 210, the changed number of output reconstruction filters γ from the filter count changer 220, and the result of modifying the structure of the following convolution layer(s) from the layer structure modifier 230; said processor may then train the CNN of the modified structure by using training data. At this time, the training processor 240 may fix the values of input conversion filters α and convolution filter β and then train the CNN of the modified structure” [0054]. Examiner Note: The adjustment of the CNN based on the results from an approximation processor is seen as equivalent to conditioning the execution of a directed graph using test execution data); and
obtaining an output tensor from the conditional execution of the directed graph (Park, “Using the trained CNN, the classifier 250 classifies image data into classes. At this time, the result of said classification may include the class of the image data as well as the classification accuracy” [0059]. Examiner Note: Park’s results of image classification are seen as the results of executing the directed graph.).

Liu and Park are analogous art because they are both directed towards neural network approximation. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Liu’s neural network approximation system with Park’s neural network approximation. The modification would have been obvious to one of ordinary skill in the art because they would have been motivated to reduce neural network 


As per claim 2, Liu teaches The computer-implemented method from claim 1, wherein: the pilot input tensor and the live input tensor are not identical (Liu, “In order to evaluate the effectiveness of this approach, the present inventors have used this approach in left ventricle (LV) apex detection in 2D MR images. The dataset contains 7961 images from 184 patients, from which positive and negative patches of 32×32 pixels were sampled. 75% of the patches were randomly selected for training and the rest were used for testing. Images of the same patient appear multiple times within the same set, but not both. Positive patches were generated by placing the center at the annotated ground truth can cropping the corresponding image patch. Negative patches were sampled far away from the ground truth location of the LV apex. A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero. The size of the layers of the trained deep neural are 1024-1024-300-100-2. The training was initialized with unsupervised pre-training and then refined using supervised back-propagation. Table 1 shows the 2D LV apex classification error of approximations of the deep neural network generated using weight sparsification performed by thresholding with different sparse factors” [col 11, lines 30-51]. Examiner Note: Liu’s use of images where the test and training inputs are not identical sets ensures that the test input and live input are not identical, and thus are seen as equivalent to the pilot and live inputs of the instant application.); and
the pilot input tensor and the live input tensor are stochastically dependent (Liu, “A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero” [col 11, lines 41-44]. “FIG. 10 illustrates a method for approximating a trained deep neural network using functional approximation to reduce the number of nodes in each layer according to an embodiment of the present invention. At step 1002, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1” [col 13, lines 23-30].  Examiner Note: Liu’s setting of random input pixels to zero makes the training input stochastically dependent upon the live input.).

As per claim 3, Liu teaches The computer-implemented method from claim 1, further comprising: storing the execution data in memory as stored execution data (Liu, “In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 51-56]. “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer Software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define Such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and ; and
priming the directed graph for the conditional execution, prior to the conditional execution of the directed graph, using the stored execution data (Liu, “For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 37-56].).

As per claim 4, Liu teaches The computer-implemented method from claim 1, wherein: the directed graph includes a set of vertices and a set of edges interconnecting the set of vertices (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. ;
the directed graph is a neural network (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26].);
the set of edges of the directed graph are calculations involving a set of weights for the neural network, wherein the set of weights include at least one weight tensor (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26].);
at least a subset of the set of vertices are weights for the neural network (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of ;
the conditional execution of the directed graph produces an inference tensor (Liu, “At step 108, anatomical object detection is performed in the medical image using the approximation of the trained deep neural network. In a possible implementation, a sliding window approach can be used in which a respective image patch centered at each pixel or voxel is extracted from the medical image. Each image patch is input to the approximation of the trained deep neural network, which operates directly on the pixels or voxels in each patch. If the trained deep neural network is a discriminative deep neural network, the approximation of the trained deep neural network calculates for each image patch, a probability that the target anatomical landmark is located at the pixel or voxel at which the image patch is centered. The location with the highest probability can then be selected as the detected anatomical landmark location in the medical image. If the trained deep neural network is a deep neural network regressor, the approximation of the trained deep neural network outputs a difference vector for each image patch that provides a displacement from the pixel or voxel at which the image patch is centered to a predicted location of the target anatomical landmark in the medical image. The predicted locations from each of the image patches can then be aggregated to determine the detected anatomical landmark location in the medical image. At step 110, the anatomical object detection result is output. For example, the anatomical object detection result can be output by displaying the medical image on a display device of the computer system with the anatomical object location marked or highlighted in the displayed medical image” [col 5, line 42 – col 6, line 4]. Examiner Note: Liu’s object detection result output is seen as equivalent to ; and
the inference tensor is a response of the neural network to the live input tensor (Liu, “At step 108, anatomical object detection is performed in the medical image using the approximation of the trained deep neural network. In a possible implementation, a sliding window approach can be used in which a respective image patch centered at each pixel or voxel is extracted from the medical image. Each image patch is input to the approximation of the trained deep neural network, which operates directly on the pixels or voxels in each patch. If the trained deep neural network is a discriminative deep neural network, the approximation of the trained deep neural network calculates for each image patch, a probability that the target anatomical landmark is located at the pixel or voxel at which the image patch is centered. The location with the highest probability can then be selected as the detected anatomical landmark location in the medical image. If the trained deep neural network is a deep neural network regressor, the approximation of the trained deep neural network outputs a difference vector for each image patch that provides a displacement from the pixel or voxel at which the image patch is centered to a predicted location of the target anatomical landmark in the medical image. The predicted locations from each of the image patches can then be aggregated to determine the detected anatomical landmark location in the medical image. At step 110, the anatomical object detection result is output. For example, the anatomical object detection result can be output by displaying the medical image on a display device of the computer system with the anatomical object location marked or highlighted in the displayed medical image” [col 5, line 42 – col 6, line 4]. Examiner Note: Liu’s object detection result output is seen as equivalent to an inference tensor, and is produced after the approximation and conditioning of Liu’s neural network. It is necessarily a result of the live input (image to be classified).).
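For illustration only, the sliding window detection quoted above for the discriminative case (score the patch centered at each pixel, then select the highest-scoring location) can be sketched as follows. This is a minimal sketch under stated assumptions: `extract_patch`, `detect_landmark`, and `center_intensity` are hypothetical names, the image is a 2D list, borders are zero-padded, and `center_intensity` is a toy stand-in for the approximated network rather than anything Liu discloses.

```python
def extract_patch(image, row, col, half):
    """Return the (2*half+1)-square patch centered at (row, col).

    Positions outside the image are zero-padded (an assumption of this sketch).
    """
    h, w = len(image), len(image[0])
    return [
        [image[r][c] if 0 <= r < h and 0 <= c < w else 0.0
         for c in range(col - half, col + half + 1)]
        for r in range(row - half, row + half + 1)
    ]


def detect_landmark(image, score_patch, half=1):
    """Score the patch at every pixel and return the best location and its score."""
    best_loc, best_score = None, float("-inf")
    for r in range(len(image)):
        for c in range(len(image[0])):
            score = score_patch(extract_patch(image, r, c, half))
            if score > best_score:
                best_loc, best_score = (r, c), score
    return best_loc, best_score


def center_intensity(patch):
    """Toy scorer standing in for the network: the intensity at the patch center."""
    mid = len(patch) // 2
    return patch[mid][mid]


if __name__ == "__main__":
    image = [[0.0, 0.0, 0.0],
             [0.0, 0.0, 5.0],
             [0.0, 1.0, 0.0]]
    print(detect_landmark(image, center_intensity))
```

The regressor case quoted above would instead aggregate a displacement vector per patch; that variant is not shown here.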

As per claim 5, Liu teaches The computer-implemented method from claim 4, wherein: an edge in the set of edges is a calculation using a four dimensional tensor (Liu, Figure 2, “As shown in FIG. 2, the AE 200 is a feed-forward neural network with one hidden layer 204. The AE 200 has an input layer L_1 202, the hidden layer L_2, and an output layer L_3 206. If the AE 200 is a fully connected network, each node in the input layer 202 can correspond to a respective voxel or pixel of an image patch. Ignoring the bias term (the nodes labeled as +1 in FIG. 2), the input and output layers 202 and 206, respectively have the same number of nodes. The goal of an AE is to minimize the difference between the input and output vectors. If the hidden layer 204 has a size equal to or larger than the input layer 202, an AE may learn an identify transformation. To prevent such a trivial solution, an AE can be set up with a hidden layer 204 with fewer nodes than the input layer 202. The nodes of the hidden layer 204 can be calculated as a function of a bias term and a weighted sum of the nodes of the input layer 202, where a respective weight is assigned to each connection between a node of the input layer 202 and a node in the hidden layer 204. The bias term and the weights between the input layer 202 and the hidden layer 204 are learned in the training of the AE 200, for example using a back-propagation algorithm. The nodes of the hidden layer 204 can be considered to be features extracted from the pixels (represented by the nodes of the input layer 202) of an input image patch, and the learned weights can be considered to be filters that filter the input image data to generate the features” [col 4, lines 16-42]. Examiner Note: Figure 2 demonstrates that Liu’s network can include a four dimensional calculation, which could be represented as a tensor.).
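For illustration only, the hidden layer computation quoted above (each hidden node is a function of a bias term and a weighted sum of the input nodes) can be sketched as follows. This is a minimal sketch under stated assumptions: the name `layer_forward`, the use of a sigmoid as the node function, and the example sizes are introduced here and are not taken from Liu's FIG. 2.

```python
import math


def sigmoid(z):
    """Logistic function used as the node non-linearity in this sketch."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(W, b, x):
    """Compute one layer: h_i = sigmoid(sum_j W[i][j] * x[j] + b[i]).

    W has one row per hidden node and one column per input node.
    """
    return [
        sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


if __name__ == "__main__":
    # Three input nodes feeding two hidden nodes (fewer hidden than input nodes).
    W = [[0.0, 0.0, 0.0],
         [1.0, -1.0, 0.0]]
    b = [0.0, 0.0]
    x = [2.0, 2.0, 7.0]
    print(layer_forward(W, b, x))
```

Each row of `W` plays the role of one learned filter, and each entry of the returned list is one extracted feature.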


As per claim 6, Liu teaches The computer-implemented method from claim 4, wherein: the deriving of the simplified version of the directed graph includes down-sampling the directed graph by a sampling factor (Liu, “According to an advantageous embodiment of the present invention, the SparseConnect and ShrinkConnect methods for approximating a trained deep neural network can be combined. The SparseConnect and ShrinkConnect methods exploit different types of redundancy within a trained deep neural network. The methods complement each other and may be combined to achieve an even greater speed up. For example, in a possible implementation, a trained deep neural network can be first be approximated using the ShrinkConnect method to reduce the number of nodes in each layer of the trained deep neural network, followed by using the SparseConnect method (using thresholding or re-weighted L1-norm minimization) to sparsify the weights in the filters connecting each layer in the approximation of the deep neural network resulting from applying the ShrinkConnect method. The present inventors tested this combined method using the thresholding approach for weight sparsification (SparseConnect) in order to approximate the trained deep neural network for LV apex detection in 2D MR images. The original trained deep neural network was simplified by a factor of 3 using the ShrinkConnect method (function approximation) and then further simplified by a factor of 10 using the SparseConnect method (weight sparsification)” [col 15, lines 21-43]. Examiner Note: Liu’s use of weight sparsification and function approximation by certain factors is seen as a form of down sampling.);
the simplified version of the directed graph is thereby a down-sampled version of the directed graph (Liu, “According to an advantageous embodiment of the present invention, the SparseConnect and ShrinkConnect methods for approximating a trained deep neural network can be combined. The SparseConnect and ShrinkConnect methods exploit different types of redundancy within a trained deep neural network. The methods complement each other and may be combined to achieve an even greater speed up. For example, in a possible implementation, a trained deep neural network can be first be approximated using the ShrinkConnect method to reduce the number of nodes in each layer of the trained deep neural network, followed by using the SparseConnect ;
a first complete set of tensors used for executing the simplified version of the directed graph has a rank (Liu, “In order to evaluate the effectiveness of this approach, the present inventors have used this approach in left ventricle (LV) apex detection in 2D MR images. The dataset contains 7961 images from 184 patients, from which positive and negative patches of 32×32 pixels were sampled. 75% of the patches were randomly selected for training and the rest were used for testing. Images of the same patient appear multiple times within the same set, but not both. Positive patches were generated by placing the center at the annotated ground truth can cropping the corresponding image patch. Negative patches were sampled far away from the ground truth location of the LV apex. A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero. The size of the layers of the trained deep neural are 1024-1024-300-100-2. The training was initialized with unsupervised pre-training and then refined using supervised back-propagation. Table 1 shows the 2D LV apex classification error of approximations of the deep neural network generated using weight sparsification performed by thresholding with different sparse factors” [col 11, lines 30-51]. Examiner Note: Liu’s test input images have a dimensionality equal to the dimensionality of the live inputs.); and
a second complete set of tensors used for executing the directed graph has the rank (Liu, “In order to evaluate the effectiveness of this approach, the present inventors have used this approach in left ventricle (LV) apex detection in 2D MR images. The dataset contains 7961 images from 184 patients, from which positive and negative patches of 32×32 pixels were sampled. 75% of the patches were randomly selected for training and the rest were used for testing. Images of the same patient appear multiple times within the same set, but not both. Positive patches were generated by placing the center at the annotated ground truth can cropping the corresponding image patch. Negative patches were sampled far away from the ground truth location of the LV apex. A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero. The size of the layers of the trained deep neural are 1024-1024-300-100-2. The training was initialized with unsupervised pre-training and then refined using supervised back-propagation. Table 1 shows the 2D LV apex classification error of approximations of the deep neural network generated using weight sparsification performed by thresholding with different sparse factors” [col 11, lines 30-51]. Examiner Note: Liu’s test input images have a dimensionality equal to the dimensionality of the live inputs.).


As per claim 8, Liu teaches The computer-implemented method from claim 4, wherein: the deriving of the simplified version of the directed graph includes replacing a set of original values of the set of weights with a set of replacement values (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar ; and
the simplified version of the directed graph has a same number of layers as the directed graph (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26]. Examiner Note: Several of Liu’s simplification methods do not reduce the layers of the network.).

As per claim 10, Liu teaches The computer-implemented method from claim 4, wherein: the collection of execution data includes a set of execution data values (Liu, “In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 51-56].); 
the set of execution data values and the set of vertices have uniquely corresponding elements (Liu, “In order to evaluate the effectiveness of this approach, the present inventors have used this approach in left ventricle (LV) apex detection in 2D MR images. The dataset contains 7961 images from 184 patients, from which positive and negative patches of 32×32 pixels were sampled. 
each uniquely corresponding vertex in the set of vertices produces a contribution to the inference tensor in response to the pilot input tensor (Liu, “In order to evaluate the effectiveness of this approach, the present inventors have used this approach in left ventricle (LV) apex detection in 2D MR images. The dataset contains 7961 images from 184 patients, from which positive and negative patches of 32×32 pixels were sampled. 75% of the patches were randomly selected for training and the rest were used for testing. Images of the same patient appear multiple times within the same set, but not both. Positive patches were generated by placing the center at the annotated ground truth can cropping the corresponding image patch. Negative patches were sampled far away from the ground truth location of the LV apex. A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero. The size of the layers of the trained deep neural are 1024-1024-300-100-2. The training was initialized with unsupervised pre-; and
each execution data value in the set of execution data values is proportional in magnitude to the contribution to the inference tensor of each uniquely corresponding vertex in the set of vertices (Liu, “The training of a deep neural network, such as a stacked denoising auto-encode can be performed based on stochastic gradient descent of a cost function measured as the Euclidean distance between predicted outcomes and the observations in the training data. In an ideal world, each node in the network should extract different pieces of information from the input image data so that the combination of nodes yields an accurate and robust prediction for the landmark location. However, there is no explicit constraint to prevent different nodes from learning the same thing. Moreover, due to the highly complex and non-convex nature of the optimization procedure used to train the deep neural network, the trained deep neural network will likely contain significant redundancy” [col 9, lines 52-64]. “In order to evaluate the effectiveness of this approach, the present inventors have used this approach in left ventricle (LV) apex detection in 2D MR images. The dataset contains 7961 images from 184 patients, from which positive and negative patches of 32×32 pixels were sampled. 75% of the patches were randomly selected for training and the rest were used for testing. Images of the same patient appear multiple times within the same set, but not both. Positive patches were generated by placing the center at the annotated ground truth can cropping the corresponding image patch. Negative patches were sampled far away from the ground truth location of the LV apex. A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of .

As per claim 11, Liu teaches The computer-implemented method from claim 4, further comprising: storing the execution data in a distributed set of memory locations (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, lines 22-38]. Examiner Note: The examiner recognizes random access memory, a known form of distributed memory, as a possible embodiment of Liu’s memory unit.); and
obtaining, from a memory location in the distributed set of memory locations using a single address, both: (i) a subset of execution data from the execution data; and (ii) a weight tensor from the set of weights (Liu, “For illustrative purposes, the SparseConnect approximation methods and the ShrinkConnect approximation methods (described below) are described herein as being used in combination with a stacked denoising auto-encoder (DAE) deep neural network. However it is to be understood that these methods can be similarly applied to any other trained deep neural network. Let W denote the weight matrix and h denote the output at each layer, and the input-output of an auto-encoder can be expressed as:
h^{(l)} = ƒ(W^{(l)} x + b^{(l)})    (6)
where ƒ is a non-linear rectification function like sigmoid function. The training of a deep neural network, such as a stacked denoising auto-encode can be performed based on stochastic gradient descent of a cost function measured as the Euclidean distance between predicted outcomes and the observations in the training data. In an ideal world, each node in the network should extract different pieces of information from the input image data so that the combination of nodes yields an accurate and robust prediction for the landmark location. However, there is no explicit constraint to prevent different nodes from learning the same thing. Moreover, due to the highly complex and non-convex nature of the optimization procedure used to train the deep neural network, the trained deep neural network will likely contain significant redundancy” [col 9, lines 39-64]. “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of ; and
wherein the conditioning of the execution of the directed graph is conducted in real time using the execution data and the set of weights (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 34-56].).
As per claim 12, Liu teaches The computer-implemented method from claim 4, further comprising: generating a markup of the directed graph using the collection of execution data (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of ; 
storing the markup in a distributed set of memory locations (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such ; and
conditioning an update of the set of weights using the markup (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be .
As per claim 13, Liu teaches The computer-implemented method from claim 4, further comprising: generating a markup of the directed graph using the collection of execution data (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of ;
wherein the markup identifies a priority value for a weight tensor (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in ; and
wherein conditioning of the execution of the directed graph uses the markup (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining .
As per claim 14, Liu teaches The computer-implemented method from claim 13, further comprising: storing the markup in a distributed set of memory locations (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, lines 22-38]. Examiner Note: The examiner recognizes random access memory, a known form of distributed memory, as a possible embodiment of Liu’s memory unit.); and
obtaining the priority value and the weight tensor from a memory location in the distributed set of memory locations using a single address (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-.
As per claim 15, Liu teaches The computer-implemented method from claim 13, further comprising: storing the markup at a single memory location (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, lines 22-38].);
wherein the conditioning of the execution of the directed graph further comprises: obtaining the markup from the single memory location (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of step 704 can reduce a smaller ;
obtaining a first subset of the set of weights from memory (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-; and
wherein the first subset is selected using the markup (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on .
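For illustration only, the thresholding variant of steps 704 and 706 quoted above (zero most weights in a filter, then refine the survivors while the zeroed weights stay at zero) can be sketched as follows. This is a minimal sketch and not Liu's implementation: the function names `sparsify` and `masked_update`, the `keep_fraction` parameter, and the plain gradient step with learning rate `lr` are assumptions introduced here.

```python
def sparsify(weights, keep_fraction):
    """Keep the largest-magnitude weights of one filter and zero the rest.

    Returns (sparse_weights, mask), where mask[i] is 1 for a retained weight.
    keep_fraction is a hypothetical parameter, not a value taken from Liu.
    """
    n_keep = max(1, round(len(weights) * keep_fraction))
    # Rank indices by absolute weight value, largest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:n_keep])
    mask = [1 if i in keep else 0 for i in range(len(weights))]
    sparse = [w if m else 0.0 for w, m in zip(weights, mask)]
    return sparse, mask


def masked_update(weights, grads, mask, lr=0.1):
    """One refinement step: only retained weights move, zeroed weights stay zero."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


# Example: keep 40% of a five-weight filter, then take one refinement step.
sparse, mask = sparsify([0.5, -0.05, 0.01, -0.9, 0.2], keep_fraction=0.4)
refined = masked_update(sparse, [1.0] * 5, mask, lr=0.1)
```

In this sketch the mask plays the role of the constraint that weights set to zero remain at zero during re-training.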
As per claim 16, Liu teaches The computer-implemented method from claim 13, wherein the conditioning of the execution of the directed graph further comprises: reducing an accuracy of a computation using the weight tensor based on the priority value (Liu, “The method of FIG. 3 is performed for each hidden layer (or each layer) in the trained deep neural network. As described above, the method of FIG. 3 can be performed for each hidden layer during training prior to training the subsequent layer of the deep neural network. In a possible implementation, the Haar wavelet approximation for each hidden layer can be performed during training of the deep neural network using iterative approximation and training steps. FIG. 6 illustrates iteratively training the deep neural network while approximating the weight matrices using wavelet approximation according to an embodiment of the present invention. As shown in FIG. 6, at step 602 neural network training is performed to train the weights matrices of the neural network, and at step 604, Haar wavelet approximation is performed to reconstruct the weights using 1D Haar wavelet bases and wavelet coefficients and a number of wavelet coefficients are set to zero. Steps 602 and 604 are then iterated. In each round of iteration, the wavelet coefficients that are set to zero are kept at zero, while the remaining coefficients are adjusted by the neural network training algorithm, such as backpropagation. The iterations can be repeated until a number of wavelet coefficients remaining converges, for a predetermined number of iterations, or until a stopping condition associated with a decrease in accuracy of the approximation of the deep neural network is reached. In an exemplary implementation, the steps of FIG. 6 can be iterated for each hidden layer during the training of the hidden layer. In another embodiment, each iteration of step 604 can be .
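For illustration only, the wavelet shrinkage quoted above (reconstructing weights from 1D Haar wavelet coefficients with a number of coefficients set to zero) can be sketched on a single weight vector. This is a simplified sketch under stated assumptions and not Liu's method as disclosed: it transforms one vector rather than a weight matrix, it assumes the vector length is a power of two, and the names `haar_forward`, `haar_inverse`, `approximate`, and `n_keep` are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_forward(x):
    """Orthonormal 1D Haar transform; len(x) is assumed to be a power of two."""
    out = list(x)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / SQRT2 for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / SQRT2 for i in range(half)]
        out[:n] = avg + diff  # averages first, then detail coefficients
        n = half
    return out


def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        rec = []
        for a, d in zip(out[:n], out[n:2 * n]):
            rec += [(a + d) / SQRT2, (a - d) / SQRT2]
        out[:2 * n] = rec
        n *= 2
    return out


def approximate(x, n_keep):
    """Keep the n_keep largest-magnitude coefficients, zero the rest, reconstruct."""
    coeffs = haar_forward(x)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:n_keep])
    shrunk = [c if i in keep else 0.0 for i, c in enumerate(coeffs)]
    return haar_inverse(shrunk)


# Example: a four-element weight vector approximated with two coefficients.
approx = approximate([4.0, 4.0, 2.0, 2.0], n_keep=2)
```

Keeping fewer coefficients gives a coarser reconstruction, which is the accuracy trade-off the quoted passage describes.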
As per claim 17, Liu teaches The computer-implemented method from claim 13, wherein the conditioning of the execution of the directed graph further comprises: obtaining a first subset of weights from the set of weights from memory (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each ;
replacing a set of original values of a second subset of the set of weights with a set of replacement values (Liu, “FIGS. 4 and 5 illustrate examples of approximating weight matrices for nodes of a hidden layer of a trained deep neural network using the method of FIG. 3. As shown in ; and
wherein the first subset of weights is selected using the markup (Liu, “In order to perform the anatomical landmark detection (step 108 of FIG. 1), the sliding window approach can be used where a plurality of image patches P are examined while sliding over the whole image or volume V. The computation of Φ.sup.TPΦ for each image patch for each node of the hidden layer can be sped up using integral imaging techniques when Haar wavelets are used for the wavelet bases. An integral image the same size as the original image is stored in a look-up table and the Haar wavelet bases determine which items (pixels) in the look-up table will be looked up. For example, the 4×4 Haar wavelet bases Φ.sub.4 shown in Equation (3) can be used, but the present invention is not limited thereto. In this 4×4 case, matrix multiplication PΦ amounts to four look-up operations for the multiplication with the first column of Φ.sub.4, four table look-ups and a minus operation for the second column, and two table look-ups and a minus operations for each of the third and fourth columns. This is faster than direct matrix multiplication. The same speed up can be obtained for the multiplication with Φ.sup.T. The same analysis described herein can be similarly applied to larger Haar wavelet bases as well. (32) Once Z=Φ.sup.TPΦ is obtained, the Frobenius inner .
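
For illustration only, the integral image look-up table mentioned in the quoted passage can be sketched as a generic summed-area table, in which the sum over any rectangular block of pixels is obtained with four table look-ups. This is a minimal sketch, not Liu's implementation, and it does not reproduce the specific look-up counts Liu gives for the 4×4 Haar bases. The names `integral_image` and `box_sum` and the zero-padded table layout are assumptions introduced here.

```python
def integral_image(img):
    """Build a summed-area table with one extra zero row and column.

    table[r][c] holds the sum of img[0:r][0:c], so border cases need no checks.
    """
    h, w = len(img), len(img[0])
    table = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            table[r + 1][c + 1] = table[r][c + 1] + row_sum
    return table


def box_sum(table, top, left, bottom, right):
    """Sum of rows top..bottom-1 and columns left..right-1 using four look-ups."""
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


# Example: a Haar-like response is a difference of two box sums.
patch = [[1, 2], [3, 4]]
table = integral_image(patch)
left_minus_right = box_sum(table, 0, 0, 2, 1) - box_sum(table, 0, 1, 2, 2)
```

Because each box sum costs a fixed number of look-ups regardless of block size, responses of this kind can be evaluated for every sliding-window patch without re-summing pixels.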

As per claim 18, Liu teaches The computer-implemented method from claim 17, wherein: the deriving of the simplified version of the directed graph includes replacing the set of original values of the set of weights with the set of replacement values (Liu, “FIGS. 4 and 5 illustrate examples of approximating weight matrices for nodes of a hidden layer of a trained deep neural network using the method of FIG. 3. As shown in FIG. 4, image 402 is a visualization of an original trained weight matrix associated with a hidden layer node, and image 404 is a visualization of an approximation of the weight matrix shown in image 402 using Haar wavelet reconstruction with half of the wavelet coefficients set to zero. As shown in FIG. 4, image 502 is a visualization of an original trained weight matrix associated with another hidden layer node, and image 504 is a visualization of an approximation of the weight matrix shown in image 502 using Haar wavelet reconstruction with half of the wavelet coefficients set to zero. Once all the weight matrices of the hidden layer are reconstructed using the wavelet coefficients and the 1D wavelet bases and shrinkage is performed on the wavelet coefficients, the wavelet coefficients and 1D wavelet bases can be used on the input image patches in place of the weight matrices in order to approximate the Frobenius inner product P:W=Σ.sub.mΣ.sub.nP(m,n)W(m,n), as follows: .

Claim 19 is a means-for system claim corresponding to method claim 1. Liu teaches A system for executing a directed graph comprising (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components” [col 16, lines 22-26].):
a means for deriving a simplified version of the directed graph (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26]. Examiner Note: Liu’s neural network is seen as equivalent to the directed graph of the instant application.);
a means for applying a pilot input tensor to the simplified version of the directed graph (Liu, “At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 49-56].);
a means for obtaining a collection of execution data during the application of the pilot input tensor to the simplified version of the directed graph (Liu, “At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 49-56]. Examiner Note: Liu’s training data from a training input is seen as equivalent to the execution data of the pilot input tensor, and the pilot input tensor is seen as equivalent to the training input.);
a means for applying a live input tensor to the directed graph (Liu, “A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero” [col 11, lines 41-44]. “FIG. 10 illustrates a method for approximating a trained deep neural network using functional approximation to reduce the number of nodes in each layer according to an embodiment of the present invention. At step 1002, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1” [col 13, lines .

Liu fails to disclose conditioning the execution of the directed graph, during the application of the live input tensor to the directed graph, using the collection of execution data; and obtaining an output tensor from the conditional execution of the directed graph. 

Park teaches a means for conditioning the execution of the directed graph, during the application of the live input tensor to the directed graph, using the collection of execution data (Park, “The training processor 240 changes the structure of the CNN based on an approximation result from the approximation processor 210, the changed number of output reconstruction filters γ from the filter count changer 220, and the result of modifying the structure of the following convolution layer(s) from the layer structure modifier 230; said processor may then train the CNN of the modified structure by using training data. At this time, the training processor 240 may fix the values of input conversion filters α and convolution filter β and then train the CNN of the modified structure” [0054]. Examiner Note: The adjustment of the CNN based on the results from an approximation processor is seen as equivalent to conditioning the execution of a directed graph using test execution data); and
a means for obtaining an output tensor from the conditional execution of the directed graph (Park, “Using the trained CNN, the classifier 250 classifies image data into classes. At this time, the result of said classification may include the class of the image data as well as the classification accuracy” [0059]. Examiner Note: Park’s results of image classification are seen as the results of executing the directed graph.).

As per claim 20, Liu teaches The system from claim 19, further comprising: a means for storing the execution data in memory as stored execution data (Liu, “In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 51-56]. “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, lines 22-38].); and
a means for priming the directed graph for the conditional execution, prior to the conditional execution of the directed graph, using the stored execution data (Liu, “For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each .

As per claim 21, Liu teaches The system from claim 19, further comprising: a means for generating a markup of the directed graph using the collection of execution data (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to ;
wherein the markup identifies a priority value for a weight tensor (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero ; and
wherein conditioning of the execution of the directed graph uses the markup (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in .
As per claim 22, Liu teaches The system from claim 21, further comprising: a means for storing the markup in a distributed set of memory locations (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, lines 22-38]. Examiner Note: The examiner recognizes random access memory, a known form of distributed memory, as a possible embodiment of Liu’s memory unit.); and
a means for obtaining the priority value and the weight tensor from a memory location in the distributed set of memory locations using a single address (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of .

As per claim 23, Liu teaches The system from claim 19, further comprising: a means for generating a markup of the directed graph using the collection of execution data (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in ; 
a means for storing the markup in a distributed set of memory locations (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the computer 1202 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 1212 (e.g., magnetic disk) and loaded into memory 1210 when execution of the computer program instructions is desired. Thus, the steps of the methods of FIGS. 1, 3, 6, 7, and 10 may be defined by the computer program instructions stored in the memory 1210 and/or storage 1212 and controlled by the processor 1204 executing the computer program instructions” [col 16, ; and
a means for conditioning an update of the set of the directed graph using the markup (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero .

As per claim 24, Liu teaches A computer-implemented method for generating an inference from a neural network, in which each step is conducted by a processor, comprising (Liu, “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known :
deriving a simplified version of the neural network (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in a layer of the trained deep neural network. In another embodiment, principal component analysis (PCA) can be applied to a space of the weight matrices over all of the nodes in a layer of the trained deep neural network” [col 2, lines 16-26]. Examiner Note: Liu’s neural network is seen as equivalent to the directed graph of the instant application.);
applying a first input to the simplified version of the neural network (Liu, “At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 49-56]. “For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used” [col 11, lines 7-12]. Examiner Note: Liu’s training data from a training input is seen as equivalent to the execution data of the pilot input tensor, and the pilot input tensor is seen as equivalent to the training input.);
obtaining a collection of execution data during the application of the first input to the neural network (Liu, “At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 49-56]. Examiner Note: Liu’s training data from a training input is seen as equivalent to the execution data of the pilot input tensor, and the pilot input tensor is seen as equivalent to the training input.);
applying a second input to the neural network (Liu, “A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero” [col 11, lines 41-44]. “FIG. 10 illustrates a method for approximating a trained deep neural network using functional approximation to reduce the number of nodes in each layer according to an embodiment of the present invention. At step 1002, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1” [col 13, lines 23-30]. Examiner Note: Liu’s second set of inputs in the supervised training set are seen as equivalent to a live input tensor.).

Liu fails to disclose conditioning the execution of the directed graph, during the application of the live input tensor to the directed graph, using the collection of execution data; and obtaining an output tensor from the conditional execution of the directed graph. 

Park teaches conditioning the computation of the neural network, during the application of the second input to the neural network (Park, “The training processor 240 changes the structure of the CNN based on an approximation result from the approximation processor 210, the changed number of output reconstruction filters γ from the filter count changer 220, and the result of modifying the structure of the following convolution layer(s) from the layer structure modifier 230; said processor may then train the CNN of the modified structure by using training data. At this time, the training processor 240 may fix the values of input conversion filters α and convolution filter β and then train the CNN of the modified structure” [0054]. Examiner Note: The adjustment of the CNN based on the results from an approximation processor is seen as equivalent to conditioning the computation of the neural network.); and
obtaining an inference from the conditional computation of the neural network (Park, “Using the trained CNN, the classifier 250 classifies image data into classes. At this time, the result of said classification may include the class of the image data as well as the classification accuracy” [0059]. Examiner Note: Park’s results of image classification are seen as the results of executing the neural network.); 
wherein the conditional computation of the neural network is conditioned using the execution data (Park, “The training processor 240 changes the structure of the CNN based on an approximation result from the approximation processor 210, the changed number of output reconstruction filters γ from the filter count changer 220, and the result of modifying the structure of the following convolution layer(s) from the layer structure modifier 230; said processor may then train the CNN of the modified structure by using training data. At this time, the training processor 240 may fix the values of input conversion filters α and convolution filter β and then train the CNN ; and
the conditional computation of the neural network is less computationally intensive than a non-conditional computation of the neural network using the second input (Park, “In addition, the larger the CNN model, the more precisely it may recognize objects. Thus, for object recognition, a model that is larger than generally required is used, which causes an increase in the amount of time spent in computation and recognition of the object” [0007]. “In one general aspect, a method of training a convolutional neural network (CNN) including a plurality of convolution layers stored in a non-transitory memory is provided, the method involving approximating, using a processor, a convolution layer among the plurality of convolution layers using a low-rank approximation; reducing the number of output reconstruction filters of the approximated convolution layer; modifying a structure of the CNN based on an approximation result and the reduced number of output reconstruction filters; and training the modified CNN” [0009]. “The training processor 240 changes the structure of the CNN based on an approximation result from the approximation processor 210, the changed number of output reconstruction filters γ from the filter count changer 220, and the result of modifying the structure of the following convolution layer(s) from the layer structure modifier 230; said processor may then train the CNN of the modified structure by using training data. At this time, the training processor 240 may fix the values of input conversion filters α and convolution filter β and then train the CNN of the modified structure” [0054].).
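
For illustration only, the reduction in computation attributed above to Park's low-rank approximation can be sketched as follows. This is a minimal sketch under stated assumptions, not Park's implementation: it uses a plain matrix-vector product rather than a convolution layer, and the names `dense_matvec`, `factored_matvec`, and `multiply_count` are hypothetical.

```python
def dense_matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def factored_matvec(A, B, x):
    """Compute (A @ B) @ x as A @ (B @ x), never forming the full m x n matrix.

    A is m x r and B is r x n; the factorization itself is assumed given.
    """
    return dense_matvec(A, dense_matvec(B, x))


def multiply_count(m, n, rank=None):
    """Scalar multiplications for one matrix-vector product.

    Full matrix: m * n. Rank-r factorization: r * n (for B @ x) plus m * r.
    """
    return m * n if rank is None else rank * (m + n)


# A 3 x 2 matrix of rank 1, written as the product of a 3 x 1 and a 1 x 2 factor.
A = [[1.0], [2.0], [3.0]]
B = [[4.0, 5.0]]
W = [[4.0, 5.0], [8.0, 10.0], [12.0, 15.0]]
x = [1.0, 2.0]

full_result = dense_matvec(W, x)
factored_result = factored_matvec(A, B, x)
```

Under these assumptions, a 64 x 32 layer approximated at rank 2 needs `multiply_count(64, 32, 2)` multiplications per input instead of `multiply_count(64, 32)`, which is the sense in which the approximated computation is less intensive.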



As per claim 25, Liu teaches The computer-implemented method from claim 24, wherein: the first input and the second input are not identical (Liu, “In order to evaluate the effectiveness of this approach, the present inventors have used this approach in left ventricle (LV) apex detection in 2D MR images. The dataset contains 7961 images from 184 patients, from which positive and negative patches of 32×32 pixels were sampled. 75% of the patches were randomly selected for training and the rest were used for testing. Images of the same patient appear multiple times within the same set, but not both. Positive patches were generated by placing the center at the annotated ground truth can cropping the corresponding image patch. Negative patches were sampled far away from the ground truth location of the LV apex. A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero. The size of the layers of the trained deep neural are 1024-1024-300-100-2. The training was initialized with unsupervised pre-training and then refined using supervised back-propagation. Table 1 shows the 2D LV apex classification error of approximations of the deep neural network generated using weight sparsification performed by thresholding with different sparse factors” [col 11, lines 30-51].); and
the first input and the second input are stochastically dependent (Liu, “A stacked DAE deep neural network was trained for detecting the LV apex in 2D MR images. The noise fraction of the DAE is set to 50%, i.e., 50% of the input pixels were randomly set to zero” [col 11, lines 41-44]. “FIG. 10 illustrates a method for approximating a trained deep neural network using functional approximation to reduce the number of nodes in each layer according to an embodiment of the present invention. At step 1002, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1” [col 13, lines 23-30].  Examiner Note: Liu’s setting of random input pixels to zero makes the training input stochastically dependent upon the live input.).

As per claim 26, Liu teaches The computer-implemented method from claim 24, further comprising: storing the execution data in memory as stored execution data (Liu, “In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 51-56]. “The above-described methods for anatomical landmark detection and approximating a trained deep neural network may be implemented on a computer using well-known computer processors, memory units, storage devices, computer Software, and other components. A high-level block diagram of such a computer is illustrated in FIG. 12. Computer 1202 contains a processor 1204, which controls the overall operation of the ; and
priming the neural network for the conditional computation, prior to the conditional computation of the neural network, using the stored execution data (Liu, “For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero” [col 10, lines 37-56].).
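
For illustration only, the thresholding and constrained refinement described in the passages of Liu quoted above (steps 704 and 706) can be sketched as follows. This is a minimal sketch under stated assumptions, not Liu's implementation: a filter is modeled as a flat list of weights, the gradient is supplied by the caller, and the names `sparsify_by_threshold` and `masked_gradient_step` are hypothetical.

```python
def sparsify_by_threshold(weights, keep_fraction):
    """Step 704 analogue: keep the largest-magnitude weights, zero the rest.

    Returns the sparse weights and a boolean mask of retained positions.
    Ties at the threshold are all retained, so slightly more than
    keep_fraction of the weights may survive.
    """
    k = max(1, round(keep_fraction * len(weights)))
    threshold = sorted(abs(w) for w in weights)[-k]
    mask = [abs(w) >= threshold for w in weights]
    sparse = [w if keep else 0.0 for w, keep in zip(weights, mask)]
    return sparse, mask


def masked_gradient_step(weights, grads, mask, lr=0.1):
    """Step 706 analogue: refine retained weights; zeroed weights stay at zero."""
    return [
        (w - lr * g) if keep else 0.0
        for w, g, keep in zip(weights, grads, mask)
    ]


filter_weights = [0.5, -0.1, 0.05, -2.0]
sparse, mask = sparsify_by_threshold(filter_weights, keep_fraction=0.5)
refined = masked_gradient_step(sparse, grads=[1.0, 1.0, 1.0, 1.0], mask=mask)
```

The mask is what carries the "constrained to stay at zero" condition from the thresholding step into every later refinement step.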

As per claim 27, Liu teaches The computer-implemented method from claim 24, wherein: the deriving of the simplified version of the neural network includes down-sampling the directed graph by a sampling factor (Liu, “According to an advantageous .

As per claim 28, Liu teaches The computer-implemented method from claim 4, wherein: the deriving of the simplified version of the directed graph includes replacing a set of original values of the set of weights with a set of replacement values (Liu, “In one embodiment, weight sparsification can be used to calculate the approximation of the trained deep neural network. In another embodiment, function approximation can be used to reduce a number of nodes in each level of the trained deep neural network classifier. In another embodiment, 1-D Haar wavelet bases and wavelet coefficients can be used to reconstruct a weight matrix for a given node in .


As per claim 29, Liu teaches The computer-implemented method from claim 24, further comprising: generating a markup of the neural network using the collection of execution data (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero ;
wherein the markup identifies a priority value for a weight in the neural network (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For example, the number of non-zero weights in each filter can be reduced using thresholding or by enforcing L1-norm minimization in the back-propagation algorithm, as described in greater detail below. At step 706, the remaining non-zero weights of each filter in the approximated deep neural network resulting from step 704 are refined. In particular, a supervised training algorithm, such as supervised back-propagation can be used to re-train the approximated deep neural network based on the observed outcomes in the training data, by refining the remaining non-zero weights for each filter with the weights that were set to zero constrained to stay at zero. Step 706 can be repeated for a large number of iterations to refine the non-zero weights to achieve greater accuracy in the landmark detection. In a possible implementation for the method of FIG. 7, step 704 can be performed once to remove a large number (e.g., 90%) of non-zero weights in each filter (e.g., using thresholding or L1-norm minimization), and then step 706 can be performed and possibly repeated multiple times to refine the remaining non-zero weights in each filter, resulting in the final approximation of the trained deep neural network. In another possible implementation, steps 704 and 706 can be iterated to gradually reduce the number of non-zero weights for each filter to achieve a sparse set of weights. In this implementation, each iteration of step 704 can reduce a smaller number (e.g., 10%) of non-zero weights for each filter (using ; and
wherein the conditional computation uses the markup (Liu, “FIG. 7 illustrates a method of approximating a trained deep neural network using weight sparsification according to an embodiment of the present invention. At step 702, the deep neural network is trained. For example, the deep neural network can be trained using a stacked DAE in an unsupervised learning stage followed by a supervised learning stage as described above in connection with step 102 of FIG. 1. At step 704, a number of non-zero weights (coefficients) in each filter is reduced. The number of non-zero weights in each filter is reduced by setting a number of weights in the filter to zero and retaining other weights in the filter, such that a sparse set of weights is retained in each filter. For .

As per claim 30, Liu teaches The computer-implemented method from claim 29, wherein: the conditional computation of the neural network includes reducing an accuracy of a computation using the weight based on the priority value (Liu, “For example, an accuracy of the approximated deep neural network can be calculated using the training data after each iteration, and when the accuracy decreases by a certain amount, the method can be stopped and the approximated deep neural network resulting from the previous iteration can be used” [col 11, lines 7-12].).

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent No. US 9633306 B2 to Liu et al. (hereinafter, “Liu”), in view of U.S. Pub. No. US 20160162782 A1 to Park et al. (hereinafter, “Park”), and further in view of “Evaluation of Interpolation Effects on Upsampling and Accuracy of Cost Functions-Based Optimized Automatic Image Registration” to Mahmoudzadeh and Kashou (hereinafter, “Mahmoudzadeh”).


As per claim 7, Liu discloses down-sampling of the directed graph, but does not disclose down-sampling of the directed graph utilizing polynomial interpolation.

Mahmoudzadeh teaches The computer-implemented method from claim 6, wherein: the down-sampling of the directed graph utilizes polynomial interpolation (Mahmoudzadeh, 1.1.4, “B-spline interpolation uses weighted voxel values in a wider neighborhood compared to trilinear interpolation, but both the B-spline and trilinear kernels are symmetrical and separable. The place of the neighboring points as control points relates to B-spline interpolation and combines the intensity values at these places using a set of polynomial basis according to (5) [16].
Equation (5) shows k-order B-spline with n + 1 control points $(P_1, P_2, \ldots, P_n)$,
$$P(t) = \sum_{i=1}^{n+1} N_{i,k} P_i, \quad t_{\min} \le t < t_{\max}. \qquad (5)$$
In (5), $N_{i,k}$ are the polynomial functions of order k (degree k − 1), and n is the number of control points; k must be at least 2 (linear) and less than n + 1.
P(t) is validly defined for $t_{\min} \le t < t_{\max}$ where $t_{\min} = t_k$ and $t_{\max} = t_{n+2}$. A knot vector $(t_1, t_2, \ldots, t_{k+(n+1)})$ must be determined. This specifies the values of t at which the pieces of curve join, like knots joining bits of string. It is important to note that the degree of the weighting polynomial (the order of the curve) is not dependent on the number of control points, n [17]. The weighting polynomial can be recursively defined by the following equation [18]”. Examiner Note: When combined into Liu and Park’s neural network approximation system, Mahmoudzadeh’s polynomial interpolation method would result in the down-sampling of a directed graph utilizing polynomial interpolation.).
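
For illustration only, equation (5) of the quoted Mahmoudzadeh passage can be evaluated as sketched below. The quoted passage refers to a recursive definition of the weighting polynomial but does not reproduce it; this sketch assumes the standard Cox-de Boor recursion, uses 0-based indices, treats control points as scalars, and uses the hypothetical names `basis` and `bspline_point`.

```python
def basis(i, k, t, knots):
    """B-spline basis function of order k (degree k - 1), 0-based index i.

    Assumed form: the standard Cox-de Boor recursion. Terms with a zero
    denominator (repeated knots) are taken to be zero.
    """
    if k == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_den = knots[i + k - 1] - knots[i]
    right_den = knots[i + k] - knots[i + 1]
    left = 0.0
    if left_den != 0:
        left = (t - knots[i]) / left_den * basis(i, k - 1, t, knots)
    right = 0.0
    if right_den != 0:
        right = (knots[i + k] - t) / right_den * basis(i + 1, k - 1, t, knots)
    return left + right


def bspline_point(t, control_points, k, knots):
    """Evaluate P(t) as the basis-weighted sum of the control points."""
    return sum(basis(i, k, t, knots) * p for i, p in enumerate(control_points))


# Four control points, order k = 3 (quadratic), so the knot vector has 4 + 3 entries.
knots = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
k = 3
weights_at_t = [basis(i, k, 2.5, knots) for i in range(4)]
constant_curve = bspline_point(2.5, [5.0, 5.0, 5.0, 5.0], k, knots)
```

With this uniform knot vector the curve is defined for 2 ≤ t < 4, matching the range from the k-th knot to the (n + 2)-th knot stated in the quotation. Inside that range the basis values sum to one, so a constant set of control points reproduces the constant.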

Liu, Park, and Mahmoudzadeh are analogous art because they are directed towards enhanced data processing methods. Therefore, it would have been obvious to one of ordinary skill in the art .

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent No. US 9633306 B2 to Liu et al. (hereinafter, “Liu”), in view of U.S. Pub. No. US 20160162782 A1 to Park et al. (hereinafter, “Park”), and further in view of U.S. Pub. No. US 20180046894 A1 to Yao et al. (hereinafter, “Yao”).

As per claim 9, Liu and Park disclose the method of claim 8, but fail to teach wherein the replacing comprises one of: reducing a number of bits used to represent the set of original values to obtain the set of replacement values; and calculating the set of replacement values using a set of exponents of the set of original values.

Yao teaches wherein the replacing comprises one of: reducing a number of bits used to represent the set of original values to obtain the set of replacement values (Yao, “Using short fixed-point numbers instead of long floating-point numbers is efficient for implementations on the FPGA platform and can significantly reduce memory footprint and bandwidth requirements. A shorter bit width is always wanted, but it may lead to a severe accuracy loss. Though fixed-point numbers have been widely used in ANN accelerator designs, there is no comprehensive ; and
calculating the set of replacement values using a set of exponents of the set of original values (Yao, “For a fixed-point number, its value can be expressed as (9), where bw is the bit width of the number and f-l is the fractional length which can be negative.” [0098-0099]. Equation 9).
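
For illustration only, replacing floating-point values with short fixed-point values of bit width bw and fractional length fl, as described in the quoted Yao passages, can be sketched as follows. This is a minimal sketch under stated assumptions, not Yao's implementation: a signed two's-complement code range and round-to-nearest are assumed, and the name `to_fixed_point` is hypothetical.

```python
def to_fixed_point(values, bit_width, frac_length):
    """Quantize values to signed fixed-point codes and their represented values.

    Each code is an integer in [-2**(bit_width - 1), 2**(bit_width - 1) - 1]
    (assumed signed range); the represented value is code * 2**(-frac_length).
    frac_length may be negative, as the quoted passage allows.
    """
    scale = 2.0 ** frac_length
    q_min = -(2 ** (bit_width - 1))
    q_max = 2 ** (bit_width - 1) - 1
    codes = [max(q_min, min(q_max, round(v * scale))) for v in values]
    replacements = [c / scale for c in codes]
    return codes, replacements


original_values = [0.3, 100.0, -0.75]
codes, replacement_values = to_fixed_point(original_values, bit_width=8, frac_length=5)
```

In this sketch 0.3 is replaced by the nearest representable value, 100.0 saturates at the largest 8-bit code, and -0.75 is represented exactly, which shows the accuracy trade-off that a shorter bit width introduces.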

Liu, Park, and Yao are analogous art because they relate to neural network efficiency improvements. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Liu and Park’s neural network approximation method with Yao’s neural network compression. The modification would have been obvious to one of ordinary skill in the art because they would have been motivated to reduce the computational demand of neural networks, which can be accomplished by Yao’s quantization (Yao, abstract, [0002]).


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: US Pub. No. 20150106310 A1 to Birdwell and Schuman, and US Pub. No. 20180137406 A1 to Howard et al.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL G SMITH whose telephone number is (571)272-9730. The examiner can normally be reached on Monday-Friday from 9:00 A.M. to 5:00 P.M. EST. If attempts to reach 
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://portal.uspto.gov/external/portal. Should you have questions about access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (tollfree). 
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Respectfully Submitted,
/PAUL GORDON SMITH/
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126