Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
	This Office Action is in response to applicant’s amendment filed on July 18, 2022, under which claims 1-20 are pending and under consideration. 

Response to Arguments
Applicant’s amendments have overcome the previous drawings objections, the previous claim objections, and the previous § 112(b) rejection. Therefore, the previous objections and the previous § 112(b) rejection have been withdrawn. 
Furthermore, applicant’s arguments with respect to the § 101 subject-matter eligibility rejection have been fully considered and are deemed to be persuasive. Therefore, the § 101 rejection has been withdrawn. 
Applicant’s arguments directed to the remaining matters are not deemed to be persuasive, and are addressed below.
Claim Interpretation
	Applicant argues that the terms “schedule builder” and “static memory allocator” in claim 12 do not invoke means-plus-function claim interpretation under § 112(f). The examiner respectfully disagrees. The terms “builder” and “allocator” remain generic placeholders coupled with functional language, and are still not modified by sufficient structure, material, or acts for performing the claimed function. 
In particular, although the terms “builder” and “allocator” do not recite the term “means,” the use of the suffixes “-er” and “-or” in this situation indicates that “builder” and “allocator” are substitutes for “means for building” and “means for allocating,” respectively. Therefore, these terms are generic placeholders. Furthermore, the examiner submits that the terms “builder” and “allocator” are not “understood by persons of ordinary skill in the art to have a sufficiently definite meaning as the name for structure,” MPEP § 2181 (citing Williamson v. Citrix Online, LLC, 792 F.3d 1339, 1349 (Fed. Cir. 2015)), for purposes of avoiding § 112(f) interpretation. Therefore, these terms remain interpreted under § 112(f).
Prior Art Rejections – Independent Claim 1
Applicant’s arguments directed to the § 103 obviousness rejections have been fully considered but are not deemed to be persuasive. Therefore, the previous grounds of rejection have been maintained. 
In regards to claim 1, applicant argues that the cited references do not teach “performing an analysis of a deep neural network (DNN) computation graph for a DNN to identify one or more data structures created during training of the DNN.” In support of this position, applicant asserts that Yu, paragraphs [0029] and [0041] make no mention of a “data structure.” (Applicant’s response, page 20).
Applicant’s arguments are not persuasive. While it is true that Yu does not use the exact terminology of “data structure,” Yu nonetheless teaches a data structure as recited by the claim. As noted in the previous Office Action, paragraph [0081] of Yu teaches producing multiple data structures during the training of the neural network. This paragraph of Yu teaches: “The system then inserts one or more monitoring nodes into the computational graph. The monitoring nodes represent operations that, during the execution of the training computational graph,…for each performed iteration of each of the particular operations, stores the output of the particular operation represented by the node during the iteration for use in the gradient computations during the backward path.” That is, the output of the operation represented by a node corresponds to a “data structure.” Therefore, Yu teaches a “data structure,” along with other elements of the claim, as further detailed in the rejections below.
Next, applicant argues that Yu does not teach “selecting a data structure from the one or more data structures to be encoded during training of the DNN based on the analysis.” Applicant notes that Yu “makes no mention of selecting a data structure to be encoded,” and that “these portions of Yu do not mention encoding in any manner.” Applicant also notes that “monitoring nodes do not, however, define an encode function.” (Applicant’s response, page 21).
This argument does not overcome the rejection because Yu was not relied upon to teach the limitation of “encoding,” which is instead taught by Parashar. The Examiner notes that one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986). Here, since a combination of two references was relied upon to teach the limitation of “selecting a data structure from the one or more data structures to be encoded during training of the DNN based on the analysis,” applicant cannot overcome the rejection merely by arguing that Yu alone does not teach the entirety of this limitation.
  In regards to Parashar, applicant argues that Parashar “does not…provide any disclosure regarding creating a modified DNN computation graph by adding ‘at least one node defining an encode function…’.” (Applicant’s response, page 21, near bottom). This argument is not persuasive because the modification of the DNN computation graph is accounted for by Yu, rather than Parashar. Parashar was instead relied upon to teach the “encode” element. Since a combination of Yu and Parashar was used to teach the above limitation in question, the rejection cannot be overcome by arguing that Parashar alone does not teach the entirety of the limitation.
In general, applicant’s remarks focus on whether or not each cited reference individually teaches the entirety of the “selecting” and “creating” steps in claim 1. However, it is not necessary for these steps to be taught by any single reference, since each specific step may still be rendered obvious by a combination of references that teach individual elements within the steps. The fact that individual elements within a single step are taught by multiple references does not negate obviousness. “A person of ordinary skill in the art is also a person of ordinary creativity, not an automaton…In many cases a person of ordinary skill will be able to fit the teachings of multiple patents together like pieces of a puzzle.” MPEP § 2141(II)(C) (citing KSR Int'l Co. v. Teleflex Inc., 550 U.S. 398, 421, 82 USPQ2d 1385, 1397 (2007)).
Finally, applicant notes that the cited references do not teach each and every recitation of amended independent claim 1, even if combined in the manner suggested in the office action. However, as discussed above, Yu teaches the recitation of “data structure,” and the combination of Yu and Parashar, as further discussed in the rejections below, teaches the other limitations addressed in applicant’s remarks. Therefore, applicant’s arguments with respect to claim 1 are not deemed to be persuasive.
Prior Art Rejections – Other Claims 
In regards to dependent claim 2, applicant argues that Xie does not teach selecting a data structure and an encode function based upon layers in a layer pair of a DNN. (See applicant’s response, page 22). This argument is not persuasive because Xie teaches the limitation of “layers in a layer pair of the DNN” in § III, paragraph 3, which mentions “pairs of adjacent layers.” The selection of the data structure and the encode function is already taught by the combination of Yu and Parashar. Since the selection in Yu and Parashar is based on the neural network structure, as represented by the computational graph, neural network structural features, such as the pair of layers in Xie, would serve as a basis for the selection of the data structure and the encode function in the context of Yu and Parashar. Therefore, the combination of Yu, Parashar, and Xie renders obvious the features of claim 2. The examiner notes that the current claim language of “based upon” does not require any specific relationship between the selected elements and the layers. As such, the current claim language does not distinguish over the cited references.
In regards to dependent claim 3, applicant argues that Xie does not teach a ReLU layer and a pooling layer as recited by claim 3. (See applicant’s response, page 22). The Examiner respectfully disagrees. Xie § III teaches a convolutional neural network that includes multiple layers of convolution and pooling, and the use of the ReLU function between pairs of adjacent layers. See, e.g., paragraph 3 of § III in Xie, which mentions “maxpooling” operations for the first two layers and that “between pairs of adjacent layers, Rectified Linear Unit (ReLU) nonlinearity is applied,” respectively corresponding to pooling and ReLU layers. Note that the “ReLU” here refers to the activation function, and may be considered to be part of the first layer, while “maxpooling” refers to a pooling operation that is implemented in the second layer.
In regards to dependent claim 4, applicant argues Parashar does not teach a positive value map as encompassed by dependent claim 4. (See applicant’s response, page 22). The Examiner respectfully disagrees. As discussed in the rejection, Parashar, page 28, left column (top sentence), and page 29, left column (bottom sentence) teaches that the ReLU function is applied point-wise to each element in the output activation. Here, the output activation with the ReLU applied corresponds to a positive value map, since it is a map that includes positive values.
In regards to dependent claim 5, applicant argues that Xie does not teach the limitations of this claim. (See applicant’s response, page 23). The Examiner respectfully disagrees, and submits that these features are taught by Xie for the reasons given in the rejections below. In summary, Xie, § III teaches that the second layer receives an input feature map in the form of the convolutional output, and then uses maxpooling to output another feature map (an output feature map), as a function of the input. Accordingly, Xie teaches a mapping from the input to the output in the form of the convolutional layer.
In regards to dependent claim 6, applicant argues that Parashar does not teach the limitations of this claim. (See applicant’s response, page 23). This argument is not persuasive because the limitations further recited in dependent claim 6 are accounted for by a combination of Yu and Parashar. Therefore, whether Parashar alone teaches the entirety of the limitations is not dispositive. 
In regards to dependent claim 8 (which appears to be inadvertently referred to as dependent claim 7 in applicant’s response), applicant argues that Xie does not teach an output feature map generated by a ReLU layer. (See applicant’s response, page 23). The Examiner respectfully disagrees. Xie teaches the use of a ReLU function, which is understood to be the activation function. Therefore, the output of the ReLU function corresponds to an “output feature map generated by the ReLU layer.”
In regards to dependent claim 9 (which appears to be inadvertently referred to as dependent claim 8 in applicant’s response), applicant argues that Xie does not teach an input feature map consumed by the convolutional layer. (See applicant’s response, page 23). The Examiner respectfully disagrees. Xie teaches a convolutional layer that takes feature maps as the input and “convolves” (i.e., consumes) them for calculation. See Xie, § III, paragraph 3: “In the second layer, the sets of 32 feature maps are convolved again and then downsampled to 24 × 24 with maxpooling.”

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. 
Claims 12-15 invoke § 112(f). The limitations that invoke § 112(f) are: “schedule builder… to analyze… create…, and determine…” and “static memory allocator…to generate…” in claim 12.
In particular, although the terms “builder” and “allocator” do not recite the term “means,” the use of the suffixes “-er” and “-or” in this situation indicates that “builder” and “allocator” are substitutes for “means for building” and “means for allocating,” respectively. 
Furthermore, while the claim language no longer uses the original linking phrase “configured to,” the above limitations nonetheless convey the same functions as before. For example, the functions of “analyze…”, “create…”, and “determine…” still modify the generic placeholder term “builder.” See MPEP § 2181: “Typically, the claim limitation will use the linking word ‘for’ to associate ‘means’ or a generic placeholder with the function. However, other linking words may be used, such as ‘so that’ or ‘configured to’, provided it is clear that the claim element is reciting a function. In certain circumstances, it is also not necessary to use a linking word if other words used with ‘means’, or the generic placeholder, convey the function” (emphasis added).
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
	Although these claim limitations invoke 35 U.S.C. 112(f) by meeting the three-prong analysis, they both have sufficient structural meaning laid out in the specification. The “schedule builder” configured to analyze is disclosed as a hardware component configured for analysis (paragraph [0030]), and the “static memory allocator” is also disclosed as a hardware component that utilizes data gathered to generate an efficient memory allocation strategy (paragraph [0033]).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

1.	Claim 1 is rejected under 35 U.S.C. 103 as being unpatentable over Yu et al. (US 2017/0132513 A1) (hereinafter “Yu”) in view of Parashar et al., “SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks,” May 2017, ACM SIGARCH Computer Architecture News, Volume 45 Issue 2, pp 27–40 (hereinafter “Parashar”).
Regarding claim 1, Yu teaches a computer-implemented method, comprising:
performing an analysis of a deep neural network (DNN) computation graph for a DNN to identify one or more data structures created during training of the DNN; (In general, [0041] teaches that “the system obtains data representing the computational graph and augments the computational graph…for training the neural network.” The neural network represented by the computational graph is a “deep neural network,” as taught in [0029]: “the operations represented in the computational graph are neural network operations…Some neural networks are deep neural networks.” The “analysis” of the DNN computational graph is described in more detail in [0080]: “the system can analyze the computational graph to identify one or more control flow nodes in the computational graph that cause the particular operations represented by the one or more particular nodes in the computational graph to be performed multiple times.” In regards to the limitation of “…to identify one or more data structures,” Yu teaches producing multiple data structures from training the neural network, as further disclosed in [0081]: “The system then inserts one or more monitoring nodes into the computational graph. The monitoring nodes represent operations that, during the execution of the training computational graph,…for each performed iteration of each of the particular operations, stores the output of the particular operation represented by the node during the iteration for use in the gradient computations during the backward path.” That is, the output of the operation represented by a node corresponds to a “data structure.”)
selecting a data structure from the one or more data structures to be […] during training of the DNN based on the analysis; (As noted above, [0081] teaches that operations of the computational graph are iterated multiple times and that data structures are produced from training the neural network. Since [0081] teaches inserting monitoring nodes, wherein a node “stores the output of the particular operation represented by the node during the iteration for use in the gradient computations during the backward path” ([0081]), the cited reference teaches “selecting” a particular output (i.e., data structure) to be stored during training of the DNN. That is, the identification of an operational node that has a corresponding output, for purposes of inserting a monitoring node, constitutes selecting that output. This selection is “based on the analysis” since it uses the computational graph that was analyzed as described in [0080].)
creating a modified DNN computation graph by adding at least one node to the DNN computation graph, the at least one node defining an […] function for […] the selected data structure during a forward pass of the DNN while training the DNN; ([0081]: “The system then inserts one or more monitoring nodes into the computational graph,” i.e., creating a modified computation graph with the additional monitoring node. [0081] further teaches: “The monitoring nodes represent operations that, during the execution of the training computational graph, …for each performed iteration of each of the particular operations, stores the output of the particular operation represented by the node during the iteration for use in the gradient computations during the backward path.” The limitation of “during a forward pass of the DNN while training the DNN” is taught by the subsequent part of [0081]: “In other words, to reuse forward values in the backward propagation path, the example system detects, during the construction of the backpropagation path, the forward values that are needed in the backpropagation. For each forward value, the system introduces a stack and adds nodes…in the forward propagation path to save the forward values at each iteration to the stack.” That is, the forward propagation path corresponds to the “forward pass” in the instant claim limitation.) and 
causing the DNN to be trained using the modified DNN computation graph. ([0041]: “To train the neural network, the system obtains data representing the computational graph and augments the computational graph to generate a training computational graph for training the neural network using a machine learning training algorithm.” See also [0073]: “The system then trains the neural network using the machine learning training algorithm by executing the training computational graph (206).”)  
Yu does not appear to explicitly teach the limitation that the “one or more data structures” are “to be encoded” during the training of the DNN and the limitation that the at least one node defines “an encode function for encoding” the selected data structure. 
Parashar, in an analogous art, teaches the above limitations. Parashar teaches a method that “enables maintaining the sparse weights and activations in a compressed encoding” (see abstract). Therefore, Parashar is in the same field of endeavor as the claimed invention, namely machine learning. 
In particular, Parashar teaches one or more data structures “to be encoded” and “an encode function for encoding” a data structure (§ 3.2 “PT-IS-CP-sparse Dataflow” on page 32 discloses “encoding of the output activations” in reference to neural networks. Note that “activations” are described as “the output values of an individual layer that are passed as inputs to the next layer” (§1 ‘Introduction’ on page 28), and thus are a data structure. Therefore, encoding output activations is analogous to encoding the data structures of Yu.)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Yu with the teachings of Parashar to encode nodes of a deep neural network, particularly by modifying the method of Yu such that the “one or more data structures” are “to be encoded” during the training of the DNN and such that “the at least one node” defines “an encode function for encoding” the selected data structure. One of ordinary skill in the art would have been motivated to make this modification because encoding data can reduce the impact on memory. As suggested by Parashar (§2 “Motivation” on page 29), “Encoding the sparse weights and/or activations provides an architecture an opportunity to reduce the amount of data that must be moved throughout the memory hierarchy.” If less memory is used per node, then more training data can be used to train the deep neural network.
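For illustration only, the proposed combination can be pictured with a minimal Python sketch. The graph representation (a list of named operations), the helper names `encode_sparse`, `decode_sparse`, `insert_monitor`, and `run_forward`, and the index/value encoding are all assumptions made for this sketch. They are not taken from Yu or Parashar and do not represent applicant's implementation.

```python
# Illustrative sketch only; every name below is hypothetical.

def relu(xs):
    # Point-wise ReLU: negative values are clamped to zero.
    return [x if x > 0 else 0.0 for x in xs]

def encode_sparse(xs):
    # Assumed encoding: keep the length plus (index, value) pairs of non-zeros.
    return (len(xs), [(i, x) for i, x in enumerate(xs) if x != 0])

def decode_sparse(encoded):
    # Rebuild the dense list from the assumed encoding.
    length, pairs = encoded
    out = [0.0] * length
    for i, x in pairs:
        out[i] = x
    return out

def insert_monitor(graph, target, stack):
    # Return a modified graph with one extra node placed after `target`.
    def monitor(xs):
        stack.append(encode_sparse(xs))  # save the encoded forward value
        return xs                        # pass the value through unchanged
    modified = []
    for name, op in graph:
        modified.append((name, op))
        if name == target:
            modified.append((name + "_monitor", monitor))
    return modified

def run_forward(graph, xs):
    # Execute the operations in order (a forward pass over the toy graph).
    for _, op in graph:
        xs = op(xs)
    return xs

graph = [("scale", lambda xs: [2.0 * x for x in xs]), ("relu", relu)]
saved = []
modified_graph = insert_monitor(graph, "relu", saved)
output = run_forward(modified_graph, [-1.0, 0.5, -3.0, 2.0])
```

In this sketch the added node stores a compact form of the selected output during the forward pass, and `decode_sparse(saved[0])` recovers the dense values for later use.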

2. 	Claims 2-11 are rejected under 35 U.S.C. 103 as being unpatentable over Yu in view of Parashar, and further in view of Xie et al., “Resource-Constrained Implementation and Optimization of a Deep Neural Network for Vehicle Classification,” 29 Aug.-2 Sept. 2016, 24th European Signal Processing Conference (EUSIPCO) (hereinafter “Xie”).
	Regarding claim 2, the combination of Yu and Parashar teaches the computer-implemented method of claim 1, and teaches “wherein the selected data structure and the encode function are selected,” as noted in the rejection of claim 1, above. However, the combination of references does not appear to explicitly teach the further limitation that the selected data structure and the encode function are selected “based upon layers in a layer pair of the DNN.” 
Xie, in an analogous art, teaches the above limitation. Xie generally relates to “optimization of a deep neural network,” and is therefore in the same field of endeavor as the claimed invention, namely machine learning. In general, Xie teaches a convolutional neural network that includes multiple layers of convolution and pooling, and the use of the ReLU function between pairs of adjacent layers (see Xie, § III).
In particular, Xie teaches “based upon layers in a layer pair of the DNN” (§ III (“DNN Topology for Vehicle Classifier”), paragraph 3 discloses that layers in a deep neural network are “composed of 5 layers” and further discusses choosing “pairs of adjacent layers” on which to apply certain operations. It is noted that the selection of the data structure and the encode function is already taught by the combination of Yu and Parashar. Since the selection in Yu and Parashar is based on the neural network structure, as represented by the computational graph, neural network structural features, such as the pair of layers in Xie, would serve as a basis for the selection of the data structure and the encode function in the context of Yu and Parashar.).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Yu and Parashar with the teachings of Xie to specifically point out layers of a deep neural network and particularly by modifying Yu, as already modified thus-far, such that the selected data structure and function are selected based upon layers in a layer pair of the DNN. One of ordinary skill in the art would have been motivated to make this modification as suggested by Xie (§ III, paragraph 3: “Between pairs of adjacent layers, Rectified Linear Unit (ReLU) nonlinearity is applied”). Thus, it would be beneficial to identify layer pairs to apply certain functions/processes such as ReLU.
	
Regarding claim 3, the combination of Yu, Parashar, and Xie teaches the computer-implemented method of claim 2, as discussed above.
	Xie further teaches wherein a first layer of the layer pair comprises a rectified linear unit (ReLU) layer, and wherein a second layer of the layer pair comprises a pooling layer. (§III (“DNN Topology for Vehicle Classifier”), paragraph 3 teaches that a ReLU function is applied to the first layer (“Between pairs of adjacent layers, Rectified Linear Unit (ReLU) nonlinearity is applied”) and that a second layer utilizes a pooling method (“downsampled to 24 x 24 with maxpooling”). Note that the “ReLU” here refers to the activation function, and may be considered to be part of the first layer, while “maxpooling” refers to a pooling operation that is implemented in the second layer.). 

	Regarding claim 4, the combination of Yu, Parashar, and Xie teaches the computer-implemented method of claim 3, as discussed above.
Parashar further teaches wherein the selected data structure comprises a positive value map (PVM) indicating whether values […] were positive (Parashar teaches this limitation by addressing non-zero values. In §1 ‘Introduction’ on page 28, it is pointed out that a ReLU (rectified linear unit) function “clamps all negative activation values to zero.” See also page 29, left column, bottom sentence: “Specifically, the rectified linear unit (ReLU) function that is commonly used as the non-linear operator in CNNs forces all negatively valued activations to be clamped to zero. After completing computation of a convolutional layer, a ReLU function is applied point-wise to each element in the output activation matrices before the data is passed to the next layer.” Here, the output activation with the ReLU applied corresponds to a positive value map, since it is a map that includes positive values. When ReLU is given a negative input, it “clamps” the output to zero. When an output value is zero, it does not benefit the training of a neural network; it merely uses memory without any gain. It is common to remove these by way of pruning (§1 ‘Introduction’ on page 27), which is well known in the art. However, removing these nodes can potentially reduce accuracy of the neural network. This is why Parashar utilizes encoding and decoding for such values. §3.2 ‘PT-IS-CP-sparse Dataflow’ on page 32 discloses, “the key feature is that decoding sparse format ultimately yields a non-zero data value”.), and Xie further teaches that the values are in “an input feature map to the ReLU layer of the DNN” (§ III (“DNN Topology for Vehicle Classifier”), paragraph 3 teaches that the ReLU is applied to the feature maps between layers in the DNN).
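As a purely illustrative aid, the point-wise ReLU clamping quoted from Parashar can be sketched in Python alongside a per-element indicator of which inputs were positive. The function name `positive_value_map` and the sample values are hypothetical, and the sketch is not drawn from Parashar, Xie, or the application.

```python
# Illustrative sketch only; names and values are hypothetical.

def relu(xs):
    # Point-wise ReLU: every negative (or zero) input yields 0.0.
    return [x if x > 0 else 0.0 for x in xs]

def positive_value_map(xs):
    # One boolean per element: True where the ReLU input was positive.
    return [x > 0 for x in xs]

inputs = [-1.5, 0.0, 2.0, -0.25, 3.0]
activations = relu(inputs)
pvm = positive_value_map(inputs)
```

The non-zero entries of `activations` fall at exactly the positions where `pvm` is true.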

	Regarding claim 5, the combination of Yu, Parashar, and Xie teaches the computer-implemented method of claim 3, as discussed above.  
Xie further teaches wherein the selected data structure comprises a mapping between an output feature map generated by the pooling layer and an input feature map to the pooling layer. (§ III (“DNN Topology for Vehicle Classifier”), paragraph 3: “In the second layer, the sets of 32 feature maps are convolved again and then downsampled to 24 × 24 with maxpooling.” That is, Xie, § III teaches that the second layer receives an input feature map in the form of the convolutional output, and then uses maxpooling to output another feature map (an output feature map), as a function of the input. Accordingly, Xie teaches a mapping from the input to the output in the form of the convolutional layer.)
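For illustration only, the relationship between a pooling layer's input and output feature maps can be sketched in Python. The 2 × 2 window with stride 2, the function name `maxpool_2x2`, and the sample feature map are assumptions made for this sketch; Xie does not disclose this code, and it is not applicant's implementation.

```python
# Illustrative sketch only; window size, names, and values are assumptions.

def maxpool_2x2(fmap):
    # 2x2 max pooling with stride 2 over a list-of-lists feature map.
    # Returns the pooled map and, for each output position, the (row, col)
    # of the input element that supplied the maximum.
    rows, cols = len(fmap), len(fmap[0])
    pooled, mapping = [], []
    for r in range(0, rows - 1, 2):
        pooled_row, mapping_row = [], []
        for c in range(0, cols - 1, 2):
            window = [
                (fmap[r + dr][c + dc], (r + dr, c + dc))
                for dr in (0, 1)
                for dc in (0, 1)
            ]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            mapping_row.append(position)
        pooled.append(pooled_row)
        mapping.append(mapping_row)
    return pooled, mapping

feature_map = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 7.0, 2.0],
    [2.0, 6.0, 3.0, 1.0],
]
pooled, mapping = maxpool_2x2(feature_map)
```

Here `pooled` is the output feature map, and `mapping` records which input location each output value came from.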

	Regarding claim 6, the combination of Yu, Parashar, and Xie teaches the computer-implemented method of claim 2, as discussed above.
	Yu further teaches wherein creating the modified DNN computation graph further comprises adding at least one node defining a […] function for […] the selected data structure during a backward pass of the DNN while training the DNN (Yu teaches creating a modified neural network/DNN, and adding a node for computational purposes (paragraph [0069]) which holds gradient weights during a backward pass of the neural network/DNN (paragraph [0081])).
Parashar further teaches defining a “decode” function for “decoding” a data structure (§3.2 ‘PT-IS-CP-sparse Dataflow’ on page 32, “decoding” the format yields a data value indexing the “value in the weight or input activation matrices”).

Regarding claim 7, the combination of Yu, Parashar, and Xie teaches the computer-implemented method of claim 6, as discussed above.
Xie further teaches wherein a first layer of the layer pair comprises a rectified linear unit (ReLU) layer, and wherein a second layer of the layer pair comprises a convolution layer. (§ III (“DNN Topology for Vehicle Classifier”), paragraph 3 teaches that the DNN has a plurality of layers, at least one layer is a convolution layer and ReLU is used between adjacent layers. Note that the “ReLU” here refers to the activation function, and may be considered to be part of the first layer, while “maxpooling” refers to a pooling operation that is implemented in the second layer.).

Regarding claim 8, the combination of Yu, Parashar, and Xie teaches the computer-implemented method of claim 7, as discussed above.
Xie further teaches wherein the selected data structure comprises an output feature map generated by the ReLU layer. (§ III (“DNN Topology for Vehicle Classifier”), paragraph 3, teaches an output layer that takes the feature map and passes it through a ReLU function (§ III, paragraph 3: “Between pairs of adjacent layers, Rectified Linear Unit (ReLU) nonlinearity is applied.”). The ReLU function therefore generates another output, which corresponds to an “output feature map generated by the ReLU layer.”). 

Regarding claim 9, the combination of Yu, Parashar, and Xie teaches the computer-implemented method of claim 7, as discussed above.
Xie further teaches wherein the selected data structure comprises an input feature map consumed by the convolution layer. (§ III (“DNN Topology for Vehicle Classifier”) teaches that the second (convolution) layer takes feature maps as the input and “convolves” (i.e., consumes) them for calculation. See § III, paragraph 3: “In the second layer, the sets of 32 feature maps are convolved again and then downsampled to 24 × 24 with maxpooling.”)

Regarding claim 10, the combination of Yu and Parashar teaches the computer-implemented method of claim 1, as discussed above.
Parashar teaches wherein the encode function causes a precision […] to be reduced […] (§3.2 ‘PT-IS-CP-sparse Dataflow’ on page 32 discloses, “encoding the output activations” in reference to neural networks. As previously mentioned, encoding can reduce the impact on memory usage.) and Yu further teaches that this occurs “during training of the DNN” (Yu, paragraph [0041] teaches the training of the neural network or an instance of a deep neural network ([0029])).
The combination of Yu and Parashar does not explicitly teach the limitation of “wherein the selected data structure comprises an input feature map to a layer of the DNN” and the limitation that the reduced precision is of “the input feature map.”
Xie, in an analogous art, teaches the above limitation. Xie generally relates to “optimization of a deep neural network,” and is therefore in the same field of endeavor as the claimed invention, namely machine learning. In general, Xie teaches a convolutional neural network that includes multiple layers of convolution and pooling, and the use of the ReLU function between pairs of adjacent layers (see Xie, § III).
In particular, Xie teaches “wherein the selected data structure comprises an input feature map to a layer of the DNN” (§ III (“DNN Topology for Vehicle Classifier”), paragraph 3 teaches that the second layer of the DNN takes in a feature map produced by the first layer). Thus, Xie also teaches the limitation of “the input feature map” for the reduced precision.
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Yu and Parashar with the teachings of Xie by modifying Yu, as modified thus far, to use the encoding method on a feature map and to particularly include the feature that “the selected data structure comprises an input feature map to a layer of the DNN,” such that the reduced precision caused by the encode function is “of the input feature map.” One of ordinary skill in the art would have been motivated to make this modification because feature maps can hold a large amount of data (depending on the original input), as shown by Xie (§ III (“DNN Topology for Vehicle Classifier”), paragraph 3) where a single image is “convolved into 32 feature maps.” Applying an encode function to a feature map, or any data used, would reduce the amount of data a feature map would consume. As suggested by Parashar (§2 ‘Motivation’ on page 29), “Encoding the sparse weights and/or activations provides an architecture an opportunity to reduce the amount of data that must be moved throughout the memory hierarchy.”

Regarding claim 11, the combination of Yu and Parashar teaches the limitations with respect to claim 1 as outlined above.
Parashar further teaches wherein the encode function causes a precision […] to be reduced […] (§3.2  ‘PT-IS-CP-sparse Dataflow’ on page 32 discloses, “encoding the output activations” in reference to neural networks. As previously mentioned, encoding can reduce the impact on memory usage.) and Yu further teaches this occurs “during training of the DNN” (Yu, paragraph [0041] teaches the training of the neural network or an instance of a deep neural network ([0029])).
The combination of Yu and Parashar does not appear to explicitly teach the limitation of “wherein the selected data structure comprises an output feature map generated by a layer of the DNN” and that the precision reduced by the encode function is of “the output feature map.”
Xie, in an analogous art, teaches the above limitation. Xie generally relates to “optimization of a deep neural network,” and is therefore in the same field of endeavor as the claimed invention, namely machine learning. In general, Xie teaches a convolutional neural network that includes multiple layers of convolution and pooling, and the use of the ReLU function between pairs of adjacent layers (see Xie, § III).
In particular, Xie teaches “wherein the selected data structure comprises an output feature map generated by a layer of the DNN” and the map to be reduced is “the output feature map” (§ III (“DNN Topology for Vehicle Classifier”), paragraph 3 teaches a data structure in which there are 5 layers of a DNN. The 1st and 2nd layers each output a feature map).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Yu and Parashar with the teachings of Xie by modifying Yu, as already modified thus far, to use the encoding method on a feature map and particularly such that “the selected data structure comprises an output feature map generated by a layer of the DNN” and that the encode function reduces the precision of “the output feature map.” One of ordinary skill in the art would have been motivated to make this modification because feature maps can hold a large amount of data (depending on the original input), as shown by Xie (§ III “DNN Topology for Vehicle Classifier”) where a single image is “convolved into 32 feature maps”. Applying an encode function to a feature map, or any data used, would reduce the amount of data a feature map would consume. As suggested by Parashar (§2 ‘Motivation’ on page 29), “Encoding the sparse weights and/or activations provides an architecture an opportunity to reduce the amount of data that must be moved throughout the memory hierarchy.”

3.	Claims 12-20 are rejected under 35 U.S.C. 103 as being unpatentable over Yu in view of Parashar, Xie, and Sekiyama et al. (US 2019/0303025 A1) (hereinafter “Sekiyama”).
Regarding claim 12, Yu teaches a computing device, comprising: ([0087], teaching examples of computing devices)
	one or more processors; and ([0087]: “any appropriate type of computing device…that includes one or more processors and computer readable media”)
	at least one computer storage media having computer-executable instructions stored thereupon which, when executed by the one or more processors, will cause the computing device to: ([0087]: “computer readable media”; [0089]: “Computers suitable for the execution of a computer program include”)
execute a schedule builder prior to training a DNN to (As shown in the specification of the claimed invention (paragraph [0030]), a schedule builder is hardware that is capable of performing an analysis of a deep neural network. Yu shows that the neural network can be trained using such hardware in paragraphs [0087] and [0089]. Furthermore, the neural network can be a “deep neural network,” as taught in [0029] (“…deep neural networks.”)).
analyze a deep neural network (DNN) computation graph for the DNN to select a data structure […] during the training of the DNN […] (In general, [0041] teaches that “the system obtains data representing the computational graph and augments the computational graph…for training the neural network.” The analyzing is further described in [0080]: “the system can analyze the computational graph to identify one or more control flow nodes in the computational graph that cause the particular operations represented by the one or more particular nodes in the computational graph to be performed multiple times.” With regard to the limitation of “…to select a data structure,” Yu teaches producing multiple data structures from training the neural network, as further disclosed in [0081]: “The system then inserts one or more monitoring nodes into the computational graph. The monitoring nodes represent operations that, during the execution of the training computational graph,…for each performed iteration of each of the particular operations, stores the output of the particular operation represented by the node during the iteration for use in the gradient computations during the backward path.” That is, the output of the operation represented by a node corresponds to a “data structure.” The identification of an operational node that has a corresponding output, for purposes of inserting a monitoring node, constitutes selecting that output. With regard to a functionality to be performed on the data structure “during the training of the DNN,” the above part of [0081] teaches the storage of the outputs during training).
	create a modified DNN computation graph by adding at least one […] function for […] the selected data structure during a forward training pass of the DNN, and ([0081]: “The system then inserts one or more monitoring nodes into the computational graph,” i.e., creating a modified computation graph with the additional monitoring node. [0081] further teaches: “The monitoring nodes represent operations that, during the execution of the training computational graph, …for each performed iteration of each of the particular operations, stores the output of the particular operation represented by the node during the iteration for use in the gradient computations during the backward path.” The limitation of “during a forward training pass” is taught by the subsequent part of [0081]: “In other words, to reuse forward values in the backward propagation path, the example system detects, during the construction of the backpropagation path, the forward values that are needed in the backpropagation. For each forward value, the system introduces a stack and adds nodes…in the forward propagation path to save the forward values at each iteration to the stack.” That is, the forward propagation path corresponds to the “forward pass” in the instant claim limitation.)
train the DNN using the modified DNN computation graph, […] ([0041]: “To train the neural network, the system obtains data representing the computational graph and augments the computational graph to generate a training computational graph for training the neural network using a machine learning training algorithm.” See also [0073]: “The system then trains the neural network using the machine learning training algorithm by executing the training computational graph (206).”)  
	Yu does not appear to explicitly teach: 
(1) 	The limitations that the data structure is “to be encoded” during the training and the added functions including an “encode function for encoding” the selected data structure;
(2)	The limitation that the data structure is selected “based upon layers in a layer pair of the DNN”;
(3)	the schedule builder is further executed to “determine a lifetime of the selected data structure during the training of the DNN”; 
(4)	“execute a static memory allocator prior to training the DNN to generate a memory allocation strategy based upon the lifetime of the selected data structure”; and
(5)	“wherein the memory allocation strategy is utilized during the training of the DNN to allocate and deallocate memory for storing the selected data structure.” 
Parashar, in an analogous art, teaches the limitation of “to be encoded” and an “encode” function for “encoding” the selected data structure. Parashar teaches a method that “enables maintaining the sparse weights and activations in a compressed encoding” (see abstract). Therefore, Parashar is in the same field of endeavor as the claimed invention, namely machine learning. 
In particular, Parashar teaches a data structure “to be encoded” and adding at least one “encode function for encoding” a data structure (§ 3.2 “PT-IS-CP-sparse Dataflow” on page 32 discloses “encoding of the output activations” in reference to neural networks. Note that “activations” are described as “the output values of an individual layer that are passed as inputs to the next layer” (§1 ‘Introduction’ on page 28), and thus are a data structure. Therefore, encoding output activations is analogous to encoding the data structures of Yu.)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Yu with the teachings of Parashar to encode nodes of a deep neural network, particularly by modifying Yu such that the data structure is “to be encoded” during the training and the added functions include an “encode function for encoding” the selected data structure. One of ordinary skill in the art would have been motivated to make this modification because encoding data can reduce the impact on memory. As suggested by Parashar (§2 (“Motivation”) on page 29), “Encoding the sparse weights and/or activations provides an architecture an opportunity to reduce the amount of data that must be moved throughout the memory hierarchy.” Thus, if less memory is used per node, then more training data can be stored to train the deep neural network.
	The combination of Yu and Parashar does not appear to explicitly teach the remaining limitations (2) through (5) listed above.
	Xie, in an analogous art, teaches the limitation that the data structure is selected “based upon layers in a layer pair of the DNN.” Xie generally relates to “optimization of a deep neural network,” and is therefore in the same field of endeavor as the claimed invention, namely machine learning. In general, Xie teaches a convolutional neural network that includes multiple layers of convolution and pooling, and the use of the ReLU function between pairs of adjacent layers (see Xie, § III).
	In particular, Xie teaches “based upon layers in a layer pair of the DNN” (§ III (“DNN Topology for Vehicle Classifier”), paragraph 3 discloses that layers in a deep neural network are “composed of 5 layers” and further discusses choosing “pairs of adjacent layers” on which to apply certain operations. It is noted that the selection of the data structure and the encode function is already taught by the combination of Yu and Parashar. Since the selection in Yu and Parashar is based on the neural network structure, as represented by the computational graph, neural network structural features, such as the pair of layers in Xie, would serve as a basis for the selection of the data structure and the encode function in the context of Yu and Parashar.).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Yu and Parashar with the teachings of Xie to specifically point out layers of a deep neural network, and particularly by modifying Yu, as already modified thus far, such that the data structure is selected “based upon layers in a layer pair of the DNN.” One of ordinary skill in the art would have been motivated to make this modification as suggested by Xie (§ III (“DNN Topology for Vehicle Classifier”), paragraph 3: “Between pairs of adjacent layers, Rectified Linear Unit (ReLU) nonlinearity is applied”). Thus, it would be beneficial to identify layer pairs to apply certain functions/processes such as ReLU.
	The combination of Yu, Parashar, and Xie does not appear to explicitly teach the remaining limitations (3) through (5) listed above. 
Sekiyama, in an analogous art, teaches the remaining limitations. Sekiyama generally relates to “memory reduction for neural networks with fixed structures” (see title) and is therefore in the same field of endeavor as the claimed invention, namely machine learning. Moreover, Sekiyama addresses the problem of “reducing consumption of a memory in a propagation process for a neural network” (abstract).
In particular, Sekiyama teaches “determine a lifetime of the selected data structure during the training of the DNN” (Abstract: “The method collects, in a NN training iteration, information for each node relating to an allocation, size, and lifetime thereof.” [0057]: “At block 430, corresponding to the subsequent training iteration and/or any training iteration(s) thereafter, for the i-th allocation request, return P[i] and O[i], and reallocate memory for the second and subsequent iterations using P[i] and O[i] so that memory pieces with overlapping lifetimes can be shared by multiple nodes.” It is noted that the lifetime of a node is analogous to the lifetime of a data structure, since the memory is used to store data associated with the node. See [0060]: “the [memory] pieces can be used by nodes…The nodes can correspond to nodes of the layers of the neural network.” See also [0031] and [0005] (“for a propagation process for a deep neural network having fixed structures for computation order and node data dependency”). See also [0016]: “the term ‘fixed structures’ refers to fixed structures of computation order and node data dependency for forward propagation and back propagation processes.”).
Sekiyama further teaches “execute a static memory allocator prior to training the DNN” ([0075]: “implemented by special purpose hardware-based systems”) configured to “generate a memory allocation strategy based upon the lifetime of the selected data structure” (As shown in the specification of the claimed invention, paragraph [0033], a static memory allocator is hardware that can utilize data to create a memory allocation strategy. Sekiyama shows training a neural network/DNN (paragraph [0057]), and creating a memory allocation strategy. See [0017]: “using dynamic profiling results for scheduling memory allocation. The memory allocation scheduling can be for any of forward propagation and back propagation in the neural network.” See also paragraphs [0031]-[0045] and throughout.).
Sekiyama further teaches “wherein the memory allocation strategy is utilized during the training of the DNN to allocate and deallocate memory for storing the selected data structure” ([0057]: “At block 430, corresponding to the subsequent training iteration and/or any training iteration(s) thereafter, for the i-th allocation request, return P[i] and O[i], and reallocate memory for the second and subsequent iterations using P[i] and O[i] so that memory pieces with overlapping lifetimes can be shared by multiple nodes.” That is, [0057] teaches allocation and reallocation of memory while training.).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Yu, Parashar, and Xie with the teachings of Sekiyama to create a more efficient memory strategy, particularly by modifying Yu, as already modified thus far, such that the schedule builder is further executed to “determine a lifetime of the selected data structure during the training of the DNN”; the operations executed by the computing device further include “execute a static memory allocator prior to training the DNN to generate a memory allocation strategy based upon the lifetime of the selected data structure”; and the operation of training the DNN is performed such that “the memory allocation strategy is utilized during the training of the DNN to allocate and deallocate memory for storing the selected data structure.” One of ordinary skill in the art would have been motivated to make this modification because manipulating memory can be very beneficial for reducing a memory footprint, which is the goal of Sekiyama: “reducing a consumption of a memory used for a propagation process for a deep neural network” (paragraph [0003]). If the amount of consumption for each node is the same across the network, regardless of how much it actually needs, then neural networks would become too inefficient to use for certain processes. Utilizing the memory allocation strategy distributes memory as needed. As suggested by Sekiyama, “The memory includes a plurality of memory segments for allocating to a plurality of nodes” (paragraph [0005]).
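For illustration only, the general idea of generating a memory allocation strategy from data-structure lifetimes before training can be sketched as follows. This is a simple greedy slot-reuse sketch under assumed conventions (lifetimes expressed as inclusive first and last training steps, and hypothetical names such as `plan_slots` and `fmap_a`); it is not asserted to be the method of Sekiyama or of the claimed invention.

```python
def plan_slots(lifetimes):
    """Assign each data structure to a memory slot ahead of training.

    lifetimes maps a name to (first_step, last_step), both inclusive.
    Two data structures share a slot only if their lifetimes do not overlap.
    """
    slot_busy_until = []  # slot_busy_until[i] is the last step at which slot i is in use
    plan = {}
    # Process data structures in order of the step at which they are first needed.
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for slot, busy_until in enumerate(slot_busy_until):
            if busy_until < start:
                # The previous occupant is no longer live, so the slot can be reused.
                slot_busy_until[slot] = end
                plan[name] = slot
                break
        else:
            # No existing slot is free at this step, so a new one is allocated.
            slot_busy_until.append(end)
            plan[name] = len(slot_busy_until) - 1
    return plan


lifetimes = {
    "fmap_a": (0, 2),
    "fmap_b": (1, 3),
    "fmap_c": (3, 5),
    "fmap_d": (4, 6),
}
plan = plan_slots(lifetimes)
```

In this example four data structures fit in two slots, because `fmap_c` and `fmap_d` become live only after `fmap_a` and `fmap_b`, respectively, are no longer needed.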

Regarding claim 13, the combination of Yu, Parashar, Xie, and Sekiyama teaches the computing device of claim 12, as discussed above.
Xie further teaches wherein a first layer of the layer pair comprises a rectified linear unit (ReLU) layer, and wherein a second layer of the layer pair comprises a pooling layer (§III (“DNN Topology for Vehicle Classifier”), paragraph 3 teaches that a ReLU function is applied to the first layer (“Between pairs of adjacent layers, Rectified Linear Unit (ReLU) nonlinearity is applied”) and that a second layer utilizes a pooling method (“downsampled to 24 x 24 with maxpooling”). Note that the “ReLU” here refers to the activation function, and may be considered to be part of the first layer, while “maxpooling” refers to a pooling operation that is implemented in the second layer.).

Regarding claim 14, the combination of Yu, Parashar, Xie, and Sekiyama teaches the computing device of claim 12, as discussed above.
Xie further teaches wherein a first layer of the layer pair comprises a rectified linear unit (ReLU) layer, and wherein a second layer of the layer pair comprises a convolution layer (§III (“DNN Topology for Vehicle Classifier”) (see paragraph 3), which teaches a DNN having a plurality of layers, at least one of which is a convolution layer. Furthermore ReLU is used between adjacent layers: “The DNN architecture (Figure 1) is composed of five layers — two convolutional layers, followed by two dense layers, and finally, the classifier layer… Between pairs of adjacent layers, Rectified Linear Unit (ReLU) nonlinearity is applied”).

Regarding claim 15, the combination of Yu, Parashar, Xie, and Sekiyama teaches the computing device of claim 12, as discussed above.
Xie further teaches wherein the selected data structure comprises an input feature map to a layer of the DNN, […] of an input feature map or an output feature map […] (§ III ‘DNN Topology for Vehicle Classifier’ (see Paragraph 3), the first layer outputs a feature map and the second layer of the DNN uses it as an input feature map). 
Parashar further teaches “wherein the encode function causes a precision” of an input feature map or an output feature map “to be reduced” (Parashar, §3.2  ‘PT-IS-CP-sparse Dataflow’ on page 32, discloses “encoding the output activations” in reference to neural networks. As previously mentioned, encoding can reduce the impact on memory usage). 
Yu further teaches “during training of the DNN” (Paragraph [0041], training a neural network, or a deep neural network ([0029])).

	Regarding claim 16, Yu teaches A computer storage media having computer-executable instructions stored thereupon which, when executed by one or more processors of a computing device, will cause a computing device to: (See [0084] in general, which teaches a “computer storage medium.” See also [0087]: “any appropriate type of computing device…that includes one or more processors and computer readable media”; [0089]: “Computers suitable for the execution of a computer program include”).
	analyze a deep neural network (DNN) to select a data structure […] during training of the DNN […] (In general, [0041] teaches that “the system obtains data representing the computational graph and augments the computational graph…for training the neural network.” The computational graph represents a neural network, which is a “deep neural network,” as taught in [0029]. Thus, analyzing the computation graph constitutes analyzing the DNN. The analyzing is further described in [0080]: “the system can analyze the computational graph to identify one or more control flow nodes in the computational graph that cause the particular operations represented by the one or more particular nodes in the computational graph to be performed multiple times.” With regard to the limitation of “to select a data structure,” Yu teaches producing multiple data structures from training the neural network, as further disclosed in [0081]: “The system then inserts one or more monitoring nodes into the computational graph. The monitoring nodes represent operations that, during the execution of the training computational graph,…for each performed iteration of each of the particular operations, stores the output of the particular operation represented by the node during the iteration for use in the gradient computations during the backward path.” That is, the output of the operation represented by a node corresponds to a “data structure.” The identification of an operational node that has a corresponding output, for purposes of inserting a monitoring node, constitutes selecting that output. With regard to a functionality to be performed on the data structure “during the training of the DNN,” the above part of [0081] teaches the storage of the outputs during training).
	create a modified DNN by adding at least one […] function to the DNN for […] the selected data structure during a forward training pass ([0081]: “The system then inserts one or more monitoring nodes into the computational graph,” i.e., creating a modified computation graph with the additional monitoring node. [0081] further teaches: “The monitoring nodes represent operations that, during the execution of the training computational graph, monitor a number of iterations of the particular operations that are performed, and for each performed iteration of each of the particular operations, stores the output of the particular operation represented by the node during the iteration for use in the gradient computations during the backward path.” Note that the monitoring and storing are functions added to the DNN, since they pertain to the computation of the neural network. The limitation of “during a forward training pass” is taught by the subsequent part of [0081]: “In other words, to reuse forward values in the backward propagation path, the example system detects, during the construction of the backpropagation path, the forward values that are needed in the backpropagation. For each forward value, the system introduces a stack and adds nodes…in the forward propagation path to save the forward values at each iteration to the stack.” That is, the forward propagation path corresponds to the “forward pass” in the instant claim limitation.)
	Yu does not appear to explicitly teach: 
(1) 	The limitations that the data structure is “to be encoded” during the training and the added functions including an “encode function for encoding” the selected data structure;
(2)	The limitation that the data structure is selected “based upon layers in a layer pair of the DNN”;
(3)	The operations of “determine a lifetime of the selected data structure during training of the modified DNN”; “generate a memory allocation strategy based upon the lifetime of the selected data structure”; and “cause the modified DNN to be trained using the memory allocation strategy.” 
Parashar, in an analogous art, teaches the limitation of “to be encoded” and an “encode” function for “encoding” the selected data structure. Parashar teaches a method that “enables maintaining the sparse weights and activations in a compressed encoding” (see abstract). Therefore, Parashar is in the same field of endeavor as the claimed invention, namely machine learning. 
Specifically, Parashar teaches the limitations of “to be encoded” and of adding at least one “encode function” to the DNN “for encoding” the selected data structure (§3.2 ‘PT-IS-CP-sparse Dataflow’ on page 32, “encoding the output activations” in reference to neural networks. Input/output activations are weights or values from a node (§1 ‘Introduction’ on page 28)).
In particular, Parashar teaches a data structure “to be encoded” and adding at least one “encode function for encoding” a data structure (§ 3.2 “PT-IS-CP-sparse Dataflow” on page 32 discloses “encoding of the output activations” in reference to neural networks. Note that “activations” are described as “the output values of an individual layer that are passed as inputs to the next layer” (§1 ‘Introduction’ on page 28), and thus are a data structure. Therefore, encoding output activations is analogous to encoding the data structures of Yu.)
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Yu with the teachings of Parashar to encode nodes of a deep neural network, particularly by modifying Yu such that the data structure is “to be encoded” during the training and the added functions include an “encode function for encoding” the selected data structure. One of ordinary skill in the art would have been motivated to make this modification because encoding data can reduce the impact on memory. As suggested by Parashar (§2 (“Motivation”) on page 29), “Encoding the sparse weights and/or activations provides an architecture an opportunity to reduce the amount of data that must be moved throughout the memory hierarchy.” Thus, if less memory is used per node, then more training data can be stored to train the deep neural network.
The combination of Yu and Parashar does not appear to explicitly teach the remaining limitations (2) and (3) listed above.
Xie, in an analogous art, teaches the limitation that the data structure is selected “based upon layers in a layer pair of the DNN.” Xie generally relates to “optimization of a deep neural network,” and is therefore in the same field of endeavor as the claimed invention, namely machine learning. In general, Xie teaches a convolutional neural network that includes multiple layers of convolution and pooling, and the use of the ReLU function between pairs of adjacent layers (see Xie, § III).
	In particular, Xie teaches selection “based upon layers in a layer pair of the DNN” (§ III (“DNN Topology for Vehicle Classifier”), paragraph 3 discloses that layers in a deep neural network are “composed of 5 layers” and further discusses choosing “pairs of adjacent layers” on which to apply certain operations. It is noted that the selection of the data structure and the encode function is already taught by the combination of Yu and Parashar. Since the selection in Yu and Parashar is based on the neural network structure, as represented by the computational graph, neural network structural features, such as the pair of layers in Xie, would serve as a basis for the selection of the data structure and the encode function in the context of Yu and Parashar.).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Yu and Parashar with the teachings of Xie to specifically point out layers of a deep neural network, and particularly by modifying Yu, as already modified thus far, such that the data structure is selected “based upon layers in a layer pair of the DNN.” One of ordinary skill in the art would have been motivated to make this modification as suggested by Xie (§ III (“DNN Topology for Vehicle Classifier”), paragraph 3: “Between pairs of adjacent layers, Rectified Linear Unit (ReLU) nonlinearity is applied”). Thus, it would be beneficial to identify layer pairs to apply certain functions/processes such as ReLU.
Sekiyama, in an analogous art, teaches the remaining limitations. Sekiyama generally relates to “memory reduction for neural networks with fixed structures” (see title) and is therefore in the same field of endeavor as the claimed invention, namely machine learning. Moreover, Sekiyama addresses the problem of “reducing consumption of a memory in a propagation process for a neural network” (abstract).
	In particular, Sekiyama teaches “determine a lifetime of the selected data structure during training of the modified DNN” (Abstract: “The method collects, in a NN training iteration, information for each node relating to an allocation, size, and lifetime thereof.” [0057]: “At block 430, corresponding to the subsequent training iteration and/or any training iteration(s) thereafter, for the i-th allocation request, return P[i] and O[i], and reallocate memory for the second and subsequent iterations using P[i] and O[i] so that memory pieces with overlapping lifetimes can be shared by multiple nodes.” It is noted that the lifetime of a node is analogous to the lifetime of a data structure, since the memory is used to store data associated with the node. See [0060]: “the [memory] pieces can be used by nodes…The nodes can correspond to nodes of the layers of the neural network.” See also [0031] and [0005] (“for a propagation process for a deep neural network having fixed structures for computation order and node data dependency”). See also [0016]: “the term ‘fixed structures’ refers to fixed structures of computation order and node data dependency for forward propagation and back propagation processes.”), “generate a memory allocation strategy based upon the lifetime of the selected data structure” (As shown in the specification of the claimed invention, paragraph [0033], a static memory allocator is hardware that can utilize data to create a memory allocation strategy. Sekiyama shows training a neural network/DNN (paragraph [0057]), and creating a memory allocation strategy. See [0017]: “using dynamic profiling results for scheduling memory allocation. The memory allocation scheduling can be for any of forward propagation and back propagation in the neural network.” See also paragraphs [0031]-[0045] and throughout.); and “cause the modified DNN to be trained using the memory allocation strategy” (Sekiyama shows training a neural network/DNN using an allocation strategy (paragraph [0057]: “At block 430, corresponding to the subsequent training iteration and/or any training iteration(s) thereafter, for the i-th allocation request, return P[i] and O[i], and reallocate memory for the second and subsequent iterations using P[i] and O[i] so that memory pieces with overlapping lifetimes can be shared by multiple nodes.”)).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the teachings of Yu, Parashar, and Xie with the teachings of Sekiyama to create a more efficient memory strategy, particularly by modifying Yu, as already modified thus far, to include the further operations of “determine a lifetime of the selected data structure during training of the modified DNN”; “generate a memory allocation strategy based upon the lifetime of the selected data structure”; and “cause the modified DNN to be trained using the memory allocation strategy.” One of ordinary skill in the art would have been motivated to make this modification because managing memory allocation can reduce the memory footprint, which is the goal of Sekiyama: “reducing a consumption of a memory used for a propagation process for a deep neural network” (paragraph [0003]). If the amount of memory consumed by each node were the same across the network, regardless of how much each node actually needs, then neural networks would become too inefficient to use for certain processes. Utilizing the memory allocation strategy distributes memory as needed. As suggested by Sekiyama et al., “The memory includes a plurality of memory segments for allocating to a plurality of nodes” (paragraph [0005]).

	Regarding claim 17, the combination of Yu, Parashar, Xie, and Sekiyama teaches the computer storage media of claim 16, as discussed above.
	Xie further teaches wherein a first layer of the layer pair comprises a rectified linear unit (ReLU) layer, and wherein a second layer of the layer pair comprises a pooling layer. (§ III (“DNN Topology for Vehicle Classifier”), paragraph 3 teaches that a ReLU function is applied to the first layer (“Between pairs of adjacent layers, Rectified Linear Unit (ReLU) nonlinearity is applied”) and that a second layer utilizes a pooling method (“downsampled to 24 x 24 with maxpooling”). Note that the “ReLU” here refers to the activation function, and may be considered to be part of the first layer, while “maxpooling” refers to a pooling operation that is implemented in the second layer.).

	Regarding claim 18, the combination of Yu, Parashar, Xie, and Sekiyama teaches the computer storage media of claim 16, as discussed above.
	Xie further teaches wherein a first layer of the layer pair comprises a rectified linear unit (ReLU) layer, and wherein a second layer of the layer pair comprises a convolution layer. (See § III (“DNN Topology for Vehicle Classifier”), paragraph 3, which teaches a DNN having a plurality of layers, at least one of which is a convolution layer. Furthermore, ReLU is used between adjacent layers: “The DNN architecture (Figure 1) is composed of five layers — two convolutional layers, followed by two dense layers, and finally, the classifier layer… Between pairs of adjacent layers, Rectified Linear Unit (ReLU) nonlinearity is applied”).

	Regarding claim 19, the combination of Yu, Parashar, Xie, and Sekiyama teaches the computer storage media of claim 16, as discussed above.
Xie further teaches wherein the selected data structure comprises an input feature map to a layer of the DNN, […] of an input feature map or an output feature map […] (§ III ‘DNN Topology for Vehicle Classifier’ (see Paragraph 3), the first layer outputs a feature map and the second layer of the DNN uses it as an input feature map). 
Parashar further teaches “wherein the encode function causes a precision” of an input feature map or an output feature map “to be reduced” (Parashar, §3.2  ‘PT-IS-CP-sparse Dataflow’ on page 32, discloses “encoding the output activations” in reference to neural networks. As previously mentioned, encoding can reduce the impact on memory usage). 
Yu further teaches “during training of the DNN” (Paragraph [0041], training a neural network, or a deep neural network ([0029])).

	Regarding claim 20, the combination of Yu, Parashar, Xie, and Sekiyama teaches the computer storage media of claim 16, as discussed above.
	Yu further teaches wherein creating the modified DNN further comprises adding at least one […] function for […] the selected data structure during a backward training pass (Yu teaches creating a modified neural network/DNN, and adding a node for computational purposes (paragraph [0069]) which holds gradient weights during a backward pass of the neural network/DNN (paragraph [0081])).
Parashar further teaches a “decode” function for “decoding” a data structure (§3.2 ‘PT-IS-CP-sparse Dataflow’ on page 32, “decoding” the format yields a data value indexing the “value in the weight or input activation matrices”).


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The following documents depict the state of the art.
Rhu et al., “vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design,” arXiv:1602.08124v3 [cs.DC] 28 Jul 2016 teaches memory allocation under the consideration that each layer’s feature maps are later reused during its own backward propagation pass. 
	Singh et al. (US20190197420A1) teaches the compression of feature maps (see [0038]).

THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YAO DAVID HUANG whose telephone number is (571)270-1764. The examiner can normally be reached Monday - Friday 9:00 am - 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/Y.D.H./Examiner, Art Unit 2124

/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124