DETAILED ACTION
Notice of Pre-AIA or AIA Status
1. 	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of Claims
2. Claims 1-20 are pending. Claims 1, 11, and 16 are in independent form.
Priority
3. 	Foreign priority has been claimed to JP application #2019-031923 filed on 02/25/2019. 
Information Disclosure Statement
4. 	No information disclosure statement (IDS) has been submitted in this application.
Drawings
5. 	The drawings filed on 02/24/2020 are accepted. 

Claim Rejections - 35 USC § 103
6.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

7.	Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Yu et al. US Patent Application Publication No. 2017/0132513 (hereinafter Yu) in view of Golovashkin et al. US Patent Application Publication No. 2017/0046614 (hereinafter Golovashkin).
Regarding claim 1, Yu discloses an optimization apparatus comprising: 
“one or more memories” (see Yu pars. 0004, 0051, Any devices performing neural network operations, e.g., devices 116-122, can include a memory, e.g., a random access memory (RAM), for storing instructions and data and a processor for executing stored instructions); and one or more processors configured to, for an operation node constituting an operation of a neural network: 
“calculate” (see Yu par. 0015, Determining (calculate) that multiple iterations of one or more particular operations represented by one or more particular nodes in the computational graph are performed during execution of the computational graph may comprise analyzing the computational graph to identify one or more control flow nodes in the computational graph that cause the particular operations represented by the one or more particular nodes in the computational graph to be performed multiple times. The neural network may be a recurrent neural network that receives a respective neural network input at each of a plurality of time steps and generates a respective neural network at each of the plurality of time steps. The operations represented by each of the particular nodes may generate a respective node output for each of the plurality of time steps, and the monitoring nodes may store the respective node outputs for each of the plurality of time steps); and 
“acquire data for the operation node whose operation result is to be stored, based on” (see Yu par. 0015, Storing the output of the particular operation represented by the node during the iteration may include asynchronously sending the data from a device on which it was produced to a central processing unit for storage after the data was produced and asynchronously retrieving the data from the center processing unit for use on the device in the backward path through the computational graph that represents operations for computing the gradients of the objective function with respect to the parameters of the neural network. Training the neural network using the machine learning training algorithm by executing the training computational graph may comprise allocating the nodes in the training computational graph across a plurality of devices and causing each of the devices to perform the operations represented by the nodes allocated to the device). 
Yu does not explicitly disclose a time consumption. However, in analogous art, Golovashkin discloses a time consumption (see Golovashkin par. 0067, Before training, the vertices of each layer are fully connected with the vertices of a next layer. A neural network model can have hundreds of billions of edges. The burden of operating, during training or production, so big a neural network is computationally excessive, consuming relatively vast amount computer resources and time to compute. Computation is reduced by simplifying the graph of neural network 110).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the application to incorporate the teachings of Golovashkin into the system of Yu, because a sample size that is too big may cause sampling to consume excessive time, whereas a sample size that is too small may increase the number of training iterations needed for convergence, which also consumes excessive time (see Golovashkin par. 0107).

Regarding claim 2, Yu in view of Golovashkin discloses the optimization apparatus according to claim 1,
Yu further discloses wherein the one or more processors are further configured to: calculate a memory consumption for recomputing the operation result of the focused operation node, wherein the acquired data is further based on the memory consumption (see Yu par. 0082, respective nodes outputs may be produced on a device, such as a GPU, with limited memory. Storing respective node outputs for each time step may lead to numerous values being stored on a stack, reducing the amount of device memory available for other things. Furthermore, old values are stored the longest since backpropagation uses values in reverse order of the forward propagation).  

Regarding claim 3, Yu in view of Golovashkin discloses the optimization apparatus according to claim 2, 
Yu further discloses wherein the one or more processors are configured to calculate the memory consumption using a lower set capable of performing recomputation of the focused operation node by the operation node included in the lower set, the lower set being based on an operation sequence in a forward propagation process in the representation (see Yu par. 0081, where the neural network is a recurrent neural network, the operations represented by each of the particular nodes generate a respective node output for each of the time steps, and the monitoring nodes store the respective node outputs for each of the time steps, i.e., so that the outputs of the operations of the particular nodes for all of the time steps are available when the backward pass begins after the neural network output for the last time step is computed. In other words, to reuse forward values in the backward propagation path, the example system detects, during the construction of the backpropagation path, the forward values that are needed in the backpropagation. For each forward value, the system introduces a stack and adds nodes, such as “Iteration Counter” operations, in the forward propagation path to save the forward values at each iteration to the stack. The backpropagation path uses these values from the stack in reverse order).
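For illustration only, the stack mechanism described in the cited Yu par. 0081 can be sketched in Python as follows. The function names forward_with_stack and backward_from_stack, and the use of squaring as the per-step operation, are hypothetical and are not taken from Yu or from the claims. The sketch shows only that forward values are pushed at each iteration and consumed in reverse order during the backward pass.

```python
def forward_with_stack(inputs):
    """Run a toy per-step operation (y = x * x) and push each forward value."""
    stack = []
    outputs = []
    for x in inputs:
        stack.append(x)        # save the forward value needed by the backward pass
        outputs.append(x * x)
    return outputs, stack


def backward_from_stack(stack, upstream_grads):
    """Pop the saved forward values in reverse order; d(x*x)/dx = 2x."""
    grads = []
    for g in reversed(upstream_grads):
        x = stack.pop()        # the last value pushed is the first value used
        grads.append(2 * x * g)
    grads.reverse()            # report gradients in forward (time-step) order
    return grads


outputs, stack = forward_with_stack([1.0, 2.0, 3.0])
grads = backward_from_stack(stack, [1.0, 1.0, 1.0])
```

In this sketch the stack grows by one entry per time step, which corresponds to the memory growth noted in Yu par. 0082.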

Regarding claim 4, Yu in view of Golovashkin discloses the optimization apparatus according to claim 3, 
Yu further discloses wherein the one or more processors are configured to calculate the memory consumption based on a memory consumption in an area of nodes until the focused operation node is reached in the forward propagation process (see Yu par. 0070,  The system begins a backward propagation path with the last operation node in the forward path. The system then adds the differentiated operations of the forward propagation path in reverse order to the backward propagation path until the system reaches the first node of the forward propagation path. For example, if a forward propagation path includes operations A, B, and C, the backward propagation will include C′, B′, and finally A′). 
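For illustration only, the ordering described in the cited Yu par. 0070 can be sketched in Python as follows. The function name build_backward_path and the representation of operations as strings are hypothetical and are not taken from Yu. The sketch mirrors the A, B, C example quoted above.

```python
def build_backward_path(forward_path):
    """Return the differentiated operations in reverse order of the forward path."""
    return [op + "'" for op in reversed(forward_path)]


# Forward path A, B, C yields backward path C', B', A'.
backward_path = build_backward_path(["A", "B", "C"])
```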

Regarding claim 5, Yu in view of Golovashkin discloses the optimization apparatus according to claim 3, 
Yu further discloses wherein the one or more processors are configured to calculate the memory consumption based on a memory consumption for storing the operation result of the focused operation node (see Yu par. 0082, Storing respective node outputs for each time step may lead to numerous values being stored on a stack, reducing the amount of device memory available for other things). 

Regarding claim 6, Yu in view of Golovashkin discloses the optimization apparatus according to claim 3, 
Yu further discloses wherein the one or more processors are configured to calculate the memory consumption based on a memory consumption for storing an operation result of the lower set having the focused operation node as a boundary (see Yu par. 0006, Each node represents a respective operation performed by the neural network as part of determining a neural network output from a neural network input, each connector directed edge connects a respective first node to a respective second node that represents an operation that receives, as input, an output of an operation represented by the respective first node, and each parameter directed edge connects into a respective node and represents a flow of one or more parameters of the neural network as input to the operation represented by the respective node). 

Regarding claim 7, Yu in view of Golovashkin discloses the optimization apparatus according to claim 3, 
Golovashkin further discloses wherein the one or more processors are configured to calculate, when using an operation result of a gradient in the another operation node at the time of operating a gradient in the focused operation node, the memory consumption based on a memory consumption for storing the operation result of the gradient in the another operation node (see Golovashkin pars. 0071-0072, The second phase is a forward-backward pass over the graph. During the forward-backward pass, computer 100 calculates a value of and a gradient of the objective function on the full graph, and calculates the Hessian matrix 140 on the sparsified or reduced graph. The elements of sparse Hessian matrix 140 are coefficients calculated as partial second derivatives of edge weights. Calculation of sparse Hessian matrix 140 during an iteration is based on sparse Hessian matrix 140 of the previous iteration. As sparsification removes edges from the graph, the graph becomes sparser. This makes the Hessian matrix 140 sparser, such that more coefficients of sparse Hessian matrix 140 become zero, with zero representing a removed edge. Computer 100 may store sparse Hessian matrix 140 in a format optimized for a sparse matrix). 
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the application to incorporate the teachings of Golovashkin into the system of Yu, because a sample size that is too big may cause sampling to consume excessive time, whereas a sample size that is too small may increase the number of training iterations needed for convergence, which also consumes excessive time (see Golovashkin par. 0107).

Regarding claim 8, Yu in view of Golovashkin discloses the optimization apparatus according to claim 3, 
Yu further discloses wherein the one or more processors are configured to calculate the time consumption by calculating a recomputation time from the operation node whose operation result is stored in the lower set having the focused operation node as a boundary (see Yu par. 0008, inserting a plurality of gradient nodes and training edges into the computational graph to generate a backward path through the computational graph that represents operations for computing the gradients of the objective function with respect to parameters flowing along a respective parameter directed edge in the computational graph; and training the neural network using the machine learning training algorithm by executing the training computational graph).  

Regarding claim 9, Yu in view of Golovashkin discloses the optimization apparatus according to claim 3, 
Yu further discloses wherein the one or more processors are configured to calculate the memory consumption while excluding at least part of operation nodes not used for the recomputation (see Yu par. 0015, Determining that multiple iterations of one or more particular operations represented by one or more particular nodes in the computational graph are performed during execution of the computational graph may comprise analyzing the computational graph to identify one or more control flow nodes in the computational graph that cause the particular operations represented by the one or more particular nodes in the computational graph to be performed multiple times).  

Regarding claim 10, Yu in view of Golovashkin discloses the optimization apparatus according to claim 2, 
Golovashkin further discloses wherein the one or more processors are configured to acquire, when the memory consumption has been calculated, a memory consumption whose corresponding time consumption is minimum (see Golovashkin par. 0067, The burden of operating, during training or production, so big a neural network is computationally excessive, consuming relatively vast amount computer resources and time to compute. Computation is reduced by simplifying the graph of neural network 110. Use of a simplified graph of neural network 110 requires less computer resources and time to compute the Hessian matrix).  
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the application to incorporate the teachings of Golovashkin into the system of Yu, because a sample size that is too big may cause sampling to consume excessive time, whereas a sample size that is too small may increase the number of training iterations needed for convergence, which also consumes excessive time (see Golovashkin par. 0107).
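For illustration only, the claim 10 limitation of acquiring a memory consumption whose corresponding time consumption is minimum can be sketched in Python as follows. The function name memory_with_min_time, the pairing of values as tuples, and the numeric values are all hypothetical; they are not taken from the application, from Yu, or from Golovashkin.

```python
def memory_with_min_time(candidates):
    """Pick the memory consumption paired with the smallest time consumption.

    candidates: list of (memory_consumption, time_consumption) pairs.
    """
    memory, _time = min(candidates, key=lambda pair: pair[1])
    return memory


# Hypothetical candidates: (memory in MB, time in seconds).
candidates = [(512, 9.0), (768, 6.5), (1024, 7.0)]
best_memory = memory_with_min_time(candidates)
```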

Regarding claim 11, Yu discloses an optimization method for an operation node constituting an operation of a neural network, the method comprising: 
“calculating, by one or more processors,” (see Yu par. 0015, Determining (calculate) that multiple iterations of one or more particular operations represented by one or more particular nodes in the computational graph are performed during execution of the computational graph may comprise analyzing the computational graph to identify one or more control flow nodes in the computational graph that cause the particular operations represented by the one or more particular nodes in the computational graph to be performed multiple times. The neural network may be a recurrent neural network that receives a respective neural network input at each of a plurality of time steps and generates a respective neural network at each of the plurality of time steps. The operations represented by each of the particular nodes may generate a respective node output for each of the plurality of time steps, and the monitoring nodes may store the respective node outputs for each of the plurality of time steps); and 
“acquiring, by the one or more processors, data for the operation node whose operation result is to be stored, based on” (see Yu par. 0015, Storing the output of the particular operation represented by the node during the iteration may include asynchronously sending the data from a device on which it was produced to a central processing unit for storage after the data was produced and asynchronously retrieving the data from the center processing unit for use on the device in the backward path through the computational graph that represents operations for computing the gradients of the objective function with respect to the parameters of the neural network. Training the neural network using the machine learning training algorithm by executing the training computational graph may comprise allocating the nodes in the training computational graph across a plurality of devices and causing each of the devices to perform the operations represented by the nodes allocated to the device).
Yu does not explicitly disclose a time consumption. However, in analogous art, Golovashkin discloses a time consumption (see Golovashkin par. 0067, Before training, the vertices of each layer are fully connected with the vertices of a next layer. A neural network model can have hundreds of billions of edges. The burden of operating, during training or production, so big a neural network is computationally excessive, consuming relatively vast amount computer resources and time to compute. Computation is reduced by simplifying the graph of neural network 110).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the application to incorporate the teachings of Golovashkin into the system of Yu, because a sample size that is too big may cause sampling to consume excessive time, whereas a sample size that is too small may increase the number of training iterations needed for convergence, which also consumes excessive time (see Golovashkin par. 0107).

Regarding claim 12, Yu in view of Golovashkin discloses the optimization method according to claim 11, 
Yu further discloses calculating, by the one or more processors, a memory consumption for recomputing the operation result of the focused operation node; and acquiring, by the one or more processors, the data further based on the memory consumption (see Yu par. 0082, respective nodes outputs may be produced on a device, such as a GPU, with limited memory. Storing respective node outputs for each time step may lead to numerous values being stored on a stack, reducing the amount of device memory available for other things. Furthermore, old values are stored the longest since backpropagation uses values in reverse order of the forward propagation).   

Regarding claim 13, Yu in view of Golovashkin discloses the optimization method according to claim 12, 
Yu further discloses calculating, by the one or more processors, the memory consumption using a lower set capable of performing recomputation of the focused operation node by the operation node included in the lower set, the lower set being based on an operation sequence in a forward propagation process in the representation (see Yu par. 0081, where the neural network is a recurrent neural network, the operations represented by each of the particular nodes generate a respective node output for each of the time steps, and the monitoring nodes store the respective node outputs for each of the time steps, i.e., so that the outputs of the operations of the particular nodes for all of the time steps are available when the backward pass begins after the neural network output for the last time step is computed. In other words, to reuse forward values in the backward propagation path, the example system detects, during the construction of the backpropagation path, the forward values that are needed in the backpropagation. For each forward value, the system introduces a stack and adds nodes, such as “Iteration Counter” operations, in the forward propagation path to save the forward values at each iteration to the stack. The backpropagation path uses these values from the stack in reverse order).
  
Regarding claim 14, Yu in view of Golovashkin discloses the optimization method according to claim 13, 
Yu further discloses calculating, by the one or more processors, the memory consumption based on a memory consumption in an area of nodes until the focused operation node is reached in the forward propagation process (see Yu par. 0070,  The system begins a backward propagation path with the last operation node in the forward path. The system then adds the differentiated operations of the forward propagation path in reverse order to the backward propagation path until the system reaches the first node of the forward propagation path. For example, if a forward propagation path includes operations A, B, and C, the backward propagation will include C′, B′, and finally A′). 
 
Regarding claim 15, Yu in view of Golovashkin discloses the optimization method according to claim 12, 
Golovashkin further discloses acquiring, by the one or more processors, when the memory consumption has been calculated, a memory consumption whose corresponding time consumption is minimum (see Golovashkin par. 0067, The burden of operating, during training or production, so big a neural network is computationally excessive, consuming relatively vast amount computer resources and time to compute. Computation is reduced by simplifying the graph of neural network 110. Use of a simplified graph of neural network 110 requires less computer resources and time to compute the Hessian matrix).  
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the application to incorporate the teachings of Golovashkin into the system of Yu, because a sample size that is too big may cause sampling to consume excessive time, whereas a sample size that is too small may increase the number of training iterations needed for convergence, which also consumes excessive time (see Golovashkin par. 0107).
  
Regarding claim 16, Yu discloses a non-transitory computer readable medium storing a program configured to cause one or more processors to, for an operation node constituting an operation of a neural network:  
“calculate” (see Yu par. 0015, Determining (calculate) that multiple iterations of one or more particular operations represented by one or more particular nodes in the computational graph are performed during execution of the computational graph may comprise analyzing the computational graph to identify one or more control flow nodes in the computational graph that cause the particular operations represented by the one or more particular nodes in the computational graph to be performed multiple times. The neural network may be a recurrent neural network that receives a respective neural network input at each of a plurality of time steps and generates a respective neural network at each of the plurality of time steps. The operations represented by each of the particular nodes may generate a respective node output for each of the plurality of time steps, and the monitoring nodes may store the respective node outputs for each of the plurality of time steps); and 
“acquire data for the operation node whose operation result is to be stored, based on” (see Yu par. 0015, Storing the output of the particular operation represented by the node during the iteration may include asynchronously sending the data from a device on which it was produced to a central processing unit for storage after the data was produced and asynchronously retrieving the data from the center processing unit for use on the device in the backward path through the computational graph that represents operations for computing the gradients of the objective function with respect to the parameters of the neural network. Training the neural network using the machine learning training algorithm by executing the training computational graph may comprise allocating the nodes in the training computational graph across a plurality of devices and causing each of the devices to perform the operations represented by the nodes allocated to the device).
Yu does not explicitly disclose a time consumption. However, in analogous art, Golovashkin discloses a time consumption (see Golovashkin par. 0067, Before training, the vertices of each layer are fully connected with the vertices of a next layer. A neural network model can have hundreds of billions of edges. The burden of operating, during training or production, so big a neural network is computationally excessive, consuming relatively vast amount computer resources and time to compute. Computation is reduced by simplifying the graph of neural network 110).
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the application to incorporate the teachings of Golovashkin into the system of Yu, because a sample size that is too big may cause sampling to consume excessive time, whereas a sample size that is too small may increase the number of training iterations needed for convergence, which also consumes excessive time (see Golovashkin par. 0107).

Regarding claim 17, Yu in view of Golovashkin discloses the non-transitory computer readable medium according to claim 16, 
Yu further discloses wherein the one or more processors are caused to: calculate a memory consumption for recomputing the operation result of the focused operation node; and acquire the data further based on the memory consumption (see Yu par. 0082, respective nodes outputs may be produced on a device, such as a GPU, with limited memory. Storing respective node outputs for each time step may lead to numerous values being stored on a stack, reducing the amount of device memory available for other things. Furthermore, old values are stored the longest since backpropagation uses values in reverse order of the forward propagation).  
 
Regarding claim 18, Yu in view of Golovashkin discloses the non-transitory computer readable medium according to claim 17, 
Yu further discloses wherein the one or more processors are caused to calculate the memory consumption using a lower set capable of performing recomputation of the focused operation node by the operation node included in the lower set, the lower set being based on an operation sequence in a forward propagation process in the representation (see Yu par. 0081, where the neural network is a recurrent neural network, the operations represented by each of the particular nodes generate a respective node output for each of the time steps, and the monitoring nodes store the respective node outputs for each of the time steps, i.e., so that the outputs of the operations of the particular nodes for all of the time steps are available when the backward pass begins after the neural network output for the last time step is computed. In other words, to reuse forward values in the backward propagation path, the example system detects, during the construction of the backpropagation path, the forward values that are needed in the backpropagation. For each forward value, the system introduces a stack and adds nodes, such as “Iteration Counter” operations, in the forward propagation path to save the forward values at each iteration to the stack. The backpropagation path uses these values from the stack in reverse order).
 
Regarding claim 19, Yu in view of Golovashkin discloses the non-transitory computer readable medium according to claim 18, 
Yu further discloses wherein the one or more processors are caused to calculate the memory consumption based on a memory consumption in an area of nodes until the focused operation node is reached in the forward propagation process (see Yu par. 0070,  The system begins a backward propagation path with the last operation node in the forward path. The system then adds the differentiated operations of the forward propagation path in reverse order to the backward propagation path until the system reaches the first node of the forward propagation path. For example, if a forward propagation path includes operations A, B, and C, the backward propagation will include C′, B′, and finally A′). 
 
Regarding claim 20, Yu in view of Golovashkin discloses the non-transitory computer readable medium according to claim 17, 
Golovashkin further discloses wherein the one or more processors are caused to acquire, when the memory consumption has been calculated, a memory consumption whose corresponding time consumption is minimum (see Golovashkin par. 0067, The burden of operating, during training or production, so big a neural network is computationally excessive, consuming relatively vast amount computer resources and time to compute. Computation is reduced by simplifying the graph of neural network 110. Use of a simplified graph of neural network 110 requires less computer resources and time to compute the Hessian matrix).  
Therefore, it would have been obvious to a person of ordinary skill in the art before the effective filing date of the application to incorporate the teachings of Golovashkin into the system of Yu, because a sample size that is too big may cause sampling to consume excessive time, whereas a sample size that is too small may increase the number of training iterations needed for convergence, which also consumes excessive time (see Golovashkin par. 0107).



Conclusion
Any inquiry concerning this communication or earlier communications from the examiner
should be directed to SAMUEL AMBAYE whose telephone number is (571)270-7635. The examiner can
normally be reached M-F 9:00 AM - 6:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jeffrey Pwu, can be reached at (571) 272-6798. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/SAMUEL AMBAYE/Examiner, Art Unit 2433
/FATOUMATA TRAORE/Primary Examiner, Art Unit 2436