DETAILED ACTION
This action is in response to the claims filed 12/21/2021 for application 16/014,503. Claims 1, 8 and 15 have been amended. Claims 1, 3-8, 10-15, and 17-21 are currently pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.


Claims 1, 8, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Tamano et al. ("Optimizing Multiple Machine Learning Jobs on MapReduce", hereinafter "Tamano") in view of Chen et al. ("Tiled-MapReduce: Efficient and Flexible MapReduce Processing on Multicore with Tiling", hereinafter "Chen") and further in view of Zhang et al. ("Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters", hereinafter "Zhang") and further in view of Campos et al. ("Distributed training strategies for a computer vision deep learning algorithm on a distributed GPU cluster", hereinafter "Campos") and Kotthaus et al. ("RAMBO: Resource-Aware Model-Based Optimization with Scheduling for Heterogeneous Runtimes and a Comparison with Asynchronous Model-Based Optimization", hereinafter "Kotthaus").

Regarding claim 1, Tamano teaches A method for efficient machine and deep learning hyperparameter tuning in a distributed computing system (“Recently, MapReduce has been used to parallelize machine learning algorithms. To obtain the best performance for these algorithms, tuning the parameters of the algorithms is required.” [Abstract; pg. 59; col 1, lines 1-4]), by a processor, comprising:
	collecting runtime metrics of each of a plurality of training iterations to identify candidate jobs (“When we execute a learning process using various parameters on MapReduce, there are various patterns for assigning multiple learning jobs to a cluster, and the total execution time varies depending on the assignment patterns. (Tamano discloses: The better pattern depends on the job characteristics. To execute jobs efficiently, we need to choose the best assignment among the various patterns. [pg. 59; col 2, lines 35-38]) For example, we have twenty nodes in a cluster and execute a learning job twenty times using different parameters.” [pg. 59; col 2, lines 14-21]) to merge during an execution phase (“Table I summarizes how much data each node reads, how many jobs each node executes, and how much computation each node requires for each partitioning pattern.” [pg. 62; col 1, lines 24-26]), wherein the candidate jobs comprise hyperparameter search jobs based on a training dataset (“When we execute a learning process using various parameters on MapReduce, there are various patterns for assigning multiple learning jobs to a cluster, and the total execution time varies depending on the assignment patterns. For example, we have twenty nodes in a cluster and execute a learning job twenty times using different parameters. Fig. 1 shows two patterns.” [Fig. 1; pg. 59; col 2, lines 14-20]);
        identifying the candidate jobs based on the collected runtime metrics (“To evaluate the proposed method, we implemented experimental MapReduce runtime based on the Message Passing Interface (MPI) and executed logistic regression in four cases. The results showed that the proposed method can correctly predict the optimal job assignment, which results in minimum execution time.” [pg. 60, left col, ¶2; note: Examiner is interpreting predicting to be equivalent to identifying. The prediction of optimal jobs is based off MapReduce runtime.]) 
according to a memory requirement for each current and previous training iteration of the plurality of training iterations of each of the candidate jobs (“We proposed the method for optimizing the job assignment for machine learning to minimize the total execution time. Our method uses extended MapReduce execution, memory based execution and job integration, for machine learning and optimizes the job assignment based on the execution. We developed an execution cost model to predict the execution time of these jobs on the extended execution. Minimizing the cost model derived the optimal assignment.” [pg. 66, § Conclusion, ¶1-2]) 
grouping the candidate jobs into job groups (“Twenty learning jobs with different parameters are assigned to the group. MapReduce runs on twenty nodes in parallel. On the other hand, the right pattern shows that the cluster is partitioned into ten groups. Each group consists of two nodes. Two learning jobs with different parameters are assigned to each group. Since there are ten groups, twenty jobs are executed in total.” [pg. 59; col 2, lines 22-28]); and
merging the job groups containing the candidate jobs together prior to executing the candidate jobs during the execution phase (“Since our runtime supports job integration, the forty jobs are integrated and executed so as not to read the data set forty times. Pattern B partitions the cluster into two groups and assigns twenty MapReduce jobs to each group. Twenty jobs are integrated and executed in each group.” [pg. 62; col 1, lines 9 – 13]), wherein the merging of the job groups for execution is performed for each of a plurality of accelerator devices, inclusive of the particular accelerator device (“In pattern B, one MapReduce job runs on ten nodes in parallel. Assuming that we have a 40-GB data set, we observe the number of MapReduce jobs each node handles and the data size each node required to read for the jobs” [pg. 62, col 1, lines 15-18; each node implies the particular accelerator device is included.]), performing the execution (“we have twenty nodes (i.e. accelerator devices) in a cluster and execute a learning job twenty times using different parameters.” [pg. 59; col 2, lines 17-18]).
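For illustration only, the partitioning patterns quoted above from Tamano can be summarized with a short sketch. This is not Tamano's implementation. The names `PatternSummary` and `summarize_pattern` are hypothetical, and the sketch assumes that the integrated jobs of a group share a single pass over the data set and that the pass is split evenly across the nodes of that group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternSummary:
    groups: int
    nodes_per_group: int
    jobs_per_group: int
    data_read_per_node_gb: float


def summarize_pattern(total_nodes: int, total_jobs: int,
                      groups: int, dataset_gb: float) -> PatternSummary:
    """Summarize one way of partitioning a cluster into equal job groups."""
    if groups <= 0 or total_nodes % groups or total_jobs % groups:
        raise ValueError("groups must evenly divide the nodes and the jobs")
    nodes_per_group = total_nodes // groups
    jobs_per_group = total_jobs // groups
    # Assumption: jobs integrated within a group read the data set once,
    # and that single read is divided evenly among the group's nodes.
    data_read_per_node_gb = dataset_gb / nodes_per_group
    return PatternSummary(groups, nodes_per_group, jobs_per_group,
                          data_read_per_node_gb)


# Pattern B as quoted: twenty nodes, forty jobs, two groups, 40-GB data set.
pattern_b = summarize_pattern(total_nodes=20, total_jobs=40,
                              groups=2, dataset_gb=40.0)
```

Under these assumptions, pattern B yields ten nodes and twenty integrated jobs per group, which matches the quoted passage.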
However, Tamano fails to explicitly teach that the memory requirement is a computed memory footprint;
	wherein the memory footprint is computed, for a given job of the candidate jobs and for a given iteration of the plurality of training iterations, by determining an in-graphical processing unit (in-GPU) memory consumption estimated to be required of a particular accelerator device executing the given job, and wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the training dataset, batch size, and model type configuration parameters of the given job;
wherein the candidate jobs are initially grouped into the job groups according to the collected runtime metrics determined during the plurality of training iterations;
           Chen teaches, in disclosing a MapReduce application execution similar to the MapReduce application disclosed by Tamano, a computed memory footprint as a memory requirement, wherein the memory footprint is computed, for a given job of the candidate jobs and for a given iteration of the plurality of training iterations (“Ostrich also reduces the memory footprint of MapReduce applications in their whole lifecycle, through tiling workloads and reusing buffers. Figure 16 shows both the size and the time of memory consumption for WC on Ostrich is significantly better than that on Phoenix. The increment of memory consumption on Ostrich is less and more steady, since the Input Buffer and Intermediate Buffer are allocated in the first iteration and reused among the rest of the iterations. On the contrary, the memory consumption on Phoenix increases with the processing of input data, and the stale data occupies the memory and is not released until the entire job is finished.” [pg. 3:19, § 7.3.2. Memory Footprint; See further: "Tiled-MapReduce provides good opportunities to exploit the memory hierarchy by limiting the footprint of a subjob within a certain range…” [pg. 3:19, § 7.4.2: Relevance of Iteration Size; note: computed memory footprint is implicit.]])
Tamano and Chen are both in the same field of endeavor of optimizing machine learning jobs and thus are analogous. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Tamano’s machine learning algorithm to compute memory footprints of each candidate job for a plurality of iterations as taught by Chen. One would have been motivated to make this modification in order to limit memory consumption and optimize computing resources to increase performance. [pg. 3:2, § 1. Introduction, ¶4, Chen]
However, Tamano/Chen fails to explicitly teach by determining an in-graphical processing unit (in-GPU) memory consumption estimated to be required of a particular accelerator device executing the given job, and wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the training dataset, batch size, and model type configuration parameters of the given job;
wherein the candidate jobs are initially grouped into the job groups according to the collected runtime metrics determined during the plurality of training iterations;
Zhang teaches by determining an in-graphical processing unit (in-GPU) memory consumption estimated to be required of a particular accelerator device executing the given job (“Poseidon reports 32x and 28x speedups when training GoogLeNet and VGG19 with 4 nodes (32 GPUs in total), confirming our statement that the overheads caused by memory movement between GPUs are usually negligible compared to network communication” [pg. 189, § Multi-GPU Settings, ¶1; See further “The Move API takes care of the memory movement between RAM and GPU memory, and performs necessary computation, e.g., the transformation between SFs and gradients, and the application of updates.” [pg. 186, § Client Library, ¶2]]), and wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the model type configuration parameters of the given job (“Therefore, the synchronization overheads depend not only on the model (type, shape, size of the layer), but also the size of the clusters. The optimal solution usually changes with M,N,K,P1,P2. HybComm takes into account these factors and allows to dynamically adjust the communication method for different parts of a model – it always chooses the best method from available ones whenever it results in fewer communication overheads.” [pg. 185, 3.2 Hybrid Communication, ¶2]);
	Tamano, Chen, and Zhang are all in the same field of endeavor of distributed learning and thus are analogous. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Zhang discloses an efficient communication architecture for GPU clusters. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Tamano’s/Chen’s teachings to compute GPU memory consumption as taught by Zhang. One would have been motivated to make this modification in order to allow more data batches to be processed using the high throughput of GPUs. [Abstract, Zhang]
	Although Zhang teaches wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the model type configuration parameters of the given job, the reference fails to explicitly teach wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the training dataset and batch size of the given job.
	Campos teaches wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the training dataset (“Each worker has an effective batch size of 128 samples, i.e. 32 images are processed at a time by each GPU. To prevent overfitting, data augmentation consisting in random crops and/or horizontal flips is asynchronously performed on CPU while previous batches are processed by the GPUs. The CNN weights are initialized using a model pre-trained on ILSVRC, practice that has been proven beneficial even when training on large-scale datasets” [pg. 319, § 6 Experimental setup, ¶3]) and batch size of the given job (“Despite the huge increase in the overall depth, a ResNet with 50 layers has roughly half the parameters in AlexNet. However, the impact of an increased depth is more notorious in the memory footprint of deeper architectures, which store more intermediate results coming from the output of each single layer, thus benefiting from multi-GPU setups that allow the use of larger batch sizes” [pg. 317, § 4. CNN architecture, ¶1]).
	Tamano, Chen, Zhang, and Campos are all in the same field of endeavor of distributed learning and thus are analogous. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Zhang discloses an efficient communication architecture for GPU clusters. Campos discloses training strategies on a distributed GPU cluster. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Tamano’s/Chen’s/Zhang’s teachings to compute a memory cost of a job as a function of a training dataset and batch size as taught by Campos. One would have been motivated to make this modification in order to optimize the use of resources in a distributed environment. [pg. 316, § Related Work, ¶1-2, Campos]
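For illustration only, the claimed limitation of computing a memory cost as a function of the training dataset, batch size, and model type can be sketched as below. The formula is a simplified assumption and is not taken from the application or from any cited reference. The function name `estimate_gpu_memory_bytes`, its parameters, the factor of two for weights plus gradients, and the default of 4 bytes per element are all hypothetical.

```python
def estimate_gpu_memory_bytes(param_count: int,
                              activations_per_sample: int,
                              sample_elements: int,
                              batch_size: int,
                              bytes_per_element: int = 4) -> int:
    """Rough per-iteration GPU memory estimate for one training job.

    param_count and activations_per_sample stand in for the model type,
    sample_elements stands in for the training dataset (elements per
    sample), and batch_size is the per-device batch size.
    """
    if min(param_count, activations_per_sample,
           sample_elements, batch_size) < 0:
        raise ValueError("all sizes must be non-negative")
    # Assumption: one copy of the weights plus one copy of the gradients.
    model_bytes = 2 * param_count * bytes_per_element
    # Assumption: input data and stored intermediate results both scale
    # linearly with the batch size.
    batch_bytes = (batch_size
                   * (sample_elements + activations_per_sample)
                   * bytes_per_element)
    return model_bytes + batch_bytes


small_batch = estimate_gpu_memory_bytes(1000, 500, 100, batch_size=32)
large_batch = estimate_gpu_memory_bytes(1000, 500, 100, batch_size=128)
```

In this sketch the estimate grows with the batch size for a fixed model, which is the dependency the Campos passage on memory footprint and larger batch sizes is cited for.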

Kotthaus teaches wherein the candidate jobs are initially grouped into the job groups according to the collected runtime metrics determined during the plurality of training iterations (“First, we oversample a set of q > m candidate points from the qLCB criterion and partition them into $\hat{q} < q$ clusters using the Euclidean distance. Next, we take the candidate with maximum priority $p_j$ from each cluster and sort them according to their priority before pushing them to the list $\hat{J}$ of selected jobs. Selected jobs are removed from the clusters and empty clusters are eliminated. We repeat this procedure until we have moved all q jobs into the list $\hat{J}$. Finally, we assign new priorities $\hat{p}_j$ based on the order of $\hat{J}$, i.e. the first job in $\hat{J}$ gets the highest priority q and the last job gets the lowest priority 1. As a result, the set of candidates contains batches of jobs with similar priority that are spread in the domain space. The new priorities serve as input for scheduling which groups the q jobs to m CPUs using the runtime estimates $\hat{t}$” [pg. 186, 3.4 Refinement of Job Priorities via Clustering, ¶2-3; Kotthaus’ method organizes job candidates based on priority into a selected list and then is further used to group q jobs to m CPUs using runtime estimates, thus the examiner is interpreting this process as equivalent to candidate jobs being initially grouped into the job groups.]);
Tamano, Chen, Zhang, Campos and Kotthaus are all in the same field of endeavor of optimizing machine learning jobs and thus are analogous. Tamano [pg. 184, § 3. Resource-Aware Scheduling with Synchronous Model Update, Kotthaus]
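For illustration only, the priority refinement procedure quoted from Kotthaus can be sketched as follows. This is a minimal sketch rather than Kotthaus' implementation. The function name `refine_priorities` is hypothetical, and the sketch assumes the Euclidean-distance clustering has already been performed and is supplied as one cluster label per candidate job.

```python
from typing import Dict, List, Sequence, Tuple


def refine_priorities(priorities: Sequence[float],
                      cluster_labels: Sequence[int]
                      ) -> Tuple[List[int], Dict[int, int]]:
    """Order jobs round by round across clusters and assign new priorities."""
    if len(priorities) != len(cluster_labels):
        raise ValueError("one cluster label is required per job")
    clusters: Dict[int, List[int]] = {}
    for job, label in enumerate(cluster_labels):
        clusters.setdefault(label, []).append(job)

    ordered: List[int] = []  # the list of selected jobs
    while clusters:
        picks = []
        for label in list(clusters):
            members = clusters[label]
            # Take the candidate with maximum priority from this cluster.
            best = max(members, key=lambda j: priorities[j])
            members.remove(best)
            picks.append(best)
            if not members:
                del clusters[label]  # empty clusters are eliminated
        # Sort this round's picks by priority before pushing them.
        picks.sort(key=lambda j: priorities[j], reverse=True)
        ordered.extend(picks)

    q = len(ordered)
    # First job gets the highest priority q, last job gets priority 1.
    new_priorities = {job: q - rank for rank, job in enumerate(ordered)}
    return ordered, new_priorities


order, new_p = refine_priorities([0.9, 0.8, 0.1, 0.5, 0.7], [0, 0, 1, 1, 2])
```

The subsequent scheduling step that maps the q jobs to m CPUs using runtime estimates is outside the scope of this sketch.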


Regarding claim 8, Tamano teaches A system for efficient machine and deep learning hyperparameter tuning in a distributed computing system (“Recently, MapReduce has been used to parallelize machine learning algorithms. To obtain the best performance for these algorithms, tuning the parameters of the algorithms is required.” [Abstract; pg. 59; col 1, lines 1-4]), by a processor, comprising:
a processor executing instructions stored in a memory device; wherein the processor (“Each node consisted of Intel Xeon 2.00-GHz 4 Core, 12-GB memory, 178-MB/s HDD bandwidth, and 64-bit Linux (v2.6.26-2) and connected it to a 1-Gbps network.” [pg. 64, § Experimental Settings, ¶1]):
	collects runtime metrics of each of a plurality of training iterations to identify candidate jobs (“When we execute a learning process using various parameters on MapReduce, there are various patterns for assigning multiple learning jobs to a cluster, and the total execution time varies depending on the assignment patterns. (Tamano discloses: The better pattern depends on the job characteristics. To execute jobs efficiently, we need to choose the best assignment among the various patterns. [pg. 59; col 2, lines 35-38]) For example, we have twenty nodes in a cluster and execute a learning job twenty times using different parameters.” [pg. 59; col 2, lines 14-21]) to merge during an execution phase (“Table I summarizes how much data each node reads, how many jobs each node executes, and how much computation each node requires for each partitioning pattern.” [pg. 62; col 1, lines 24-26]), wherein the candidate jobs comprise hyperparameter search jobs based on a training dataset (“When we execute a learning process using various parameters on MapReduce, there are various patterns for assigning multiple learning jobs to a cluster, and the total execution time varies depending on the assignment patterns. For example, we have twenty nodes in a cluster and execute a learning job twenty times using different parameters. Fig. 1 shows two patterns.” [Fig. 1; pg. 59; col 2, lines 14-20]);
        identify the candidate jobs based on the collected runtime metrics (“To evaluate the proposed method, we implemented experimental MapReduce runtime based on the Message Passing Interface (MPI) and executed logistic regression in four cases. The results showed that the proposed method can correctly predict the optimal job assignment, which results in minimum execution time.” [pg. 60, left col, ¶2; note: Examiner is interpreting predicting to be equivalent to identifying. The prediction of optimal jobs is based off MapReduce runtime.]) 
according to a memory requirement for each current and previous training iteration of the plurality of training iterations of each of the candidate jobs (“We proposed the method for optimizing the job assignment for machine learning to minimize the total execution time. Our method uses extended MapReduce execution, memory based execution and job integration, for machine learning and optimizes the job assignment based on the execution. We developed an execution cost model to predict the execution time of these jobs on the extended execution. Minimizing the cost model derived the optimal assignment.” [pg. 66, § Conclusion, ¶1-2]) 
groups the candidate jobs into job groups (“Twenty learning jobs with different parameters are assigned to the group. MapReduce runs on twenty nodes in parallel. On the other hand, the right pattern shows that the cluster is partitioned into ten groups. Each group consists of two nodes. Two learning jobs with different parameters are assigned to each group. Since there are ten groups, twenty jobs are executed in total.” [pg. 59; col 2, lines 22-28]); and
merges the job groups containing the candidate jobs together prior to executing the candidate jobs during the execution phase (“Since our runtime supports job integration, the forty jobs are integrated and executed so as not to read the data set forty times. Pattern B partitions the cluster into two groups and assigns twenty MapReduce jobs to each group. Twenty jobs are integrated and executed in each group.” [pg. 62; col 1, lines 9 – 13]), wherein the merging of the job groups for execution is performed for each of a plurality of accelerator devices, inclusive of the particular accelerator device (“In pattern B, one MapReduce job runs on ten nodes in parallel. Assuming that we have a 40-GB data set, we observe the number of MapReduce jobs each node handles and the data size each node required to read for the jobs” [pg. 62, col 1, lines 15-18; each node implies the particular accelerator device is included.]), performing the execution (“we have twenty nodes (i.e. accelerator devices) in a cluster and execute a learning job twenty times using different parameters.” [pg. 59; col 2, lines 17-18]).
However, Tamano fails to explicitly teach that the memory requirement is a computed memory footprint;
	wherein the memory footprint is computed, for a given job of the candidate jobs and for a given iteration of the plurality of training iterations, by determining an in-graphical processing unit (in-GPU) memory consumption estimated to be required of a particular accelerator device executing the given job, and wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the training dataset, batch size, and model type configuration parameters of the given job;
wherein the candidate jobs are initially grouped into the job groups according to the collected runtime metrics determined during the plurality of training iterations;
Chen teaches, in disclosing a MapReduce application execution similar to the MapReduce application disclosed by Tamano, a computed memory footprint as a memory requirement, wherein the memory footprint is computed, for a given job of the candidate jobs and for a given iteration of the plurality of training iterations (“Ostrich also reduces the memory footprint of MapReduce applications in their whole lifecycle, through tiling workloads and reusing buffers. Figure 16 shows both the size and the time of memory consumption for WC on Ostrich is significantly better than that on Phoenix. The increment of memory consumption on Ostrich is less and more steady, since the Input Buffer and Intermediate Buffer are allocated in the first iteration and reused among the rest of the iterations. On the contrary, the memory consumption on Phoenix increases with the processing of input data, and the stale data occupies the memory and is not released until the entire job is finished.” [pg. 3:19, § 7.3.2. Memory Footprint; See further: "Tiled-MapReduce provides good opportunities to exploit the memory hierarchy by limiting the footprint of a subjob within a certain range…” [pg. 3:19, § 7.4.2: Relevance of Iteration Size; note: computed memory footprint is implicit.]])
Tamano and Chen are both in the same field of endeavor of optimizing machine learning jobs and thus are analogous. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Tamano’s machine learning algorithm to compute memory footprints of each candidate job for a plurality of iterations as taught by Chen. One would have been motivated to make this modification in order to limit memory consumption and optimize computing resources to increase performance. [pg. 3:2, § 1. Introduction, ¶4, Chen]
	However, Tamano/Chen fails to explicitly teach by determining an in-graphical processing unit (in-GPU) memory consumption estimated to be required of a particular accelerator device executing the given job, and wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the training dataset, batch size, and model type configuration parameters of the given job;
wherein the candidate jobs are initially grouped into the job groups according to the collected runtime metrics determined during the plurality of training iterations;
Zhang teaches by determining an in-graphical processing unit (in-GPU) memory consumption estimated to be required of a particular accelerator device executing the given job (“Poseidon reports 32x and 28x speedups when training GoogLeNet and VGG19 with 4 nodes (32 GPUs in total), confirming our statement that the overheads caused by memory movement between GPUs are usually negligible compared to network communication” [pg. 189, § Multi-GPU Settings, ¶1; See further “The Move API takes care of the memory movement between RAM and GPU memory, and performs necessary computation, e.g., the transformation between SFs and gradients, and the application of updates.” [pg. 186, § Client Library, ¶2]]), and wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the model type configuration parameters of the given job (“Therefore, the synchronization overheads depend not only on the model (type, shape, size of the layer), but also the size of the clusters. The optimal solution usually changes with M,N,K,P1,P2. HybComm takes into account these factors and allows to dynamically adjust the communication method for different parts of a model – it always chooses the best method from available ones whenever it results in fewer communication overheads.” [pg. 185, 3.2 Hybrid Communication, ¶2]);
	Tamano, Chen, and Zhang are all in the same field of endeavor of distributed learning and thus are analogous. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Zhang discloses an efficient communication architecture for GPU clusters. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Tamano’s/Chen’s teachings to compute GPU memory consumption as taught by Zhang. One would have been motivated to make this modification in order to allow more data batches to be processed using the high throughput of GPUs. [Abstract, Zhang]
	Although Zhang teaches wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the model type configuration parameters of the given job, the reference fails to explicitly teach wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the training dataset and batch size of the given job.
	Campos teaches wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the training dataset (“Each worker has an effective batch size of 128 samples, i.e. 32 images are processed at a time by each GPU. To prevent overfitting, data augmentation consisting in random crops and/or horizontal flips is asynchronously performed on CPU while previous batches are processed by the GPUs. The CNN weights are initialized using a model pre-trained on ILSVRC, practice that has been proven beneficial even when training on large-scale datasets” [pg. 319, § 6 Experimental setup, ¶3]) and batch size of the given job (“Despite the huge increase in the overall depth, a ResNet with 50 layers has roughly half the parameters in AlexNet. However, the impact of an increased depth is more notorious in the memory footprint of deeper architectures, which store more intermediate results coming from the output of each single layer, thus benefiting from multi-GPU setups that allow the use of larger batch sizes” [pg. 317, § 4. CNN architecture, ¶1]).
	Tamano, Chen, Zhang, and Campos are all in the same field of endeavor of distributed learning and thus are analogous. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Zhang discloses an efficient communication architecture for GPU clusters. Campos discloses training strategies on a distributed GPU cluster. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Tamano’s/Chen’s/Zhang’s teachings to compute a memory cost of a job as a function of a training dataset and batch size as taught by Campos. One would have been motivated to make this modification in order to optimize the use of resources in a distributed environment. [pg. 316, § Related Work, ¶1-2, Campos]

Kotthaus teaches wherein the candidate jobs are initially grouped into the job groups according to the collected runtime metrics determined during the plurality of training iterations (“First, we oversample a set of q > m candidate points from the qLCB criterion and partition them into $\hat{q} < q$ clusters using the Euclidean distance. Next, we take the candidate with maximum priority $p_j$ from each cluster and sort them according to their priority before pushing them to the list $\hat{J}$ of selected jobs. Selected jobs are removed from the clusters and empty clusters are eliminated. We repeat this procedure until we have moved all q jobs into the list $\hat{J}$. Finally, we assign new priorities $\hat{p}_j$ based on the order of $\hat{J}$, i.e. the first job in $\hat{J}$ gets the highest priority q and the last job gets the lowest priority 1. As a result, the set of candidates contains batches of jobs with similar priority that are spread in the domain space. The new priorities serve as input for scheduling which groups the q jobs to m CPUs using the runtime estimates $\hat{t}$” [pg. 186, 3.4 Refinement of Job Priorities via Clustering, ¶2-3; Kotthaus’ method organizes job candidates based on priority into a selected list and then is further used to group q jobs to m CPUs using runtime estimates, thus the examiner is interpreting this process as equivalent to candidate jobs being initially grouped into the job groups.]);
Tamano, Chen, Zhang, Campos and Kotthaus are all in the same field of endeavor of optimizing machine learning jobs and thus are analogous. Tamano [pg. 184, § 3. Resource-Aware Scheduling with Synchronous Model Update, Kotthaus]

Regarding claim 15, Tamano teaches A computer program product for efficient machine and deep learning hyperparameter tuning in a distributed computing system, (“Recently, MapReduce has been used to parallelize machine learning algorithms. To obtain the best performance for these algorithms, tuning the parameters of the algorithms is required.” Abstract) by a processor, the computer program product embodied on a non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising (“Each node consisted of Intel Xeon 2.00-GHz 4 Core, 12-GB memory, 178-MB/s HDD bandwidth, and 64-bit Linux (v2.6.26-2) and connected it to a 1-Gbps network.” [pg. 64, § Experimental Settings, ¶1]): 
(“When we execute a learning process using various parameters on MapReduce, there are various patterns for assigning multiple learning jobs to a cluster, and the total execution time varies depending on the assignment patterns. (Tamano discloses: The better pattern depends on the job characteristics. To execute jobs efficiently, we need to choose the best assignment among the various patterns. [pg. 59; col 2, lines 35-38]) For example, we have twenty nodes in a cluster and execute a learning job twenty times using different parameters.” [pg. 59; col 2, lines 14-21]) to merge during an execution phase (“Table I summarizes how much data each node reads, how many jobs each node executes, and how much computation each node requires for each partitioning pattern.” [pg. 62; col 1, lines 24-26]), wherein the candidate jobs comprise hyperparameter search jobs based on a training dataset (“When we execute a learning process using various parameters on MapReduce, there are various patterns for assigning multiple learning jobs to a cluster, and the total execution time varies depending on the assignment patterns. For example, we have twenty nodes in a cluster and execute a learning job twenty times using different parameters. Fig. 1 shows two patterns.” [Fig. 1; pg. 59; col 2, lines 14-20]);
       	an executable portion that identifies the candidate jobs based on the collected runtime metrics (“To evaluate the proposed method, we implemented experimental MapReduce runtime based on the Message Passing Interface (MPI) and executed logistic regression in four cases. The results showed that the proposed method can correctly predict the optimal job assignment, which results in minimum execution time.” [pg. 60, left col, ¶2; note: Examiner is interpreting predicting to be equivalent to identifying. The prediction of optimal jobs is based off MapReduce runtime.]) 
according to a memory requirement for each current and previous training iteration of the plurality of training iterations of each of the candidate jobs (“We proposed the method for optimizing the job assignment for machine learning to minimize the total execution time. Our method uses extended MapReduce execution, memory based execution and job integration, for machine learning and optimizes the job assignment based on the execution. We developed an execution cost model to predict the execution time of these jobs on the extended execution. Minimizing the cost model derived the optimal assignment.” [pg. 66, § Conclusion, ¶1-2]) 
an executable portion that groups the candidate jobs into job groups (“Twenty learning jobs with different parameters are assigned to the group. MapReduce runs on twenty nodes in parallel. On the other hand, the right pattern shows that the cluster is partitioned into ten groups. Each group consists of two nodes. Two learning jobs with different parameters are assigned to each group. Since there are ten groups, twenty jobs are executed in total.” [pg. 59; col 2, lines 22-28]); and
an executable portion that merges the job groups containing the candidate jobs together prior to executing the candidate jobs during the execution phase (“Since our runtime supports job integration, the forty jobs are integrated and executed so as not to read the data set forty times. Pattern B partitions the cluster into two groups and assigns twenty MapReduce jobs to each group. Twenty jobs are integrated and executed in each group.” [pg. 62; col 1, lines 9 – 13]), wherein the merging of the job groups for execution is performed for each of a plurality of accelerator devices, inclusive of the particular accelerator device (“In pattern B, one MapReduce job runs on ten nodes in parallel. Assuming that we have a 40-GB data set, we observe the number of MapReduce jobs each node handles and the data size each node required to read for the jobs” [pg. 62, col 1, lines 15-18; each node implies the particular accelerator device is included.]), performing the execution (“we have twenty nodes (i.e., accelerator devices) in a cluster and execute a learning job twenty times using different parameters.” [pg. 59; col 2, lines 17-18]).
However Tamano fails to explicitly teach that the memory requirement is a computed memory footprint;
	wherein the memory footprint is computed, for a given job of the candidate jobs and for a given iteration of the plurality of training iterations, by determining an in-graphical processing unit (in-GPU) memory consumption estimated to be required of a particular accelerator device executing the given job, and wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the training dataset, batch size, and model type configuration parameters of the given job;
wherein the candidate jobs are initially grouped into the job groups according to the collected runtime metrics determined during the plurality of training iterations;
           Chen teaches, in disclosing a MapReduce application execution similar to the MapReduce application disclosed by Tamano, a computed memory footprint as a (“Ostrich also reduces the memory footprint of MapReduce applications in their whole lifecycle, through tiling workloads and reusing buffers. Figure 16 shows both the size and the time of memory consumption for WC on Ostrich is significantly better than that on Phoenix. The increment of memory consumption on Ostrich is less and more steady, since the Input Buffer and Intermediate Buffer are allocated in the first iteration and reused among the rest of the iterations. On the contrary, the memory consumption on Phoenix increases with the processing of input data, and the stale data occupies the memory and is not released until the entire job is finished.” [pg. 3:19, § 7.3.2. Memory Footprint; See further: "Tiled-MapReduce provides good opportunities to exploit the memory hierarchy by limiting the footprint of a subjob within a certain range…” [pg. 3:19, § 7.4.2: Relevance of Iteration Size; note: computed memory footprint is implicit.]])
Tamano and Chen are both in the same field of endeavor of optimizing machine learning jobs and thus are analogous. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Tamano’s machine learning algorithm to compute memory footprints of each candidate job for a plurality of iterations as taught by Chen. One would have been motivated to make this modification in order to limit memory [pg. 3:2, § 1. Introduction, ¶4, Chen]
	However Tamano/Chen fails to explicitly teach by determining an in-graphical processing unit (in-GPU) memory consumption estimated to be required of a particular accelerator device executing the given job, and wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the training dataset, batch size, and model type configuration parameters of the given job;
wherein the candidate jobs are initially grouped into the job groups according to the collected runtime metrics determined during the plurality of training iterations;
Zhang teaches by determining an in-graphical processing unit (in-GPU) memory consumption estimated to be required of a particular accelerator device executing the given job (“Poseidon reports 32x and 28x speedups when training GoogLeNet and VGG19 with 4 nodes (32 GPUs in total), confirming our statement that the overheads caused by memory movement between GPUs are usually negligible compared to network communication” [pg. 189, § Multi-GPU Settings, ¶1; See further “The Move API takes care of the memory movement between RAM and GPU memory, and performs necessary computation, e.g., the transformation between SFs and gradients, and the application of updates.” [pg. 186, § Client Library, ¶2]]), and wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the model type configuration parameters of the given job (“Therefore, the synchronization overheads depend not only on the model (type, shape, size of the layer), but also the size of the clusters. The optimal solution usually changes with M,N,K,P1,P2. HybComm takes into account these factors and allows to dynamically adjust the communication method for different parts of a model – it always chooses the best method from available ones whenever it results in fewer communication overheads.” [pg. 185, 3.2 Hybrid Communication, ¶2]);
	Tamano, Chen, and Zhang are all in the same field of endeavor of distributed learning and thus are analogous. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Zhang discloses an efficient communication architecture for GPU clusters. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Tamano’s/Chen’s teachings to compute GPU memory consumption as taught by Zhang. One would have been motivated to make this modification in order to allow more data batches to be processed using the high throughput of GPUs. [Abstract, Zhang]
	Although Zhang teaches wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the model type configuration parameters of the given job, the reference fails to explicitly teach wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the training dataset and batch size of the given job.
	Campos teaches wherein determining the in-GPU memory consumption comprises computing a memory cost for the given job for the given iteration as a function of the training dataset (“Each worker has an effective batch size of 128 samples, i.e. 32 images are processed at a time by each GPU. To prevent overfitting, data augmentation consisting in random crops and/or horizontal flips is asynchronously performed on CPU while previous batches are processed by the GPUs. The CNN weights are initialized using a model pre-trained on ILSVRC, practice that has been proven beneficial even when training on large-scale datasets” [pg. 319, § 6 Experimental setup, ¶3]) and batch size of the given job (“Despite the huge increase in the overall depth, a ResNet with 50 layers has roughly half the parameters in AlexNet. However, the impact of an increased depth is more notorious in the memory footprint of deeper architectures, which store more intermediate results coming from the output of each single layer, thus benefiting from multi-GPU setups that allow the use of larger batch sizes” [pg. 317, § 4. CNN architecture, ¶1]).
	Tamano, Chen, Zhang, and Campos are all in the same field of endeavor of distributed learning and thus are analogous. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Zhang discloses an efficient communication architecture for GPU clusters. Campos discloses training strategies on a distributed GPU cluster. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Tamano’s/Chen’s/Zhang’s teachings to compute a memory cost of a job as a function of a training dataset and batch size as taught by Campos. One would have been motivated to make this modification in order to optimize the use of resources in a distributed environment. [pg. 316, § Related Work, ¶1-2, Campos]

Kotthaus teaches wherein the candidate jobs are initially grouped into the job groups according to the collected runtime metrics determined during the plurality of training iterations (“First, we oversample a set of q > m candidate points from the qLCB criterion and partition them into $\hat{q} < q$ clusters using the Euclidean distance. Next, we take the candidate with maximum priority $p_j$ from each cluster and sort them according to their priority before pushing them to the list $\hat{J}$ of selected jobs. Selected jobs are removed from the clusters and empty clusters are eliminated. We repeat this procedure until we have moved all q jobs into the list $\hat{J}$. Finally, we assign new priorities $\hat{p}_j$ based on the order of $\hat{J}$, i.e. the first job in $\hat{J}$ gets the highest priority q and the last job gets the lowest priority 1. As a result, the set of candidates contains batches of jobs with similar priority that are spread in the domain space. The new priorities serve as input for scheduling which groups the q jobs to m CPUs using the runtime estimates $\hat{t}$” [pg. 186, 3.4 Refinement of Job Priorities via Clustering, ¶2-3; Kotthaus’ method organizes job candidates based on priority into a selected list and then is further used to group q jobs to m CPUs using runtime estimates, thus the examiner is interpreting this process as equivalent to candidate jobs being initially grouped into the job groups.]);
Tamano, Chen, Zhang, Campos and Kotthaus are all in the same field of endeavor of optimizing machine learning jobs and thus are analogous. Tamano [pg. 184, § 3. Resource-Aware Scheduling with Synchronous Model Update, Kotthaus]
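For illustration only, the priority-refinement procedure described in the Kotthaus passage quoted above can be sketched as follows. This is a minimal sketch and not Kotthaus' implementation: the function name `refine_priorities` and its arguments are hypothetical, and the Euclidean-distance clustering step is assumed to have been performed beforehand, so that each candidate job already carries a cluster label.

```python
from collections import defaultdict


def refine_priorities(priorities, cluster_labels):
    """Order jobs round by round and assign new priorities q..1.

    priorities[j]     -- original priority of candidate job j (higher is better)
    cluster_labels[j] -- cluster of job j (assumed output of a prior
                         Euclidean-distance clustering step, not shown here)
    """
    if len(priorities) != len(cluster_labels):
        raise ValueError("one cluster label is required per job")

    clusters = defaultdict(list)
    for job, label in enumerate(cluster_labels):
        clusters[label].append(job)

    selected = []  # plays the role of the list of selected jobs
    while clusters:
        # Take the highest-priority remaining job from each cluster.
        picks = [max(members, key=lambda j: priorities[j])
                 for members in clusters.values()]
        # Sort this round's picks by priority before appending them.
        picks.sort(key=lambda j: priorities[j], reverse=True)
        selected.extend(picks)
        # Remove the selected jobs and eliminate empty clusters.
        for job in picks:
            clusters[cluster_labels[job]].remove(job)
        clusters = {label: m for label, m in clusters.items() if m}

    # First job in the list gets priority q, the last job gets priority 1.
    q = len(selected)
    new_priorities = {job: q - rank for rank, job in enumerate(selected)}
    return selected, new_priorities
```

Under these assumptions, consecutive entries of the returned list come from different clusters where possible, which corresponds to the quoted statement that jobs of similar priority are spread across the domain space.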

Claims 3, 4, 10, 11, 17, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Tamano in view of Chen and Zhang and Campos and Kotthaus and further in view of Koch et al. (US 10,360,517 B2, hereinafter "Koch").

Regarding claim 3, Tamano/Chen/Zhang/Campos/Kotthaus teaches the method of claim 1, however the combination fails to explicitly teach further including caching the runtime metrics, the runtime metrics including at least a model size and an input dataset associated with the training dataset.
Koch teaches: further including caching the runtime metrics (“In an operation 622 (Fig. 6A), the results (i.e., runtime metrics) are stored in evaluation cache 316 and in model data 318 in association with the set of hyperparameter values.” [pg. 40; col 32, lines 1-3]), the runtime metrics including at least a model size and an input dataset associated with the training dataset. (“Evaluation cache 316, model data 318, and selected model data 320 are created from results (i.e., runtime metrics) generated by worker system 106.” [pg. 27; col 5, lines 8-10])
Tamano, Chen, Zhang, Campos, Kotthaus and Koch are all in the same field of endeavor of optimizing machine learning jobs and thus are analogous. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Kotthaus discloses a job scheduling approach aimed at increasing parallel optimization through efficient resource utilization. Zhang discloses an efficient communication architecture for GPU clusters. Campos discloses training strategies on a distributed GPU cluster. Koch discloses a distributed computing system that caches results which include model data and selected model data. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Tamano’s/Chen’s/Zhang’s/Campos’/Kotthaus’ optimization algorithms to cache runtime metrics which include model data and selected model data as taught by Koch to further improve optimization in machine learning jobs.

Regarding claim 4, Tamano/Chen/Zhang/Campos/Kotthaus/Koch teaches the method of claim 3, where Tamano further teaches: further including, pursuant to identifying the candidate jobs: collecting a physical memory size of each of the plurality of accelerator devices (“Assuming that we have a 40-GB data set, we observe the number of MapReduce jobs each node handles and the data size each node required to read for the jobs.” [pg. 62; col 1, lines 16-18]); grouping job requests according to at least one of a model parameter, the model size, and the input dataset (Fig 2.); and using the model size and input dataset to compute the memory footprint for each training iteration (“Table I summarizes how much data each node reads, how many jobs each node executes, and how much computation each node requires for each partitioning pattern. The amount of computation is the product of data size and number of jobs.” [pg. 62; col 1, lines 24-28]).
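For illustration only, the per-node quantities that the quoted Tamano passage attributes to Table I can be sketched as follows. This is a minimal sketch, not a reproduction of Table I: the function name `per_node_load` is hypothetical, and it is an assumption that each group reads the whole data set once, split evenly across the nodes of that group, and that jobs are divided evenly among groups. Only the rule that computation is the product of data size and number of jobs is taken from the quoted text.

```python
def per_node_load(dataset_gb, total_jobs, total_nodes, num_groups):
    """Per-node data read, job count, and computation for one partitioning pattern.

    Assumptions (not from the reference): even split of nodes and jobs across
    groups, and each group reads the full data set once, divided evenly
    among the nodes of that group.
    """
    if num_groups <= 0 or total_nodes % num_groups or total_jobs % num_groups:
        raise ValueError("nodes and jobs must divide evenly into the groups")

    nodes_per_group = total_nodes // num_groups
    jobs_per_group = total_jobs // num_groups
    data_read_gb = dataset_gb / nodes_per_group

    return {
        "data_read_gb": data_read_gb,
        "jobs": jobs_per_group,
        # Quoted rule: computation is the product of data size and number of jobs.
        "computation": data_read_gb * jobs_per_group,
    }
```

With the figures quoted from Tamano (a 40-GB data set, forty jobs, and two groups of ten nodes), `per_node_load(40, 40, 20, 2)` gives 4 GB read and twenty jobs per node under the stated assumptions.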

Regarding claim 10, Tamano/Chen/Zhang/Campos/Kotthaus teaches the system of claim 8, however the combination fails to explicitly teach wherein the processor caches the runtime metrics, the runtime metrics including at least a model size and an input dataset associated with the training dataset.
Koch teaches: wherein the processor caches the runtime metrics (“In an operation 622 (Fig. 6A), the results (i.e., runtime metrics) are stored in evaluation cache 316 and in model data 318 in association with the set of hyperparameter values.” [pg. 40; col 32, lines 1-3]), the runtime metrics including at least a model size and an input dataset associated with the training dataset. (“Evaluation cache 316, model data 318, and selected model data 320 are created from results (i.e., runtime metrics) generated by worker system 106.” [pg. 27; col 5, lines 8-10])


Regarding claim 11, Tamano/Chen/Zhang/Campos/Kotthaus/Koch teaches the system of claim 10, where Tamano further teaches: wherein the processor, pursuant to identifying the candidate jobs: collects a physical memory size of each of the plurality of accelerator devices (“Assuming that we have a 40-GB data set, we observe the number of MapReduce jobs each node handles and the data size each node required to read for the jobs.” [pg. 62; col 1, lines 16-18]); groups job requests according to at least one of a model parameter, the model size, and the input dataset (Fig 2.); and uses the model size and input dataset to compute a memory footprint for each training iteration (“Table I summarizes how much data each node reads, how many jobs each node executes, and how much computation each node requires for each partitioning pattern. The amount of computation is the product of data size and number of jobs.” [pg. 62; col 1, lines 24-28]).

Regarding claim 17, Tamano/Chen/Zhang/Campos/Kotthaus teaches the computer program product of claim 15, however the combination fails to explicitly teach further including an executable portion that caches the runtime metrics, the runtime metrics including at least a model size and an input dataset associated with the training dataset.
Koch teaches: further including an executable portion that caches the runtime metrics (“In an operation 622 (Fig. 6A), the results (i.e., runtime metrics) are stored in evaluation cache 316 and in model data 318 in association with the set of hyperparameter values.” [pg. 40; col 32, lines 1-3]), the runtime metrics including at least a model size and an input dataset associated with the training dataset. (“Evaluation cache 316, model data 318, and selected model data 320 are created from results (i.e., runtime metrics) generated by worker system 106.” [pg. 27; col 5, lines 8-10])
Tamano, Chen, Zhang, Campos, Kotthaus and Koch are all in the same field of endeavor of optimizing machine learning jobs and thus are analogous. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Kotthaus discloses a job scheduling approach aimed at increasing parallel optimization through efficient resource utilization. Zhang 

Regarding claim 18, Tamano/Chen/Zhang/Campos/Kotthaus/Koch teaches the computer program product of claim 17, where Tamano further teaches: further including an executable portion that, pursuant to identifying the candidate jobs: collects a physical memory size of each of the plurality of accelerator devices (“Assuming that we have a 40-GB data set, we observe the number of MapReduce jobs each node handles and the data size each node required to read for the jobs.” [pg. 62; col 1, lines 16-18]); groups job requests according to at least one of a model parameter, the model size, and the input dataset (Fig 2.); and uses the model size and input dataset to compute a memory footprint for each training iteration (“Table I summarizes how much data each node reads, how many jobs each node executes, and how much computation each node requires for each partitioning pattern. The amount of computation is the product of data size and number of jobs.” [pg. 62; col 1, lines 24-28]).

Claims 5, 12, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Tamano in view of Chen, Zhang, Campos, Kotthaus, and Koch and further in view of Panda et al. ("PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce", hereinafter "Panda").

Regarding claim 5, Tamano/Chen/Zhang/Campos/Kotthaus/Koch teaches: The method of claim 4, however the combination fails to explicitly teach wherein grouping the job groups further includes grouping the candidate jobs in a tree structure, the tree structure organized based on the input dataset and the model size.  
Panda teaches wherein grouping the job groups further includes grouping the candidate jobs in a tree structure (“The Controller constructs a tree using a set of MapReduce jobs, each of which builds different parts of the tree. At any point, the model file contains the entire tree constructed so far.” [pg. 4; col 1, lines 6-10]), the tree structure organized based on the input dataset and the model size. (“Each MapReduce job takes as input a set of nodes (N), the training data set (D*), and the current state of the model (M). The Controller schedules two types of MapReduce jobs.” [Fig. 1; pg. 4; col 1, lines 27-29])
Tamano, Chen, Zhang, Campos, Kotthaus, Koch, and Panda are all in the same field of endeavor of optimizing machine learning jobs and thus are analogous. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Zhang discloses an efficient communication architecture for GPU clusters. Campos discloses training strategies on a distributed 

Regarding claim 12, Tamano/Chen/Zhang/Campos/Kotthaus/Koch teaches The system of claim 11, however the combination fails to explicitly teach wherein grouping the job groups further includes grouping the candidate jobs in a tree structure, the tree structure organized based on the input dataset and the model size.  
Panda teaches wherein grouping the job groups further includes grouping the candidate jobs in a tree structure (“The Controller constructs a tree using a set of MapReduce jobs, each of which builds different parts of the tree. At any point, the model file contains the entire tree constructed so far.” [pg. 4; col 1, lines 6-10]), the tree structure organized based on the input dataset and the model size. (“Each MapReduce job takes as input a set of nodes (N), the training data set (D*), and the current state of the model (M). The Controller schedules two types of MapReduce jobs.” [Fig. 1; pg. 4; col 1, lines 27-29])
Tamano, Chen, Zhang, Campos, Kotthaus, Koch, and Panda are all in the same field of endeavor of optimizing machine learning jobs and thus are analogous. Tamano 

Regarding claim 19, Tamano/Chen/Zhang/Campos/Kotthaus/Koch teaches The computer program product of claim 18, however the combination fails to explicitly teach wherein grouping the job groups further includes grouping the candidate jobs in a tree structure, the tree structure organized based on the input dataset and the model size.  
Panda teaches wherein grouping the job groups further includes grouping the candidate jobs in a tree structure (“The Controller constructs a tree using a set of MapReduce jobs, each of which builds different parts of the tree. At any point, the model file contains the entire tree constructed so far.” [pg. 4; col 1, lines 6-10]), the tree structure organized based on the input dataset and the model size. (“Each MapReduce job takes as input a set of nodes (N), the training data set (D*), and the current state of the model (M). The Controller schedules two types of MapReduce jobs.” [Fig. 1; pg. 4; col 1, lines 27-29])
Tamano, Chen, Zhang, Campos, Kotthaus, Koch, and Panda are all in the same field of endeavor of optimizing machine learning jobs and thus are analogous. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Zhang discloses an efficient communication architecture for GPU clusters. Campos discloses training strategies on a distributed GPU cluster. Kotthaus discloses a job scheduling approach aimed at increasing parallel optimization through efficient resource utilization. Koch discloses a distributed computing system that caches results which include model data and selected model data. Panda discloses using job scheduling and regression tree models. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the optimization algorithms of Tamano, Chen, Zhang, Campos, Kotthaus and the distributed computing system taught by Koch with the classification and regression tree models of Panda to further improve the efficiency of the claimed distributed computing system. 

Claims 6, 7, 13, 14, 20, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Tamano in view of Chen, Zhang, Campos, Kotthaus and further in view of Panda.

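Regarding claim 6, Tamano/Chen/Zhang/Campos/Kotthaus teaches the method of claim 1, however the combination fails to explicitly teach further including performing the merging of the job groups within an execution engine upon receiving a merge request triggered by a scheduler.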
Panda teaches further including performing the merging of the job groups within an execution engine upon receiving a merge request triggered by a scheduler. (“At the heart of PLANET is the Controller, a single machine that initiates, schedules and controls the entire tree induction process. The Controller has access to a compute cluster on which it schedules MapReduce jobs.” [pg. 4; col 1, lines 1-4])
Tamano, Chen, Zhang, Campos, Kotthaus, and Panda are all in the same field of endeavor of optimizing machine learning jobs. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Zhang discloses an efficient communication architecture for GPU clusters. Campos discloses training strategies on a distributed GPU cluster. Kotthaus discloses a job scheduling approach aimed at increasing parallel optimization by efficient resource utilization. Panda discloses using job scheduling and regression tree models. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the machine learning algorithms of Tamano/Chen/Zhang/Campos/Kotthaus with the job scheduling and the classification and regression tree models of Panda to further improve the efficiency of the claimed distributed computing system. One would be motivated to use a scheduler to reduce repetitive jobs.

Regarding claim 7, the combination of Tamano, Chen, Zhang, Campos, Kotthaus, and Panda teaches the method of claim 6, where Panda further teaches wherein performing the merging, by the execution engine, further includes optimizing a model graph associated with the job groups including computing the merge request associated with the model graph to determine a cost of overall memory consumption (“In other words, we disabled the optimization to construct trees entirely in memory and limited forward scheduling to 1 MapReduce in order to evaluate the performance of the algorithm in a constrained (e.g. shared cluster) environment.” Fig. 3, Fig. 4 further shows results relating to running time and data size [pg. 9; col 2, lines 13-17]).
Tamano, Chen, Zhang, Campos, Kotthaus, and Panda are all in the same field of endeavor of optimizing machine learning jobs. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Zhang discloses an efficient communication architecture for GPU clusters. Campos discloses training strategies on a distributed GPU cluster. Kotthaus discloses a job scheduling approach aimed at increasing parallel optimization by efficient resource utilization. Panda discloses using job scheduling and regression tree models. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the machine learning algorithms of Tamano/Chen/Zhang/Campos/Kotthaus with the job scheduling and the classification and regression tree models of Panda to further improve the efficiency of the claimed distributed computing system. One would be motivated to use a scheduler to reduce repetitive jobs.

Regarding claim 13, Tamano/Chen/Zhang/Campos/Kotthaus teaches the system of claim 8, however the combination fails to explicitly teach wherein the processor performs the merging of the job groups within an execution engine upon receiving a merge request triggered by a scheduler.
Panda teaches wherein the processor performs the merging of the job groups within an execution engine upon receiving a merge request triggered by a scheduler. (“At the heart of PLANET is the Controller, a single machine that initiates, schedules and controls the entire tree induction process. The Controller has access to a compute cluster on which it schedules MapReduce jobs.” [pg. 4; col 1, lines 1-4])
Tamano, Chen, Zhang, Campos, Kotthaus, and Panda are all in the same field of endeavor of optimizing machine learning jobs. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Zhang discloses an efficient communication architecture for GPU clusters. Campos discloses training strategies on a distributed GPU cluster. Kotthaus discloses a job scheduling approach aimed at increasing parallel optimization by efficient resource utilization. Panda discloses using job scheduling and regression tree models. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the machine learning algorithms of Tamano/Chen/Zhang/Campos/Kotthaus with the job scheduling and the classification and regression tree models of Panda to further improve the efficiency of the claimed distributed computing system. One would be motivated to use a scheduler to reduce repetitive jobs.
Regarding claim 14, Tamano/Chen/Zhang/Campos/Kotthaus/Panda teaches the system of claim 13, where Panda further teaches wherein performing the merging, by the execution engine, further includes optimizing a model graph associated with the job groups including computing the merge request associated with the model graph to determine a cost of overall memory consumption (“In other words, we disabled the optimization to construct trees entirely in memory and limited forward scheduling to 1 MapReduce in order to evaluate the performance of the algorithm in a constrained (e.g. shared cluster) environment.” Fig. 3, Fig. 4 further shows results relating to running time and data size [pg. 9; col 2, lines 13-17]).
Tamano, Chen, Zhang, Campos, Kotthaus, and Panda are all in the same field of endeavor of optimizing machine learning jobs. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Zhang discloses an efficient communication architecture for GPU clusters. Campos discloses training strategies on a distributed GPU cluster. Kotthaus discloses a job scheduling approach aimed at increasing parallel optimization by efficient resource utilization. Panda discloses using job scheduling and regression tree models. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the machine learning algorithms of Tamano/Chen/Zhang/Campos/Kotthaus with the job scheduling and the classification and regression tree models of Panda to further improve the efficiency of the claimed distributed computing system. One would be motivated to use a scheduler to reduce repetitive jobs.
Regarding claim 20, Tamano/Chen/Zhang/Campos/Kotthaus teaches the computer program product of claim 15, however the combination fails to explicitly teach further including an executable portion that performs the merging of the job groups within an execution engine upon receiving a merge request triggered by a scheduler.
Panda teaches: further including an executable portion that performs the merging of the job groups within an execution engine upon receiving a merge request triggered by a scheduler. (“At the heart of PLANET is the Controller, a single machine that initiates, schedules and controls the entire tree induction process. The Controller has access to a compute cluster on which it schedules MapReduce jobs.” [pg. 4; col 1, lines 1-4])
Tamano, Chen, Zhang, Campos, Kotthaus, and Panda are all in the same field of endeavor of optimizing machine learning jobs. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Zhang discloses an efficient communication architecture for GPU clusters. Campos discloses training strategies on a distributed GPU cluster. Kotthaus discloses a job scheduling approach aimed at increasing parallel optimization by efficient resource utilization. Panda discloses using job scheduling and regression tree models. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the machine learning algorithms of Tamano/Chen/Zhang/Campos/Kotthaus with the job scheduling and the classification and regression tree models of Panda to further improve the efficiency of the claimed distributed computing system. One would be motivated to use a scheduler to reduce repetitive jobs.
Regarding claim 21, Tamano/Chen/Zhang/Campos/Kotthaus/Panda teaches the computer program product of claim 20, where Panda further teaches wherein performing the merging, by the execution engine, further includes optimizing a model graph associated with the job groups including computing the merge request associated with the model graph to determine a cost of overall memory consumption (“In other words, we disabled the optimization to construct trees entirely in memory and limited forward scheduling to 1 MapReduce in order to evaluate the performance of the algorithm in a constrained (e.g. shared cluster) environment.” Fig. 3, Fig. 4 further shows results relating to running time and data size [pg. 9; col 2, lines 13-17]).
Tamano, Chen, Zhang, Campos, Kotthaus, and Panda are all in the same field of endeavor of optimizing machine learning jobs. Tamano discloses a machine learning algorithm that collects results, groups jobs into job groups, and merges the job groups. Chen discloses optimization methods to improve the performance of MapReduce on multicore. Zhang discloses an efficient communication architecture for GPU clusters. Campos discloses training strategies on a distributed GPU cluster. Kotthaus discloses a job scheduling approach aimed at increasing parallel optimization by efficient resource utilization. Panda discloses using job scheduling and regression tree models. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the machine learning algorithms of Tamano/Chen/Zhang/Campos/Kotthaus with the job scheduling and the classification and regression tree models of Panda to further improve the efficiency of the claimed distributed computing system. One would be motivated to use a scheduler to reduce repetitive jobs.
Response to Arguments
Applicant's arguments filed 12/21/2021 have been fully considered but they are not persuasive. 

Applicant’s remarks on pg. 12, asserting that the cited prior art of Tamano and Chen fails to teach “computing a memory footprint for each training iteration,” have been considered but are not persuasive. Chen teaches this particular limitation (“The increment of memory consumption on Ostrich is less and more steady, since the Input Buffer and Intermediate Buffer are allocated in the first iteration and reused among the rest of the iterations” [see pg. 3:19, § 7.3.2. Memory Footprint]). Furthermore, the newly amended limitations of independent claims 1, 8, and 15 are now taught by the newly cited references of Zhang and Campos; therefore, applicant’s arguments regarding those particular limitations and the previously cited prior art are moot. Please see the updated 103 rejection above.

Applicant’s arguments with respect to the rejections of the dependent claims have been fully considered but they are not persuasive as they rely upon the allowability of the independent claims.

Conclusion
Applicant's amendment necessitated the new grounds of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL H HOANG whose telephone number is (571)272-8491. The examiner can normally be reached Mon-Fri 8:30AM-4:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and 



/M.H.H./Examiner, Art Unit 2122

/ERIC NILSSON/Primary Examiner, Art Unit 2122