DETAILED ACTION
Claims 1-20 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDSs) submitted on 08/13/2020, 01/26/2021, and 11/22/2021 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Specification
The title of the invention is not descriptive.  A new title is required that is clearly indicative of the invention to which the claims are directed. 

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-10 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
The following claim language is unclear:
As per claim 1, lines 20-21 recite “control the communication interface to transmit learning data related to the plurality of tasks to the determined GPUs”. It is unclear from the context of the claim what constitutes “control the communication interface”. For example, is the transmission of data throttled based on a particular criterion? Or is the limitation intended to cover only the transmission of learning data to the tasks? For examination purposes, the examiner interprets the limitation as transmission of learning data.
Claims 2-10 depend on claim 1 and fail to cure the deficiencies set forth above for claim 1. Therefore, they are rejected under the same rationale.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6, 8, 11-16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Fong et al. (US 2018/0276044 A1) in view of Zhao et al. (US 10,325,343 B1), and further in view of Seelam et al. (US 2020/0092392 A1).

Fong was cited in the IDS.

Regarding claim 1, Fong teaches the invention substantially as claimed including an electronic apparatus ([0078] apparatus (systems)) comprising: 
a communication interface configured to communicate with a plurality of external servers comprising a plurality of graphics processing units (GPUs) ([0021] GPUs on different servers via interconnect fabric such as Ethernet or InfiniBand.); 
a memory comprising at least one instruction ([0017] a memory 28 having instructions stored in a storage system to perform the steps); and 
a processor configured to control the electronic apparatus by executing the at least one instruction, wherein the processor is further configured to ([0079] These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.): 
receive, values of bandwidths of a plurality of GPU pairs, respectively, into which the plurality of GPUs are combined, and information on whether the plurality of GPUs are used ([0004] The resource selection would select resources with connection topology best matching the communication patterns of the workload. [0028] In step 103, basing on the scheduling requirements of an incoming workload from step 102, a set of feasible resources is created for the workload. The feasible resources are based on static resource information of CPU, memory, GPU and topology. [0030] That is, resource capability, capacity and topological configuration are all considered. [0037] FIGS. 4A-G depict a GPU and CPU architecture utilizing an NVLink connection technology such that the GPUs are interconnected to each other and the connection links between the GPU-GPU/CPU-GPU are each 80 GB/s as shown in FIG. 4A… GPUs are connected in pairs with the Nvlink technology. (i.e., topology) [0002] bandwidth for communication, including data transfer, between any pair of CPU and GPUs, or between any pair of GPUs, are topology-dependent and not identical.), 
based on an input job being received, identify a number of GPUs and a bandwidth value that are required for performing the input job ([0026] The workload may specify explicit requirements, for example, a minimum or maximum number of virtual machines and containers (i.e., container(s) is a generic term for both virtual machines and containers), a minimum or maximum of CPU cores per container, a minimum or maximum of GPUs per container, a desired amount of memory for containers and/or for the GPUs, an hardware architecture type, an operating system type, and a GPU type, and GPU-GPU communication (e.g., yes/no or explicit specification of the bandwidth required).), 
determine GPUs among the plurality of GPUs to perform the plurality of tasks based on the values of the bandwidths of the plurality of GPU pairs, the received information on whether the plurality of GPUs are used, and the number of GPUs and the bandwidth value that are required for performing the plurality of tasks ([0004] analyzing a resource scheduling requirement including its communication patterns for processes of a workload, creating feasible resources based on static resource information of the resources for the processes of the workload, and selecting an available resource of the feasible resources to assign the workload based on the resource scheduling requirement. The resource selection would select resources with connection topology best matching the communication patterns of the workload. [0027] In another case, the workload has large data exchange between GPUs, then the scheduling preference would be having two GPUs with NVLink connection with high bandwidth).

Fong does not expressly teach receive, through the communication interface from the plurality of external servers, values of bandwidths of a plurality of GPU pairs;
an input job related to machine learning a bandwidth value that are required for performing a plurality of tasks included in the input job;
determine GPUs among the plurality of GPUs to perform the plurality of tasks, and 
control the communication interface to transmit learning data related to the plurality of tasks to the determined GPUs.

However, Zhao teaches a method for allocating service requests to clusters of GPUs in GPU server nodes. Further, Zhao teaches receive, through the communication interface from the plurality of external servers, values of bandwidths of a plurality of GPU pairs (Fig. 8, step 704; Col. 2, line 54 through Col. 3, line 4: The client systems 110 and server cluster 120 are operatively connected over a communications network 130. The communications network 130 is configured to enable network communication between the client systems 110 and the server cluster 120, as well as to enable peer-to-peer network communication between the GPU servers 120-1, 120-2, . . . , 120-s of the server cluster 120. The computing system 100 further comprise a global GPU server allocation and scheduling system 140, which is configured to manage and schedule provisioning of multiple GPU resources over multiple GPU servers in the sever cluster 120 for a client system which requires access to a relatively large number of GPU devices which cannot be provisioned to the client system using GPU devices 124 on a single GPU server in the server cluster 120. Col. 11, lines 13-31: The topology detection and scoring module 232 implements methods that are configured to (i) detect the hardware elements (and properties) (e.g., GPUs, network adapters (IB, RoCE, IPoIB, Ethernet) and the hardware interconnect topology (e.g., PCIe, NVLink, other internal interconnection bus/link technologies, etc.), and (ii) generate a topology performance metrics table that is stored in the data store of performance metric tables 240. 
The topology detection and scoring module 232 would detect the hardware environment and interconnect topology for a given GPU server node, and generate a performance metrics table which includes performance metrics (e.g., priority scores) for the detected hardware environment and interconnect topology, and then store the performance metrics table in the data store 240 for subsequent access and use in GPU mapping/re-balancing operations. Col. 12, line 27 through Col. 13, line 15);

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Zhao with the teachings of Fong to allocate tasks to external GPU server nodes. The modification would have been motivated by the desire of provisioning resources to workloads of clients which require access to a relatively large number of GPU devices which cannot be provisioned in a single GPU server.

Zhao discusses that machine learning workloads have a parallel nature, which reasonably suggests a plurality of tasks included in the input job. However, Fong and Zhao do not expressly teach an input job related to machine learning a bandwidth value that are required for performing a plurality of tasks included in the input job;
determine GPUs among the plurality of GPUs to perform the plurality of tasks, and 
control the communication interface to transmit learning data related to the plurality of tasks to the determined GPUs.

However, Seelam teaches an input job related to machine learning a bandwidth value that are required for performing a plurality of tasks included in the input job ([0016] Training a DL model to reach a desired accuracy involves experimenting with many hyper parameters, and running the training operation many times over the same dataset. When training DL models, it becomes desirable to keep graphical processing unit (“GPU”) utilization at approximately 100%. For example, currently, distributed deep learning applications use highly parallel GPU hardware to process large amounts of data and the data has to be quickly read from storage and fed to the GPUs. The success of this approach is based on keeping GPUs busy with data to be processed for almost 100% of the time. However, the performance actually extracted out of each GPU is directly dependent on the input/output (“I/O”) bandwidth of the storage solution used to access the dataset. Thus, the storage bandwidth is critical to ensure the GPUs receive data to process at the rate the GPUs can consume the data. [0060] The distributed cache may be used to feed the GPUs with data providing near local storage I/O bandwidth. While deploying a DL distributed job, a number of nodes may be designated to cache the dataset from remote storage. Training jobs may be deployed preferably on those nodes. From the user perspective this reduces the training/inference time even if data is being sourced from a remote location.);
determine GPUs among the plurality of GPUs to perform the plurality of tasks ([0067] When a job is submitted for scheduling to the DL job scheduler 414, the cache microservice 412 and the cache controller 416 make decisions on which of the nodes will receive the job allocation. The decision depends on a number of factor such as, for example, job requirements, compute capacity at the nodes (e.g., GPU/CPUs/memory) and storage capacity at the nodes. [0079] The job is deployed with an indication of the selected dataset that is to be used and information about how many nodes the job requires [0080]), and 
control the communication interface to transmit learning data related to the plurality of tasks to the determined GPUs ([0067] Depending on the decision, the DL job scheduler 414, the cache microservice 412 can decide to cache the data in the global data store 420 and/or into the distributed cache in the distributed data cache 424 in order to speed up data access. Additionally, the cache microservice 412 can decide (i.e., control) to cache the data only on a storage of a subset of the compute nodes such as, for example, the same set of nodes responsible for the job allocation. [0069] As illustrated, a cache may be created on a subset of the nodes such as, for example, N1, N2, N3 for dataset 1. A data locality aware operation may be performed for jobs that access dataset 1 such that the jobs may have access to the cached data from N1, N2, N3 regardless of where the jobs are placed in a cluster of nodes rather than from the source data store 510, which may include also include dataset 1, dataset 2, dataset 3.).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Seelam with the teachings of Fong and Zhao to allocate tasks to external GPU node clusters and to control how training data is allocated to the nodes to ensure high I/O bandwidth. The modification would have been motivated by the desire to ensure that a GPU has DL data locally to perform the DL job, as the training code goes over the dataset multiple times (e.g., 10's to 100's of times), so the data is accessed many times, potentially in different orders each time. See at least Seelam’s [0015].

Regarding claim 2, Fong teaches wherein the processor is further configured to: 
based on the received information on whether the plurality of GPUs are used, identify available GPUs among the plurality of GPUs, combine the available GPUs into available GPU pairs based on the number of GPUs and the bandwidth value that are required for performing the plurality of tasks, identify a plurality of groups by including into each of the plurality of groups at least one available GPU pair among the available GPU pairs, determine a group of the plurality of groups based on a predefined policy, and determine the GPUs included in the determined group as the GPUs to perform the plurality of tasks, respectively ([0004] analyzing a resource scheduling requirement including its communication patterns for processes of a workload, creating feasible resources based on static resource information of the resources for the processes of the workload, and selecting an available resource of the feasible resources to assign the workload based on the resource scheduling requirement. The resource selection would select resources with connection topology best matching the communication patterns of the workload. [0019]; [0023] Therefore, when scheduling/placing a workload on many available GPUs, GPUs with closer affinity should be selected to execute the workload, to achieve better performance and bring least impact to other concurrently running workloads. [0027] For example, a workload can be received in step 101 that requires execution using one container of two GPUs with frequent but small communication packets exchange for synchronization. In step 102, such requirements of the workload are analyzed and determined the scheduling preference in having two GPUs on the same PCIe socket. In another case, the workload has large data exchange between GPUs, then the scheduling preference would be having two GPUs with NVLink connection with high bandwidth. 
[0028] In step 103, basing on the scheduling requirements of an incoming workload from step 102, a set of feasible resources is created for the workload. The feasible resources are based on static resource information of CPU, memory, GPU and topology. For example, if the incoming workload requiring 2 GPUs per CPU, only CPU with two more GPUs will be included in the feasible set. Another example, a workload may have specific kernel implementation 150 for a specific type of GPUs, only CPUs with that type of GPUs would be placed into the feasible set. Yet another example, if incoming workload requiring communication bandwidth between GPU- and CPU is beyond PCI's bandwidth, then only CPUs with NVLink would be feasible resources 170.).

Regarding claim 3, Fong teaches wherein the processor is further configured to determine the group based on the predefined policy by identifying groups among the plurality of groups of which the at least one available GPU pair included therein has a value of the bandwidth greater than or equal to the bandwidth value required for performing the plurality of tasks, and determining the GPUs included in the group comprising the at least one available GPU pair having a smallest bandwidth value among the groups as the GPUs to perform the plurality of tasks ([0028] In step 103, basing on the scheduling requirements of an incoming workload from step 102, a set of feasible resources is created for the workload. The feasible resources are based on static resource information of CPU, memory, GPU and topology. For example, if the incoming workload requiring 2 GPUs per CPU, only CPU with two more GPUs will be included in the feasible set. Another example, a workload may have specific kernel implementation 150 for a specific type of GPUs, only CPUs with that type of GPUs would be placed into the feasible set. Yet another example, if incoming workload requiring communication bandwidth between GPU- and CPU is beyond PCI's bandwidth, then only CPUs with NVLink would be feasible resources 170. The feasible resources does not contain the dynamic resource usage incurred by other workloads and thus may not have the capacity available for the incoming workload. That is, the feasible resources has the containers or can be used to create containers with the requested compute, storage and network resources. For example, feasible resources are the ones with containers or have capability that containers can be created to have data transfer bandwidth greater than nnGB/s (NVLink capable). For another example, feasible resources are the ones with containers or that containers can be created to have the memory capacity, if specified. 
[0030] The GPU preference assignment rules are applied based on the GPU topology for performance gain. For example, workload characteristics are applied to order the priority of containers and GPUs. Similarly, GPUs with NVLink would be given high priority for workload with high rate/load in cross-GPU data transfer, GPUs on the same socket would prefer over those on different sockets for workload with high rate/load cross GPUs data transfer, and GPUs on different socket would be preferred for workload with low rate/load of cross-GPU data transfer and heavy CPU-GPU data transfer, and CPU and GPUs on different sockets would not be preferred for workload with heavy CPU-GPU data transfer. That is, resource capability, capacity and topological configuration are all considered. If enough resources meet the requirements, one set of resources is selected using service level agreement, if specified, for assignment and then the workload is dispatched for execution. If not enough containers with available resources, a suboptimal resource plan is generated and return to the user for further judgment, if necessary. [0033] For example, a GPU to CPU architecture including PCIe sockets is depicted in which memory (i.e., DDR4, DDR3.) is connected with a CPU which in turn is connected to the PCIe socket of which GPU1 and GPU2 are connected therewith with a 16 GB/s bandwidth connection. Further, a GPU to CPU architecture is depicted including an NVLink connection having a bandwidth of 80 GB/s, as of current generation of technology, in which the CPU is connected to each GPU and the GPUs are connected to each other directly. [0037]).
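For illustration only, the selection policy recited in claim 3 can be sketched as follows. This is a minimal sketch and not a characterization of Fong's disclosure or of applicant's implementation. The function name `select_group`, the GPU identifiers, and the reading that every pair in a group must meet the required bandwidth are assumptions; the 16 GB/s and 80 GB/s values are the PCIe and NVLink figures quoted from Fong [0033].

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]   # hypothetical GPU identifiers, e.g. ("gpu0", "gpu1")
Group = List[Pair]       # a group holds at least one available GPU pair


def select_group(groups: List[Group],
                 pair_bandwidth: Dict[Pair, float],
                 required_bw: float) -> Optional[Group]:
    """Return the feasible group whose slowest pair has the smallest bandwidth.

    Assumption: a group is feasible when every pair in it offers at least
    the required bandwidth. Returns None when no group is feasible.
    """
    feasible = [g for g in groups
                if g and all(pair_bandwidth[p] >= required_bw for p in g)]
    if not feasible:
        return None
    # Best fit: keep faster links free for jobs that actually need them.
    return min(feasible, key=lambda g: min(pair_bandwidth[p] for p in g))


# Bandwidths in GB/s: NVLink pair (80) and PCIe pair (16), per Fong [0033].
bandwidth = {("gpu0", "gpu1"): 80.0, ("gpu2", "gpu3"): 16.0}
groups = [[("gpu0", "gpu1")], [("gpu2", "gpu3")]]

print(select_group(groups, bandwidth, 10.0))   # [('gpu2', 'gpu3')]
print(select_group(groups, bandwidth, 20.0))   # [('gpu0', 'gpu1')]
print(select_group(groups, bandwidth, 100.0))  # None
```

Under these assumptions, a job requiring 10 GB/s is placed on the PCIe pair even though the NVLink pair is also feasible, which is the "smallest bandwidth value among the groups" behavior recited in the claim.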

Regarding claim 4, Fong teaches wherein the processor is further configured to: 
in an absence of a group of which the bandwidth value of the at least one available GPU pair is greater than or equal to the bandwidth value required for performing the plurality of tasks among the plurality of groups, determine the GPUs included in a group of which a value of the bandwidth of the at least one available GPU pair included therein is the greatest among the plurality of groups, as the GPUs to perform the plurality of tasks ([0030] The GPU preference assignment rules are applied based on the GPU topology for performance gain. For example, workload characteristics are applied to order the priority of containers and GPUs. Similarly, GPUs with NVLink would be given high priority for workload with high rate/load in cross-GPU data transfer, GPUs on the same socket would prefer over those on different sockets for workload with high rate/load cross GPUs data transfer, and GPUs on different socket would be preferred for workload with low rate/load of cross-GPU data transfer and heavy CPU-GPU data transfer, and CPU and GPUs on different sockets would not be preferred for workload with heavy CPU-GPU data transfer. That is, resource capability, capacity and topological configuration are all considered. If enough resources meet the requirements, one set of resources is selected using service level agreement, if specified, for assignment and then the workload is dispatched for execution. If not enough containers with available resources, a suboptimal resource plan is generated and return to the user for further judgment, if necessary.).

Regarding claim 5, Fong teaches wherein the processor is further configured to: 
identify candidate GPU pairs of which the values of the bandwidths are greater than or equal to the identified bandwidth value required for performing the plurality of tasks, among the plurality of GPU pairs, based on an order of the values of the bandwidths and the number of GPUs required for performing the input job, sequentially identify GPU pairs having smallest bandwidth values among the candidate GPU pairs, and determine the GPUs included in the GPU pairs as the GPUs to perform the plurality of tasks, respectively ([0019-20]; [0028] if incoming workload requiring communication bandwidth between GPU- and CPU is beyond PCI's bandwidth, then only CPUs with NVLink would be feasible resources 170. The feasible resources does not contain the dynamic resource usage incurred by other workloads and thus may not have the capacity available for the incoming workload. That is, the feasible resources has the containers or can be used to create containers with the requested compute, storage and network resources. For example, feasible resources are the ones with containers or have capability that containers can be created to have data transfer bandwidth greater than nnGB/s (NVLink capable). [0033] For example, a GPU to CPU architecture including PCIe sockets is depicted in which memory (i.e., DDR4, DDR3.) is connected with a CPU which in turn is connected to the PCIe socket of which GPU1 and GPU2 are connected therewith with a 16 GB/s bandwidth connection. Further, a GPU to CPU architecture is depicted including an NVLink connection having a bandwidth of 80 GB/s, as of current generation of technology, in which the CPU is connected to each GPU and the GPUs are connected to each other directly. [0030] For example, workload characteristics are applied to order the priority of containers and GPUs. 
Similarly, GPUs with NVLink would be given high priority for workload with high rate/load in cross-GPU data transfer, GPUs on the same socket would prefer over those on different sockets for workload with high rate/load cross GPUs data transfer, and GPUs on different socket would be preferred for workload with low rate/load of cross-GPU data transfer and heavy CPU-GPU data transfer, and CPU and GPUs on different sockets would not be preferred for workload with heavy CPU-GPU data transfer.).
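For illustration only, the sequential pair selection recited in claim 5 can be sketched as follows. This is a minimal sketch under stated assumptions, not Fong's method or applicant's implementation. The function name `pick_gpus_by_pairs`, the GPU identifiers, and the 8 GB/s link are hypothetical; the 16 GB/s and 80 GB/s values are the figures quoted from Fong [0033].

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # hypothetical GPU identifiers


def pick_gpus_by_pairs(pair_bandwidth: Dict[Pair, float],
                       required_bw: float,
                       num_gpus: int) -> Optional[List[str]]:
    """Walk candidate pairs in ascending bandwidth order until enough GPUs
    are collected. Returns None if the candidates cannot supply num_gpus."""
    # Candidate pairs: bandwidth greater than or equal to the required value.
    candidates = sorted(
        (bw, pair) for pair, bw in pair_bandwidth.items() if bw >= required_bw
    )
    chosen: List[str] = []
    for _bw, pair in candidates:
        for gpu in pair:
            if gpu not in chosen and len(chosen) < num_gpus:
                chosen.append(gpu)
        if len(chosen) == num_gpus:
            return chosen
    return None


# Bandwidths in GB/s; the 8 GB/s cross link is a hypothetical value.
links = {("g0", "g1"): 80.0, ("g2", "g3"): 16.0, ("g0", "g2"): 8.0}

print(pick_gpus_by_pairs(links, 10.0, 2))   # ['g2', 'g3']
print(pick_gpus_by_pairs(links, 10.0, 4))   # ['g2', 'g3', 'g0', 'g1']
print(pick_gpus_by_pairs(links, 100.0, 2))  # None
```

The 8 GB/s link is filtered out before sorting because it falls below the required value, so it never contributes GPUs to the result.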
 
Regarding claim 6, Fong teaches wherein the plurality of GPU pairs include a first GPU pair including a first plurality of GPUs included in an external server and a second GPU pair including a second plurality of GPUs included in external servers different from each other, among the plurality of external servers (Fig. 2B shows an NVLink connection of 80 GB/s between the pair GPU1-2 and another pair using a PCIe socket, in different servers connected through InfiniBand. See at least [0021]).

Regarding claim 8, Fong teaches wherein the memory is configured to store information on a value of the bandwidth of a GPU pair being used among the plurality of GPU pairs, and the processor is further configured to: determine the GPUs to perform the input job among the plurality of GPU pairs by excluding the GPU pair being used (Fig. 1 Static GPUs information: GPU #, memory capacity, topology; [0028] In step 103, basing on the scheduling requirements of an incoming workload from step 102, a set of feasible resources is created for the workload. The feasible resources are based on static resource information of CPU, memory, GPU and topology. For example, if the incoming workload requiring 2 GPUs per CPU, only CPU with two more GPUs will be included in the feasible set. Another example, a workload may have specific kernel implementation 150 for a specific type of GPUs, only CPUs with that type of GPUs would be placed into the feasible set.).

Regarding claim 11, it is a method type claim having similar limitations as claim 1 above. Therefore, it is rejected under the same rationale above.

Regarding claim 12, it is a method type claim having similar limitations as claim 2 above. Therefore, it is rejected under the same rationale above.

Regarding claim 13, it is a method type claim having similar limitations as claim 3 above. Therefore, it is rejected under the same rationale above.

Regarding claim 14, it is a method type claim having similar limitations as claim 4 above. Therefore, it is rejected under the same rationale above.

Regarding claim 15, it is a method type claim having similar limitations as claim 5 above. Therefore, it is rejected under the same rationale above.

Regarding claim 16, it is a method type claim having similar limitations as claim 6 above. Therefore, it is rejected under the same rationale above.

Regarding claim 18, it is a method type claim having similar limitations as claim 8 above. Therefore, it is rejected under the same rationale above.

Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Fong, Zhao, and Seelam, as applied to claim 1, and further in view of You (US 2016/0306673 A1).

You was cited in the IDS.

Regarding claim 7, neither Fong nor Zhao expressly teaches wherein the processor is further configured to determine available bandwidth based on the receiving the values of the bandwidths of the plurality of GPU pairs measured by the plurality of external servers, wherein the values of the bandwidths are periodically received from the plurality of external servers through the communication interface.
However, You teaches wherein the processor is further configured to determine available bandwidth based on the receiving the values of the bandwidths of the plurality of GPU pairs measured by the plurality of external servers, wherein the values of the bandwidths are periodically received from the plurality of external servers through the communication interface ([0065] Alternatively, a daemon program can be installed at all the processing nodes to measure and to obtain the component metrics thereof in a periodical manner. Finally, the obtained metric information is transmitted by the backend operating framework or the monitoring module or the daemon program to the provisioning apparatus. By obtaining component metric information of the pool of the processing nodes, the status of the usage and availability of the resources on the processing nodes capable of executing a target task can be better determined, contributing to the higher degree of accuracy and speed of resource provisioning. [0078] One exemplary approach to determine critical characteristics for a target task is to compute a relative ratio of each task characteristic and respective component metric information corresponding to all the processing nodes in the pool of processing nodes. In particular, an average of each task characteristic during a time period can be computed by analyzing each task characteristics of the target task. For example, during a period of time, an average of CPU usage of CPUtask, an average of memory bandwidth usage MEMtask, an average of storage I/O speed IOPStask, an average of storage I/O bandwidth usage IOtask, and an average of network bandwidth usage NETtask can be obtained for a target task. Next, the maximum values of the component metric information of all the processing nodes in the pool managed by the provisioning apparatus are ranked for each component respectively to obtain a maximum value and a minimum value for each component.
For example, with a provisioning apparatus managing 5 processing nodes, a maximum CPU capacity of CPUnode (in the unit of Glops), maximum memory bandwidth of MEMnode (in the unit of MB/s), maximum storage I/O speed of IOPSnode (in the unit of IOPS), maximum storage I/O bandwidth of IOnode (in the unit of MB/s), and maximum network bandwidth of NETnode (in the unit of MB/s) are obtained for each of the 5 processing nodes respectively.).
	
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of You with the teachings of Fong and Zhao to periodically determine characteristics of the pool of processing nodes. The modification would have been motivated by the desire of improving resource provisioning to nodes in a resource pool that better match the target tasks.

Regarding claim 17, it is a method type claim having similar limitations as claim 7 above. Therefore, it is rejected under the same rationale above.

Claims 9-10 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Fong, Zhao, and Seelam, as applied to claim 1, and further in view of Zhao et al. (US 2019/0324856 A1), hereinafter “Savic” (second named inventor).

Regarding claim 9, Fong and Zhao do not expressly teach wherein a plurality of local gradients acquired by inputting the learning data to each of the determined GPUs are mutually synchronized, the plurality of local gradients indicating a degree of change of parameters included in the determined GPUs based on the learning data.
	However, Savic teaches wherein a plurality of local gradients acquired by inputting the learning data to each of the determined GPUs are mutually synchronized, the plurality of local gradients indicating a degree of change of parameters included in the determined GPUs based on the learning data ([0024] In some embodiments, the DL frameworks supported by the deep learning compute module 140 implement a stochastic gradient descent (SGD) process to train deep neural network models. With a SGD training process, an error gradient with respect to each model parameter of a given DL model is calculated using multiple iterations of a backpropagation process. A backpropagation comprises a sequence of three cycles including (i) a forward process, (ii) a backward process, and (iii) a weight update process, wherein the backpropagation process is repeated for many iterations until a convergence criterion is met. Each iteration of the backpropagation process is performed on a mini-batch of data, wherein a mini-batch of data comprises a subset (or portion) of a total dataset of model training data. For each iteration, a mini-batch of data (e.g., M training samples) is read from disk to host memory. The mini-batch of data is transferred from host (CPU) memory to device memory (e.g., GPU memory 164). The GPU kernel functions are instantiated and launched to execute the backpropagation process.
[0027] In data parallel training, for each iteration of a backpropagation process, a mini-batch of data samples is partitioned and evenly distributed to a plurality of GPU devices (workers), which can reside on the same or different server machines. With data parallelism, each GPU device has access to a complete copy of a given deep learning model, but for each iteration, each GPU device is only assigned a subset of the data samples of a current mini-batch for the given iteration. For each iteration, each GPU launches kernel functions to perform a forward propagation of the DL network model using its respective subset of data samples, followed by an error backpropagation process to compute the gradient of the loss with respect to the DL model parameters. The GPU devices perform the forward and backward propagation operations on their respective subsets of data in parallel. The gradient parameters computed by all GPU devices for the given iteration are then aggregated/synchronized (e.g. averaged) and the averaged gradient parameters are pushed to each GPU device so that each GPU device can perform a parameter update process using the averaged gradient parameters to update the model parameters of the DL network model.).
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Savic with the teachings of Fong and Zhao to allow for synchronization of operations. The modification would have been motivated by the desire to improve the training of deep learning models.
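
For illustration only, the data-parallel gradient synchronization described in Savic [0027] can be sketched as follows. This sketch is not taken from Savic, Fong, Zhao, or the application. It assumes a hypothetical one-parameter model (y_hat = w * x) with a mean squared error loss, simulates each GPU as a list entry, and uses hypothetical helper names (local_gradient, all_reduce_mean, train). It also assumes the mini-batch divides evenly among the workers.

```python
# Illustrative sketch only. All names here are hypothetical and do not come
# from the cited references. Each "GPU" is simulated as one list entry.

def local_gradient(w, shard):
    """Gradient of the mean squared error of y_hat = w * x over one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(grads):
    """Synchronize the local gradients by averaging them (stand-in for an all-reduce)."""
    return sum(grads) / len(grads)


def train(batch, num_workers=2, lr=0.05, steps=200):
    # Every simulated GPU holds a complete copy of the one-parameter model.
    weights = [0.0] * num_workers
    # Partition the mini-batch evenly across the workers (assumes divisibility).
    shards = [batch[i::num_workers] for i in range(num_workers)]
    for _ in range(steps):
        # Each worker computes a local gradient on its own shard.
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        # The local gradients are averaged, and every worker receives the average.
        averaged = all_reduce_mean(grads)
        # Each worker applies the same update, so the replicas stay identical.
        weights = [w - lr * averaged for w in weights]
    return weights


# Training data generated from y = 3 * x.
batch = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
weights = train(batch)
```

Because every replica applies the same averaged gradient, the copies of the parameter remain identical after each iteration, which is the synchronization behavior the cited passage describes.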

Regarding claim 10, Savic teaches wherein the parameters included in the GPUs are trained based on the synchronized plurality of local gradients ([0024] With a SGD training process, an error gradient with respect to each model parameter of a given DL model is calculated using multiple iterations of a backpropagation process. A backpropagation comprises a sequence of three cycles including (i) a forward process, (ii) a backward process, and (iii) a weight update process, wherein the backpropagation process is repeated for many iterations until a convergence criterion is met).

Regarding claim 19, it is a method claim having limitations similar to those of claim 9 above. Therefore, it is rejected under the same rationale.

Regarding claim 20, it is a method claim having limitations similar to those of claim 10 above. Therefore, it is rejected under the same rationale.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JORGE A CHU JOY-DAVILA whose telephone number is (571)270-0692. The examiner can normally be reached Monday-Friday, 9:00am-5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Meng-Ai T An, can be reached at (571) 272-3756. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/JORGE A CHU JOY-DAVILA/
Primary Examiner, Art Unit 2195