Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are pending for examination in this application.
Information Disclosure Statement
No information disclosure statement (IDS) was submitted with this application.
Drawings
The drawings were received on 11/16/2017.  These drawings are objected to because of the following informalities:
Paragraph [0034] of the specification: “bus 16” is not found in the drawings.
Specification
The disclosure is objected to because of the following informalities:
Paragraph [0022]: “values some layers” should possibly be “values of some layers”.
Paragraph [0029]: “manages” should be “manage”.
Paragraph [0029]: “The GPS memory” should be “The GPU memory”.
Claim Interpretation
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification.
The following terms and phrases found in the claims are interpreted as follows:
“Coordination service module” refers to a module that exchanges model parameter updates between computers, according to paragraph [0016], and computes network cost, according to paragraph [0004].
“Hybrid communication strategy” refers to two separate communication strategies for transmitting data between computers, according to paragraph [0016].
“Operator graph layers” refers to layers containing model parameters and intermediate values required by the neural network, according to paragraph [0020].
“GPU-CPU Synchronization” refers to data transfer between CPU and GPU memories and between computers of the neural network, according to paragraph [0021].
“Network cost” and “Transmission Scheme” refer to schemes A and B, according to Paragraph [0025].
“Input datum” refers to the loss function and the first derivative of the loss function, according to Paragraph [0028].
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-2, 4, 6-13, 15 and 17-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Henggang Cui (“GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server”), hereinafter “Cui”.
Regarding Claim 1, Cui teaches a distributed computing system comprising a computer comprising: (Cui teaches in [Abstract] on page 1: “This paper describes a new parameter server, called GeePS, that supports scalable deep learning across GPUs distributed among multiple machines, overcoming these obstacles.”)
	a graphics processing unit (GPU) memory; (Cui teaches under [1. Introduction]: “GeePS overcomes this apparent limitation by assuming control over memory management and placement, and carefully orchestrating data movement between CPU and GPU memory based on its observation of the access patterns at each layer of the neural network.”)
	a central processing unit (CPU) memory comprising a Key-Value Store (KVS) module; (Figure 5 on page 4 shows CPU memory comprising Parameter Server. Cui teaches on page 4 under [2.3 Scaling ML with a parameter server]: “All states shared among application workers (i.e., the model parameters being learned) is kept in distributed shared memory implemented as a specialized key-value store called a “parameter server”.”)
	an execution engine module configured to run a deep learning (DL) program to create a plurality of operator graph layers in the graphics processing unit memory; (Cui teaches on page 2 under [1. Introduction]: “Experiments also confirm the efficacy of GeePS’s support for data-parallel training of very large neural networks on GPUs. For example, results are shown for a 20 GB neural network (5.6 billion connections) trained on GPUs with only 5 GB memory, with the larger CPU memory holding most of the parameters and intermediate layer state most of the time.”) See Interpretation section for “operator graph layer”
	a client library module configured to create a GPU-CPU synchronization (GCS) module for each of the plurality of operator graph layers; (Cui teaches on page 6 under [Swapping data to CPU memory when it does not fit]: “The parameter server client library will be able to manage all the GPU memory on a machine, if the application keeps all its local data in the parameter server and uses the PS-managed buffers…When the application Reads parameter data that is stored in CPU memory, the parameter server will perform this read using CPU cores and copy the data from CPU memory to an allocated GPU buffer…”)
	a coordination service module configured to compute network cost of a first and a second communication scheme and select, based on the network cost, one of the first and second communication scheme for transmitting data associated with one of the plurality of operator graph layers from a corresponding GCS module; (Cui teaches on page 4 under [2.3]: “To reduce remote communication, a parameter server system includes client-side caches that serve most operations locally.” Cui teaches on page 8 under [4.2] two modes of communications: “Data movement across machines” and “Data movement inside a machine”. Cui teaches under [1. Introduction]: “GeePS overcomes this apparent limitation by assuming control over memory management and placement, and carefully orchestrating data movement between CPU and GPU memory based on its observation of the access patterns at each layer of the neural network.”)
	and wherein the client library module is further configured to initiate a data transfer from the GCS module using the selected communication scheme. (Cui teaches under [1. Introduction]: “GeePS overcomes this apparent limitation by assuming control over memory management and placement, and carefully orchestrating data movement between CPU and GPU memory based on its observation of the access patterns at each layer of the neural network.” Cui teaches on page 4 under [2.3]: “To reduce remote communication, a parameter server system includes client-side caches that serve most operations locally.” Cui teaches on page 8 under [4.2] two modes of communications: “Data movement across machines” and “Data movement inside a machine”)
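For illustration only, the cost-based selection recited in the coordination service limitation of claim 1 can be sketched as below. The cost formulas, the payload sizes, and every name used (`LayerTraffic`, `cost_direct`, `cost_via_store`, `select_scheme`) are hypothetical assumptions made for this sketch; they are not taken from the application or from Cui.

```python
from dataclasses import dataclass


@dataclass
class LayerTraffic:
    """Hypothetical per-layer payload sizes, in bytes."""
    name: str
    direct_bytes: int  # payload if sent directly to each peer (assumed)
    store_bytes: int   # payload if routed through the key-value store (assumed)


def cost_direct(layer, num_machines):
    # Assumed model: one copy of the payload goes to every other machine.
    return layer.direct_bytes * (num_machines - 1)


def cost_via_store(layer, num_machines):
    # Assumed model: one push to the store and one pull back from it.
    # num_machines is unused but kept so both cost functions share a signature.
    return 2 * layer.store_bytes


def select_scheme(layer, num_machines):
    """Return the name of the cheaper scheme for this layer."""
    direct = cost_direct(layer, num_machines)
    via_store = cost_via_store(layer, num_machines)
    return "direct" if direct < via_store else "store"


layers = [
    LayerTraffic("fc", direct_bytes=100, store_bytes=1000),
    LayerTraffic("conv", direct_bytes=100, store_bytes=100),
]
choices = {layer.name: select_scheme(layer, num_machines=4) for layer in layers}
```

Under these assumed numbers, the two layers end up on different schemes, which is the per-layer behavior the limitation describes.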
	Regarding Claim 2, Cui teaches the system of claim 1, wherein the first communication scheme comprises broadcasting data associated with the one of the plurality of operator graph layers from the corresponding GCS module to one or more GCS modules directly. (Cui teaches under [3.1]: “Perhaps counter-intuitively, this change is not about reducing data movement between CPU memory and GPU memory—the updates from the local GPU must still be moved to CPU memory to be sent to other machines, and the updates from other machines must still be moved from CPU memory to GPU memory.” Data movement from GPU to CPU can be thought of as synchronization.) Refer to paragraph [0021]
	Regarding Claim 4, Cui teaches the system of claim 1, wherein the second communication scheme comprises using the KVS module as an intermediary to transmit data from one GCS to another GCS. (Cui teaches on page 4 under [2.3]: “Figure 4 illustrates the basic parameter server architecture. All states shared among application workers (i.e., the model parameters being learned) is kept in distributed shared memory implemented as a specialized key-value store called a “parameter server”.”)
	Regarding Claim 6, Cui teaches the system of claim 1, wherein the client library module is further configured to create send and receive ports for each of the plurality of GCS modules. (Cui teaches on page 8 under [Data movement across machines]: “GeePS performs communication across machines asynchronously with three types of background threads: keeper threads manage the parameter data in parameter server shards; pusher threads send parameter data updates from parameter caches to parameter server shards, by sending messages to keeper threads; puller threads receive parameter data from parameter server shards to parameter caches, by receiving messages from keeper threads…The pusher/puller threads perform data movement between CPU memory and GPU memory using CUDA APIs.”)
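For illustration only, the keeper, pusher, and puller roles quoted from Cui can be sketched with standard Python threads and queues. This is a simplified, single-process assumption made for this sketch: the function names and the queue-based messaging are hypothetical, and it does not reproduce GeePS code or its CUDA-based data movement.

```python
import queue
import threading


def keeper(inbox, shard, outbox, expected):
    # Keeper: owns the shard and applies each update message it receives.
    for _ in range(expected):
        key, delta = inbox.get()
        shard[key] = shard.get(key, 0.0) + delta
    # After all updates, send the current shard contents to the puller.
    outbox.put(dict(shard))


def pusher(updates, inbox):
    # Pusher: sends locally produced updates to the keeper.
    for item in updates:
        inbox.put(item)


def puller(outbox, cache):
    # Puller: receives refreshed parameter data into the local cache.
    cache.update(outbox.get())


def run_round(updates):
    inbox, outbox = queue.Queue(), queue.Queue()
    shard, cache = {}, {}
    threads = [
        threading.Thread(target=keeper, args=(inbox, shard, outbox, len(updates))),
        threading.Thread(target=pusher, args=(updates, inbox)),
        threading.Thread(target=puller, args=(outbox, cache)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cache
```

Calling `run_round([("w", 1.0), ("w", 2.0), ("b", 0.5)])` pushes three updates through the keeper and returns the cache as seen by the puller.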
	Regarding Claim 7, Cui teaches the system of claim 1, wherein the execution engine module running the DL program comprising populating two operator graphs' model parameters and intermediate values according to input datum. (Cui teaches on page 3 under [2.1]: “A common way of training a neural network is to use a stochastic gradient descent (SGD) algorithm. For each training image, a forward pass is done to activate all nodes using the current weights…Then, the error terms are propagated back through the network with a backward pass. During the backward pass, the gradient of each connection weight is calculated from the error terms and the retained node values, and the connection weights (i.e., the model parameters) are updated using these gradients.”)
	Regarding Claim 8, Cui teaches the system of claim 7, wherein the execution engine module is configured to populate the model parameters and intermediate values according to back propagation algorithm. (Cui teaches on page 3 under [2.1]: “A common way of training a neural network is to use a stochastic gradient descent (SGD) algorithm. For each training image, a forward pass is done to activate all nodes using the current weights…Then, the error terms are propagated back through the network with a backward pass. During the backward pass, the gradient of each connection weight is calculated from the error terms and the retained node values, and the connection weights (i.e., the model parameters) are updated using these gradients.”)
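For illustration only, the forward pass, backward pass, and weight update described in the quoted passage of Cui can be sketched for a single connection weight. The one-weight model, the squared-error loss, and the name `sgd_step` are assumptions made for this sketch and do not come from Cui or the application.

```python
def sgd_step(w, x, target, lr):
    """One stochastic gradient descent step for the toy model y = w * x."""
    # Forward pass: compute the node value from the current weight.
    y = w * x
    # Error term at the output for the assumed loss 0.5 * (y - target) ** 2.
    error = y - target
    # Backward pass: the gradient of the loss with respect to the weight
    # uses the error term and the retained input value.
    grad = error * x
    # Update the connection weight (the model parameter) with the gradient.
    return w - lr * grad


# Repeating the step moves the weight toward the value that fits the example.
w = 1.0
for _ in range(50):
    w = sgd_step(w, x=2.0, target=6.0, lr=0.1)
```

With input 2.0 and target 6.0, the weight that fits the example exactly is 3.0, and the repeated steps approach it.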
	Regarding Claim 9, Cui teaches the system of claim 1, wherein at least one of the GCS modules is in communication with the KVS module. (Cui teaches on page 4 under [2.3]: “Figure 4 illustrates the basic parameter server architecture. All states shared among application workers (i.e., the model parameters being learned) is kept in distributed shared memory implemented as a specialized key-value store called a “parameter server”.” The “distributed shared memory” is the KVS module, which implies that it is connected to all GCS modules because data moves between the CPU and GPU.)
	Regarding Claim 10, Cui teaches the system of claim 1, wherein at least one of the GCS modules is configured to receive data from another GCS module directly. (Cui teaches under [3.1]: “Perhaps counter-intuitively, this change is not about reducing data movement between CPU memory and GPU memory—the updates from the local GPU must still be moved to CPU memory to be sent to other machines, and the updates from other machines must still be moved from CPU memory to GPU memory.” Data movement from GPU to CPU can be thought of as synchronization.)
	Regarding Claim 11, Cui teaches the system of claim 1, wherein at least one of the GCS modules is configured to receive data from a KVS module. (Cui teaches on page 4 under [2.3]: “Figure 4 illustrates the basic parameter server architecture. All states shared among application workers (i.e., the model parameters being learned) is kept in distributed shared memory implemented as a specialized key-value store called a “parameter server”.” The “distributed shared memory” is the KVS module, which implies that it is connected to all GCS modules because data moves between the CPU and GPU of the application workers. Further in the same paragraph, Cui teaches: “An ML application’s workers process their assigned input data and use simple Read and Update methods to fetch or apply a delta to parameter values, leaving the communication and consistency issues to the parameter server.” This implies that the workers receive data from the KVS module/parameter server by fetching it.)
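For illustration only, the Read and Update methods mentioned in the quoted passage of Cui can be sketched as a minimal in-memory key-value store. The class name `ParameterStore`, the method signatures, and the default value of 0.0 for unseen keys are assumptions made for this sketch; the actual GeePS interface is not reproduced here.

```python
class ParameterStore:
    """Toy key-value store holding shared parameter values (hypothetical)."""

    def __init__(self):
        self._values = {}

    def read(self, key):
        # Fetch the current value; unseen keys are assumed to start at 0.0.
        return self._values.get(key, 0.0)

    def update(self, key, delta):
        # Apply a delta to the stored value rather than overwriting it.
        self._values[key] = self._values.get(key, 0.0) + delta


store = ParameterStore()
store.update("layer1/w0", 0.5)   # one worker applies a delta
store.update("layer1/w0", -0.2)  # another worker applies its own delta
current = store.read("layer1/w0")
```

A worker in this sketch receives data from the store only by calling `read`, which mirrors the fetching behavior relied upon above.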
	Regarding Claim 12, Cui teaches a method of running a Deep Learning (DL) program comprising: parsing DL program code; (Cui teaches in [Abstract] on page 1: “This paper describes a new parameter server, called GeePS, that supports scalable deep learning across GPUs distributed among multiple machines, overcoming these obstacles.”)
	constructing a plurality of operator graph layers in a GPU memory; (Cui teaches under [1. Introduction]: “GeePS overcomes this apparent limitation by assuming control over memory management and placement, and carefully orchestrating data movement between CPU and GPU memory based on its observation of the access patterns at each layer of the neural network.”)
	creating a GCS module for each of the operator graph layers; (Cui teaches under [1. Introduction]: “GeePS overcomes this apparent limitation by assuming control over memory management and placement, and carefully orchestrating data movement between CPU and GPU memory based on its observation of the access patterns at each layer of the neural network.”)
	activating a KVS module in a CPU memory; (Figure 5 on page 4 shows CPU memory comprising Parameter Server. Cui teaches on page 4 under [2.3 Scaling ML with a parameter server]: “All states shared among application workers (i.e., the model parameters being learned) is kept in distributed shared memory implemented as a specialized key-value store called a “parameter server”.”)
	computing the network cost of a first and a second communication schemes for transmitting data; (Cui teaches on page 4 under [2.3]: “To reduce remote communication, a parameter server system includes client-side caches that serve most operations locally.” Cui teaches on page 8 under [4.2] two modes of communications: “Data movement across machines” and “Data movement inside a machine”)
	for each GCS module, selecting one of the communication schemes based on the network cost; (Cui teaches under [1. Introduction]: “GeePS overcomes this apparent limitation by assuming control over memory management and placement, and carefully orchestrating data movement between CPU and GPU memory based on its observation of the access patterns at each layer of the neural network.” Cui teaches on page 4 under [2.3]: “To reduce remote communication, a parameter server system includes client-side caches that serve most operations locally.” Cui teaches on page 8 under [4.2] two modes of communications: “Data movement across machines” and “Data movement inside a machine”)
	and transmitting data from each GCS module using the selected communication scheme; (Cui teaches on page 4 under [2.3]: “To reduce remote communication, a parameter server system includes client-side caches that serve most operations locally.” Cui teaches on page 8 under [4.2] two modes of communications: “Data movement across machines” and “Data movement inside a machine”. Cui teaches under [1. Introduction]: “GeePS overcomes this apparent limitation by assuming control over memory management and placement, and carefully orchestrating data movement between CPU and GPU memory based on its observation of the access patterns at each layer of the neural network.”)
	wherein at least one GCS module uses the first communication scheme and at least one GCS module uses the second communication scheme. (Cui teaches on page 4 under [2.3]: “To reduce remote communication, a parameter server system includes client-side caches that serve most operations locally.” Cui teaches on page 8 under [4.2] two modes of communications: “Data movement across machines” and “Data movement inside a machine”. Cui teaches under [1. Introduction]: “GeePS overcomes this apparent limitation by assuming control over memory management and placement, and carefully orchestrating data movement between CPU and GPU memory based on its observation of the access patterns at each layer of the neural network.”)
	Regarding Claim 13, Cui teaches the method of claim 12, where transmitting data using the first communication scheme comprises broadcasting data associated with the one of the plurality of operator graph layers from the corresponding GCS module to one or more other GCS modules directly. (Cui teaches under [3.1]: “Perhaps counter-intuitively, this change is not about reducing data movement between CPU memory and GPU memory—the updates from the local GPU must still be moved to CPU memory to be sent to other machines, and the updates from other machines must still be moved from CPU memory to GPU memory.” Data movement from GPU to CPU can be thought of as synchronization.)
	Regarding Claim 15, Cui teaches the method of claim 12, wherein transmitting data using the second communication scheme comprises using the KVS module as an intermediary to transmit data from one GCS to another GCS. (Cui teaches on page 4 under [2.3]: “Figure 4 illustrates the basic parameter server architecture. All states shared among application workers (i.e., the model parameters being learned) is kept in distributed shared memory implemented as a specialized key-value store called a “parameter server”.”)
Regarding Claim 17, Cui teaches the method of claim 12, further comprising creating send and receive ports for each of the plurality of GCS modules. (Cui teaches on page 8 under [Data movement across machines]: “GeePS performs communication across machines asynchronously with three types of background threads: keeper threads manage the parameter data in parameter server shards; pusher threads send parameter data updates from parameter caches to parameter server shards, by sending messages to keeper threads; puller threads receive parameter data from parameter server shards to parameter caches, by receiving messages from keeper threads…The pusher/puller threads perform data movement between CPU memory and GPU memory using CUDA APIs.”)
Regarding Claim 18, Cui teaches the method of claim 12, wherein parsing the DL code comprises populating two operator graphs' model parameters and intermediate values according to input datum. (Cui teaches on page 3 under [2.1]: “A common way of training a neural network is to use a stochastic gradient descent (SGD) algorithm. For each training image, a forward pass is done to activate all nodes using the current weights…Then, the error terms are propagated back through the network with a backward pass. During the backward pass, the gradient of each connection weight is calculated from the error terms and the retained node values, and the connection weights (i.e., the model parameters) are updated using these gradients.”)
Regarding Claim 19, Cui teaches the method of claim 12, further comprising at least one of the GCS modules receiving data from another GCS module directly. (Cui teaches under [3.1]: “Perhaps counter-intuitively, this change is not about reducing data movement between CPU memory and GPU memory—the updates from the local GPU must still be moved to CPU memory to be sent to other machines, and the updates from other machines must still be moved from CPU memory to GPU memory.” Data movement from GPU to CPU can be thought of as synchronization.)
Regarding Claim 20, Cui teaches the method of claim 12, further comprising at least one of the GCS modules receiving data from a KVS module. (Cui teaches on page 4 under [2.3]: “Figure 4 illustrates the basic parameter server architecture. All states shared among application workers (i.e., the model parameters being learned) is kept in distributed shared memory implemented as a specialized key-value store called a “parameter server”.”)
Conclusion
The following prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
	Lustig, D. (2013, February 23). Reducing GPU offload latency via fine-grained CPU-GPU synchronization. ACM Digital Library. https://dl.acm.org/doi/10.1109/HPCA.2013.6522332
	Dean, J. (2012, December 3). Large scale distributed deep networks. ACM Digital Library. https://dl.acm.org/doi/10.5555/2999134.2999271
	Heller, M. (2016, December 14). MXNet review: Amazon's scalable deep learning. InfoWorld. https://www.infoworld.com/article/3149598/mxnet-review-amazons-scalable-deep-learning.html
	Colah. (2015, August 31). Calculus on computational graphs: Backpropagation. colah's blog. https://colah.github.io/posts/2015-08-Backprop/
	Hadjis, S. (2016, June 14). Omnivore: An optimizer for multi-device deep learning on CPUs and GPUs. arXiv.org. https://arxiv.org/abs/1606.04487v1
	Chen, T. (2015, December 3). MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv.org. https://arxiv.org/abs/1512.01274
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FRANCOIS A NDIAYE whose telephone number is (571)272-9952. The examiner can normally be reached on M-F 8:30AM-6:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached at (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/FRANCOIS A NDIAYE/Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124