Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 01/03/2022 has been entered.

Status of Claims
This action is in response to the amendments filed 01/03/2022. Claims 1, 10-11, and 20 have been amended, and claims 6-7 and 16-17 have been cancelled. Claims 1-5, 8-15, and 18-20 are currently pending.
	
Response to Arguments
Claims 6-7 and 16-17 have been cancelled; therefore, the rejections of claims 6-7 and 16-17 no longer stand.
Applicant’s amendments and arguments regarding the prior art rejection have been fully considered but they are not persuasive. Applicant argues on page 12 that Li does not suggest a differentiation between “a central BSP module” and a “parameter server (PS)”. In the interest of advancing prosecution, Examiner has brought in the Xing reference to more clearly 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-5, 8-9, 11-15, and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al* ("More Effective Synchronization Scheme in ML using Stale Parameters", herein Li), in view of Xing et al* (“Petuum – A New Platform for Distributed Machine Learning on Big Data”, herein Xing), in further view of Kadav et al** (“ASAP: Asynchronous Approximate Data-Parallel Computation”, herein Kadav).
*a copy of this document was provided with the non-final office action, therefore a copy has not been provided with this office action.
**this document was cited in the IDS filed on 07/01/2019, therefore a copy has not been provided with this office action.
Regarding claim 1, Li teaches a computer system, comprising:
a computer program product storing program code;
(Section 4, para. 2 recites “Each node contains 2x8-core 2.9GHz Intel(R) Xeon(R) CPUs and 8 NVIDIA Tesla K20M GPUs with 4799 MB memory (i.e. computer hardware configured to perform distributed training of a machine learning model). All cluster nodes are connected via two networks: 1000 MB Ethernet and 4 GB Infiniband. Our experiments use a different number of cluster nodes (varying from 1 to 4) with a different number (varying from 1 to 32) on the GPUs”) comprising:
a central bulk synchronous parallel (BSP) (Section III A, para. 4 recites “Bounded asynchronous parallel becomes bulk synchronous parallel when the staleness threshold S is configured to 0” (Examiner’s note: Li et al have titled their approach “bounded asynchronous parallel” but it is clear that their process uses the bulk synchronous parallel framework with additional features)) control module, at least two local BSP modules, and at least two machine learning modules, wherein each of the at least two machine learning modules is associated with exactly one of the at least two local BSP modules (Fig. 5 and Section III B, para. 1 recite “The implementation of BAP is based on Caffe-inspur, a Caffe framework that runs on cluster nodes through data parallel and communicates through MPI between workers. As shown in Fig. 5, Caffe-inspur initiates (N+1) processes, including one Parameter Server (PS) process and N Caffe processes. Here PS process is mainly responsible for distributing data to Caffe processes, communicating and updating the new parameters. Caffe process chiefly computes the gradients” (i.e. each Caffe process is a local BSP module that contains a ML module, initiating N+1 processes would result in multiple – or at least two – local BSP modules));
(Section II B, para. 1 recites “In essence, a Parameter Server (PS) is a distributed shared memory system that workers can easily share access to the global model parameters. It is easy to implement a PS architecture on a data parallel ML program. The execution of the gradient Δ() comes up separately on each worker over its data subsets. The aggregating of the gradients to model parameters comes up on the server” (i.e. each local BSP module corresponds to a segment of a shared memory));
wherein, for each local BSP module of the at least two local BSP modules, the central BSP control module is configured to instruct a given local BSP module to store, in a given shared memory corresponding to the given local BSP module, a given local model (Section II B, para. 1 recites “typically, a great deal of workers are deployed, each of them storing a subset of the data” (i.e. a local model). Section III B, para. 2 recites “After receiving data, each Caffe process proceeds computing the gradients and send them to the PS process. Benefit from that each Caffe process preserve a local copy of the global parameters, Caffe process can use the local stale parameters for computation in the next iteration, instead of waiting for the slowest Caffe process, as long as its staleness distance is no more than the staleness threshold BAP” (i.e. each local BSP module stores a local copy of the model in its local memory));
wherein, for each machine learning module of the at least two machine learning modules, a given machine learning module is configured to:
read, the given local model from the given shared memory corresponding to the given local BSP module and the given machine learning module in a given pair including the given (Section II B, para. 1 recites “a Parameter Server (PS) is a distributed shared memory system that workers can easily share access to the global model parameters. Typically, a great deal of workers are deployed, each of them storing a subset of the data”. Section III B, para. 2 recites “At the beginning, PS process starts to distribute data to Caffe processes, while Caffe processes are blocked for receiving data. After receiving data, each Caffe process proceeds computing the gradients and send them to the PS process” (i.e. the local model is read from the shared memory)),
compute a gradient based on the given local model (Section II B, para. 1 recites “The execution of the gradient comes up separately on each worker over its data subsets” (i.e. each local module computes a gradient)),
and aggregate the gradient based on the given local model into an aggregated gradient stored in the given shared memory (Section II B, para. 1 recites “The aggregating of the gradients to model parameters comes up on the server” (i.e. the gradients are aggregated in each local module));
and wherein the central BSP control module is further configured to instruct each of the at least two local BSP modules to periodically read out the aggregated gradient stored in the shared memory corresponding to the respective local BSP module (Section III A, para. 4 recites “Only after receiving all the workers’ gradients of this iteration, the parameter server updates the parameters. Then send the updated parameters to the workers, and the workers always have the same parameters” (i.e. the central module receives the aggregated gradient from each local module));
(Fig. 6 and pg. 760, right column, para. 4 recite “After receiving the gradient data (i.e. after reading out the aggregated gradient received from the local modules), communication thread checks whether it is the last worker that arrives at this round of iteration. If it is the last one, then signal the PS process updating thread to calculate parameters”), the central BSP control module is further configured to:
instruct each of the at least two local BSP modules to provide, to a parameter server (PS), the aggregated gradient for updating, in the PS, a PS model (Fig. 6 and pg. 760, right column, para. 4 recite “After receiving the gradient data (i.e. after reading out the aggregated gradient), communication thread checks whether it is the last worker that arrives at this round of iteration. If it is the last one, then signal the PS process updating thread to calculate parameters” (i.e. updating the parameter server model based on the aggregated gradients provided to the parameter server from the local models))
wherein each of the at least two local BSP modules is configured to download the updated PS model from the PS and to use the updated PS to update the given local model stored in the shared memory corresponding to the respective local BSP module (Fig. 6 and pg. 760, right column para. 5 recite “PS process updating thread computes the parameters after receiving all the gradients, and then sends the updated parameters to the parameter receiving thread in Caffe processes” (i.e. each local module downloads the updated parameter server model from the parameter server and uses it to update its local model)).
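For illustration only, the synchronous case described in the passages of Li cited above (the parameter server updates the parameters only after receiving all workers' gradients, then sends the updated parameters back to the workers) can be sketched as follows. This is a minimal sketch and is not taken from Li: the function name bsp_round, the scalar parameter, the squared-error loss, and the learning rate lr are all assumptions made for the example.

```python
# Illustrative sketch only; names and the loss function are hypothetical,
# not taken from Li.

def bsp_round(ps_param, worker_data, lr=0.1):
    """One bulk synchronous round over a single scalar parameter."""
    # Each worker holds a local copy of the global parameter.
    local_copies = [ps_param for _ in worker_data]
    # Each worker computes the gradient of 0.5 * (w - x)^2, averaged over
    # its own data subset, using its local copy.
    grads = [
        sum(w - x for x in data) / len(data)
        for w, data in zip(local_copies, worker_data)
    ]
    # The server updates only once every worker's gradient has arrived.
    avg_grad = sum(grads) / len(grads)
    new_param = ps_param - lr * avg_grad
    # The updated parameter is sent back, so all local copies agree again.
    return new_param, [new_param for _ in worker_data]


new_param, copies = bsp_round(0.0, [[1.0, 3.0], [2.0]], lr=0.5)
```

In this sketch every worker resumes from the same parameter value after the round, which corresponds to the statement quoted from Li that "the workers always have the same parameters".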

Xing teaches that updating the PS model is decoupled from computing the gradient based on the given local model (figs. 5 and 7, and section 3.2.2 para. 1 recite “The scheduler system (Fig. 7) enables model-parallelism, by allowing users to control which model parameters are updated by worker machines. This is performed through a user-defined scheduling function schedule() (corresponding to Sp(t-1)() ), which outputs a set of parameters for each worker. The scheduler sends the identities of these parameters to workers via the scheduling control channel (Fig. 5), while the actual parameter values are delivered through the parameter server system” (i.e. updating the PS model and computing the gradient are decoupled)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings by using the scheduling function from Xing to ensure that the processes of computing the gradient and updating parameter server models from Li are decoupled. Xing and Li are both directed to implementing distributed parallel machine learning, so one of ordinary skill would benefit by adding the scheduling functions from Xing into the BAP model from Li in order to have greater control over the ordering of model updates, as noted in Xing section 3 paragraph 1.
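For illustration only, a user-defined scheduling function of the kind Xing describes (one that outputs a set of parameters for each worker) can be sketched as follows. The round-robin policy, the argument names, and the return type are assumptions made for the example; Xing leaves the policy to the user, and nothing here is Petuum's actual interface.

```python
# Illustrative sketch only; the round-robin policy and all names are
# hypothetical and are not Petuum's API.

def schedule(param_ids, num_workers, t):
    """Return {worker index: list of parameter identities} for iteration t."""
    assignment = {w: [] for w in range(num_workers)}
    for i, pid in enumerate(param_ids):
        # Rotate the assignment by t so each worker sees different
        # parameters across iterations.
        assignment[(i + t) % num_workers].append(pid)
    return assignment


first = schedule(["a", "b", "c", "d"], 2, 0)
second = schedule(["a", "b", "c", "d"], 2, 1)
```

Only parameter identities are returned here; in the passage quoted from Xing the parameter values travel separately through the parameter server system.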
However, the combination of Li and Xing does not explicitly teach that the central BSP control module is further configured to send a notification to each of the at least two local BSP modules on availability of an updated PS model stored in the PS, and wherein each of the at least two local BSP modules is configured to download the updated PS model from the PS in 
Kadav teaches the central BSP control module is further configured to send a notification to each of the at least two local BSP modules on availability of an updated parameter server (PS) model stored in a PS (fig. 4 and pg. 7, para. 3-4 recite “In ASAP, with NOTIFY-ACK, the parallel workers compute and send their model parameters with notifications to other workers. They then proceed to wait to receive notifications from all its senders as defined by their node communication graphs as shown in figure 4. The wait operation counts the NOTIFY events and invokes the reduce when a worker has received notifications from all its senders as described by the node communication graph. Once all notifications have been received, it can perform a consistent reduce. After performing a reduce, the worker sends an ACK, indicating that the intermediate output in previous iteration has been consumed. Only when a worker receives an ACK for a previous send, indicating that the receiver has consumed the previously sent data, the worker may proceed to send the data for the next iteration” (i.e. the central module notifies the local module that updated parameters are available)); 
and wherein each of the at least two local BSP modules is configured to download the updated PS model from the PS in response to the notification from the central BSP control module and to use the updated PS to update the given local model stored in the shared memory corresponding to the respective local BSP module (fig. 4 and pg. 7, para. 3-4 recite “In ASAP, with NOTIFY-ACK, the parallel workers compute and send their model parameters with notifications to other workers. They then proceed to wait to receive notifications from all its senders as defined by their node communication graphs as shown in figure 4. The wait operation counts the NOTIFY events and invokes the reduce when a worker has received notifications from all its senders as described by the node communication graph. Once all notifications have been received, it can perform a consistent reduce. After performing a reduce, the worker sends an ACK, indicating that the intermediate output in previous iteration has been consumed. Only when a worker receives an ACK for a previous send, indicating that the receiver has consumed the previously sent data, the worker may proceed to send the data for the next iteration” (i.e. the local modules download a new model from the parameter server in response to receiving a notification from the central control module)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings by integrating the flag notifications from Kadav into the bounded asynchronous parallel processing system from Li. Implementing a notification flag would allow a user to clearly determine when the global and local modules are communicating, which could save the user time when determining how smoothly the overall operation is running.
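For illustration only, the NOTIFY-ACK exchange quoted from Kadav (a worker reduces once it has received notifications from all of its senders, then acknowledges, and a sender may not send again until its previous send has been acknowledged) can be sketched as follows. This is a single-process sketch under stated assumptions: the class name, the dictionary-based inbox, and the averaging reduce are hypothetical and are not taken from Kadav.

```python
# Illustrative sketch only; class and method names are hypothetical,
# not taken from Kadav.

class NotifyAckWorker:
    def __init__(self, name, senders):
        self.name = name
        self.senders = set(senders)  # names of workers this worker receives from
        self.inbox = {}              # sender name -> value received with a NOTIFY
        self.awaiting_ack = set()    # receivers that have not yet ACKed our last send

    def can_send(self):
        # A new send is allowed only after every previous send was ACKed.
        return not self.awaiting_ack

    def send(self, value, receivers):
        if not self.can_send():
            raise RuntimeError("previous send not yet acknowledged")
        for r in receivers:
            r.inbox[self.name] = value     # data delivered together with a NOTIFY
            self.awaiting_ack.add(r.name)

    def try_reduce(self, peers):
        # Reduce only when a NOTIFY has arrived from every sender.
        if not self.senders or set(self.inbox) != self.senders:
            return None
        result = sum(self.inbox.values()) / len(self.inbox)
        consumed = list(self.inbox)
        self.inbox.clear()
        for s in consumed:
            peers[s].awaiting_ack.discard(self.name)  # ACK: data was consumed
        return result


a = NotifyAckWorker("a", senders=["b"])
b = NotifyAckWorker("b", senders=["a"])
peers = {"a": a, "b": b}
a.send(2.0, [b])
blocked = not a.can_send()         # a must wait for b's ACK
reduced_b = b.try_reduce(peers)    # b has all notifications, reduces, ACKs a
```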
Regarding claim 2, the combination of Li, Xing, and Kadav teaches the computer system according to claim 1, wherein for each machine learning module of the at least two machine learning modules, the given machine learning module is further configured to:
(Li fig. 4 shows one time step, wherein worker 2 has completed three iterations (i.e. computed a plurality of gradients based on the local model));
and aggregate the plurality of gradients into the aggregated gradient stored in the shared memory corresponding to the respective local BSP module (Li fig. 4 shows at T1, the parameter server is updated with all three iterations from worker 2 (i.e. the aggregated plurality of gradients from the local model is aggregated into the global model)).
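For illustration only, computing a plurality of gradients against one local model and aggregating them into a single aggregated gradient can be sketched as follows. The function name, the scalar parameter, the squared-error loss, and the choice to sum rather than average are assumptions made for the example and are not taken from Li.

```python
# Illustrative sketch only; names and the loss function are hypothetical.

def accumulate_gradients(local_param, batches):
    """Sum the per-batch gradients computed against one (possibly stale) local parameter."""
    aggregated = 0.0
    for batch in batches:
        # Gradient of 0.5 * (w - x)^2 averaged over the batch.
        grad = sum(local_param - x for x in batch) / len(batch)
        aggregated += grad
    return aggregated


total = accumulate_gradients(0.0, [[1.0], [2.0], [3.0]])
```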
Regarding claim 3, the combination of Li, Xing, and Kadav teaches the computer system according to claim 1, wherein for each machine learning module of the at least two machine learning modules, the given machine learning module is further configured to:
obtain training data from the central BSP control module (Li fig. 6 shows the data distributing thread (i.e. the training data is stored in a central module));
and compute the gradient based on the given local model and the training data (Li fig. 6 shows the local gradient computing threads receiving the training data from the data distributing thread and computing the gradients).
Regarding claim 4, the combination of Li, Xing, and Kadav teaches the computer system according to claim 1, wherein for each machine learning module of the at least two machine learning modules, the given machine learning module is further configured to:
obtain training data pushed from the shared memory corresponding to the respective local BSP module (Li fig. 6 shows the data distributing thread pushing data out to the gradient computing threads (i.e. the local modules));
(Li fig. 6 shows the local gradient computing threads receiving the training data from the data distributing thread and computing the gradients).
Regarding claim 5, the combination of Li, Xing, and Kadav teaches the computer system according to claim 1, wherein for each machine learning module of the at least two machine learning modules, the given local BSP module is further configured to:
communicate with a parameter server (PS) in order to receive a PS model that is stored as the given local model (Li fig. 5 shows the MPI communication between the parameter server and the N+1 Caffe processes (i.e. the local models)).
Regarding claim 8, the combination of Li, Xing, and Kadav teaches the computer system according to claim 1, wherein:
each of the at least two machine learning modules is further configured to, in conjunction with storing in the given shared memory the aggregated gradient, set a gradient available flag (Li fig. 6 and pg. 760, right col. para. 3 recite “After getting data, each Caffe process proceeds computing the gradient and send it to PS process’s communication thread. Then it makes the decision to use the stale parameters or need to update parameters according to the flag 'IsUpdate'. If 'IsUpdate' equals to 1, then Caffe process will update parameters for the gradient computation in the next iteration and reduce the staleness distance” (i.e. the local module sets a flag to let the central module know a gradient is available to update));
and the central BSP control module is further configured to, in conjunction with periodically instructing each of the at least two local BSP modules to read out the aggregated (Li fig. 6 and pg. 760, right col. para. 3 recite “After getting data, each Caffe process proceeds to compute the gradient and send it to parameter server process’s communication thread. Then it makes the decision to use the stale parameters or need to update parameters according to the flag 'IsUpdate'” (i.e. the local module sends its computed gradient to the central module after receiving the flag to update the gradient)).
Regarding claim 9, the combination of Li, Xing, and Kadav teaches the computer system according to claim 1, wherein:
the central BSP control module is further configured to instruct each of the at least two local BSP modules, in conjunction with storing or updating the given local model in the shared memory corresponding to the respective local BSP module, to set a model available flag (Li fig. 6 and pg. 760, right column para. 3 recite “After getting data, each Caffe process proceeds computing the gradient and send it to PS process’s communication thread. Then it makes the decision to use the stale parameters or need to update parameters according to the flag 'IsUpdate'” (i.e. the local module determines its availability to update parameters via a flag));
and each of the at least two machine learning modules is further configured to read, from the shared memory corresponding to the respective local BSP, the given local model in response to determining that the model available flag is set (Li fig. 6 and pg. 760, right column para. 3 recite “If 'IsUpdate' equals to 1, then Caffe process will update parameters for the gradient computation in the next iteration and reduce the staleness distance” (i.e. the local module updates its parameters from the shared memory after receiving the flag notification)).
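For illustration only, the flag-controlled decision quoted from Li (use the stale local parameters, or update them when 'IsUpdate' equals 1) can be sketched as follows. The function name and arguments are hypothetical. Li states only that the staleness distance is reduced on update; resetting it to 0, and incrementing it by 1 otherwise, are assumptions made for the example.

```python
# Illustrative sketch only; the function and its bookkeeping are
# hypothetical, not Li's implementation.

def params_for_next_iteration(local_params, server_params, is_update, staleness):
    """Return (parameters to use next, new staleness distance)."""
    if is_update == 1:
        # Take the updated parameters; resetting the staleness distance
        # to 0 is an assumption (Li says only that it is reduced).
        return server_params, 0
    # Otherwise keep computing on the stale local copy; the increment
    # by 1 per iteration is likewise an assumption.
    return local_params, staleness + 1


refreshed = params_for_next_iteration([0.1], [0.3], is_update=1, staleness=2)
stale = params_for_next_iteration([0.1], [0.3], is_update=0, staleness=2)
```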
Claim 11 is a method claim and its limitation is included in claim 1. The only difference is that claim 11 requires a method for distributed training of a machine learning model (the abstract from Li recites a “Bounded Asynchronous Parallel (BAP) model of computation (i.e. a method) that allows computations using stale model parameters in order to reduce synchronization overheads”). Therefore, claim 11 is rejected for the same reasons as claim 1.
Claim 12 is a method claim and its limitation is included in claim 2. Claim 12 is rejected for the same reasons as claim 2.
Claim 13 is a method claim and its limitation is included in claim 3. Claim 13 is rejected for the same reasons as claim 3.
Claim 14 is a method claim and its limitation is included in claim 4. Claim 14 is rejected for the same reasons as claim 4.
Claim 15 is a method claim and its limitation is included in claim 5. Claim 15 is rejected for the same reasons as claim 5.
Claim 18 is a method claim and its limitation is included in claim 8. Claim 18 is rejected for the same reasons as claim 8.
Claim 19 is a method claim and its limitation is included in claim 9. Claim 19 is rejected for the same reasons as claim 9.
Claim 20 is a non-transitory computer program product claim and its limitation is included in claim 1. The only difference is that claim 20 requires a non-transitory computer program product storing program code for performing, when running on a computer, a method (Li section 4, para. 2 recites “Each node contains 2x8-core 2.9GHz Intel(R) Xeon(R) CPUs and 8 NVIDIA Tesla K20M GPUs with 4799 MB memory” (i.e. the computer program product is not simply directed to signals)). Therefore, claim 20 is rejected for the same reasons as claim 1.

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Li et al ("More Effective Synchronization Scheme in ML using Stale Parameters", herein Li) in view of Xing et al (“Petuum – A New Platform for Distributed Machine Learning on Big Data”, herein Xing), in further view of Kadav et al (“ASAP: Asynchronous Approximate Data-Parallel Computation”, herein Kadav), in further view of Wei et al** (“Managed Communication and Consistency for Fast Data-Parallel Iterative Analytics”, herein Wei).
**this document was cited in the IDS filed on 07/01/2019, therefore a copy has not been provided with this office action.
Regarding claim 10, the combination of Li, Xing, and Kadav teaches the computer system according to claim 1.
However, the combination of Li, Xing, and Kadav does not explicitly teach the central BSP control module is further configured to instruct each of the at least two local BSP modules to store, in the shared memory corresponding to the respective local BSP module, a global minimum clock calculated based on clock information obtained from each of the at least two machine learning modules; and each of the at least two machine learning modules is further configured to read, from the shared memory corresponding to the respective local BSP module, the global minimum clock, and interrupt, if a difference of a local clock of the given machine 
Wei teaches the central BSP control module is further configured to instruct each of the at least two local BSP modules to store, in the shared memory corresponding to the respective local BSP module, a global minimum clock calculated based on clock information obtained from each of the at least two machine learning modules (Fig. 2 and its caption recite “Updates that are generated in completed iterations (i.e. clock ticks) by other workers (blue) are highly likely visible as they are propagated at the end of each clock. Updates generated in incomplete iterations (white) are not visible as they are not yet communicated” (i.e. the global clock minimum is based on the information sent from each local module));
and each of the at least two machine learning modules is further configured to read, from the shared memory corresponding to the respective local BSP module, the global minimum clock, and interrupt, if a difference of a local clock of the given machine learning module and the global minimum clock exceeds a predefined threshold, its computation until the global minimum clock advances and a difference of the local clock of the given machine learning module and the global minimum clock is bounded by the predefined threshold (Section 3.1.2 para. 2 recites “The consistency manager works by blocking client process worker threads when reading parameters, until the local model image A has been updated to meet the consistency requirements. Bounded staleness puts constraints on parameter age; Bosen will block if A is older than the worker's current iteration by S or more (i.e., CurrentIteration(worker) - Age(A) ≥ S), where S is the user-defined staleness threshold. A’s age is defined as the oldest iteration such that some updates generated within that iteration are missing from A”).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine these teachings by integrating the use of global and local clocks from Wei into the bounded asynchronous parallel processing system from Li (as modified by Xing and Kadav). Li states that “for a worker at iteration T, if its parameters are older than (T - S - 1), then the parameters must be updated” (Section III A, para. 4), which implies, but does not explicitly state, that the bounds of a local module are based on a set of clocks. Wei more explicitly defines the boundaries of the staleness threshold to be dependent on the global and local clocks, which would give one of ordinary skill a clearer, unambiguous way to track the progress of each local module.
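For illustration only, the clock-based bound discussed above can be sketched as follows: a global minimum clock is computed from the clocks reported by each worker, and a worker is interrupted while its local clock is ahead of that minimum by more than a threshold. The function names and the dictionary of clocks are hypothetical. The comparison follows the wording of claim 10 ("exceeds a predefined threshold"); the passage quoted from Wei instead blocks at "S or more", and that difference is noted in the code.

```python
# Illustrative sketch only; names are hypothetical and the code is not
# taken from Wei or Li.

def global_min_clock(worker_clocks):
    """Smallest clock (completed-iteration count) reported by any worker."""
    return min(worker_clocks.values())


def must_block(local_clock, min_clock, threshold):
    # Claim 10 wording: block when the difference exceeds the threshold.
    # The passage quoted from Wei would use >= here ("S or more").
    return local_clock - min_clock > threshold


clocks = {"w1": 5, "w2": 3, "w3": 4}
slowest = global_min_clock(clocks)
blocked = {name: must_block(c, slowest, 1) for name, c in clocks.items()}
```

With a threshold of 1, only the worker at clock 5 is held back in this example; it may resume once the slowest worker advances and the difference is again within the threshold.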

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEAH M FEITL whose telephone number is (571)272-8350. The examiner can normally be reached on M-F 0800-1700.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/L.M.F./Examiner, Art Unit 2121                      
                                                                                                                                                                     
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121