DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 01/19/2021 has been entered.
 	
EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in an email correspondence (see attached) with Jaclyn Schade on 03/12/2021, following an interview conducted on 03/11/2021.
The application has been amended as follows: 

1. 	(Currently Amended)  A method implemented on at least one computing device each of which has at least one processor, storage, and a communication platform connected to a network for estimating one or more parameters of a machine learning model, the method comprising: 	receiving a request for estimating one or more parameters in a vector associated with the machine learning model;
dividing a set of data into a plurality of sub-sets of data, each of which is allocated to corresponding one of a plurality of nodes; and  	receiving, from one of the plurality of nodes, estimated values of the one or more parameters, wherein each of the plurality of nodes is configured to 
	estimate, by a plurality of processing units in the node, values of the one or more parameters based on a sub-set of data allocated to the node, wherein one of the plurality of processing units in the node is preselected as a representative processing unit, wherein the preselected processing unit is a graphical processing unit (GPU),
	aggregate, by the representative processing unit of the node, values of the one or more parameters estimated by the plurality of processing units in the node to generate an estimated vector,
	divide, by the representative processing unit of the node, the estimated vector into a plurality of portions, each portion of the plurality of portions being designated to one of the plurality of nodes,
	collect, by the representative processing unit of the node, estimates of a portion of the vector designated to the node, from representative processing units of all other nodes of the plurality of nodes, to generate a first estimate of the portion of the vector, wherein the first estimate is generated based on the plurality of sub-sets of data,
	broadcast, by the representative processing unit of the node, the first estimate of the portion of the vector to the representative processing units of the all other nodes of the plurality of nodes,
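	receive, by the representative processing unit of the node, from the representative processing units of the all other nodes of the plurality of nodes, second estimates of corresponding portions of the vector designated to the all other nodes, and
	broadcast, by the representative processing unit of the node, the first estimate and the received second estimates, to all other processing units of the plurality of processing units included in the node.
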
2. 	(Previously Presented)  The method of claim 1, wherein each of the plurality of nodes performs the following: 	obtaining training data to be used to estimate the one or more parameters; 	generating updated parameter estimates based on a corresponding sub-set of data; 	exchanging the updated parameter estimates with other nodes of the plurality of nodes to generate a state of the one or more parameters that is shared by the plurality of nodes; and 	repeating the steps of generating and exchanging until a predetermined condition is met to generate the estimated values of the one or more parameters.
3. 	(Canceled)  
4. 	(Previously Presented)  The method of claim 1, further comprising: 	detecting a failure at one of the plurality of nodes; and 	instructing remaining nodes in the plurality of nodes to continue estimating the one or more parameters. 
5. 	(Canceled)  
6. 	(Original)  The method of claim 1, further comprising: 	determining a number of the plurality of nodes in accordance with the request; and 	determining a location of the set of data for estimating the one or more parameters associated with the machine learning model based on the request.
7. 	(Previously Presented)  The method of claim 1, further comprising: 	instructing one of the plurality of nodes to store a snapshot of a state of the one or more parameters of the machine learning model in a storage outside the plurality of nodes.
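8. 	(Previously Presented)  The method of claim 7, further comprising: 	detecting a failure with respect to estimating the one or more parameters; 	instructing the plurality of nodes to retrieve the snapshot of the state of the one or more parameters of the machine learning model from the storage; and 	instructing the plurality of nodes to continue estimating the one or more parameters based on the last state of the one or more parameters.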
9. 	(Previously Presented)  The method of claim 1, wherein the plurality of nodes are synchronized via a Message Passing Interface (MPI) AllReduce based Application Program Interface (API) using a network interface implemented on each of the plurality of nodes, wherein the network interface includes an Ethernet interface, an Infiniband interface, or the Ethernet interface and the Infiniband interface.
10. 	(Original)  The method of claim 1, wherein: 	the set of data for estimating the one or more parameters is stored on a Hadoop Distributed File System (HDFS); and 	the plurality of nodes are implemented in a Spark framework.
11. 	(Currently Amended)  A system, having at least one processor, storage, and a communication platform connected to a network for estimating one or more parameters of a machine learning model, the system comprising: 	a configuration information identifier implemented by the at least one processor and configured for receiving a request for estimating one or more parameters in a vector associated with the machine learning model; 		a training data distributor implemented by the at least one processor and configured for: 		dividing a set of data into a plurality of sub-sets of data, each of which corresponds to one of the plurality of nodes; and 	a training model determiner implemented by the at least one processor and configured for:      	receiving, from one of the plurality of nodes, estimated values of the one or more parameters, wherein each of the plurality of nodes is configured to 
	estimate, by a plurality of processing units in the node, values of the one or more parameters based on a sub-set of data allocated to the node, wherein one of the plurality of processing units in the node is preselected as a representative processing unit, wherein the preselected processing unit is a graphical processing unit (GPU),
	aggregate, by the representative processing unit of the node, values of the one or more parameters estimated by the plurality of processing units in the node to generate an estimated vector,
	divide, by the representative processing unit of the node, the estimated vector into a plurality of portions, each portion of the plurality of portions being designated to one of the plurality of nodes,
	collect, by the representative processing unit of the node, estimates of a portion of the vector designated to the node, from representative processing units of all other nodes of the plurality of nodes, to generate a first estimate of the portion of the vector, wherein the first estimate is generated based on the plurality of sub-sets of data,
	broadcast, by the representative processing unit of the node, the first estimate of the portion of the vector to the representative processing units of the all other nodes of the plurality of nodes,
	receive, by the representative processing unit of the node, from the representative processing units of the all other nodes of the plurality of nodes, second estimates of corresponding portions of the vector designated to the all other nodes, and
	broadcast, by the representative processing unit of the node, the first estimate and the received second estimates, to all other processing units of the plurality of processing units included in the node.

12. 	(Previously Presented)  The system of claim 11, wherein each of the plurality of nodes performs the following: 	obtaining training data to be used to estimate the one or more parameters; 	generating updated parameter estimates based on a corresponding sub-set of data; 	exchanging the updated parameter estimates with other nodes of the plurality of nodes to generate a state of the one or more parameters that is shared by the plurality of nodes; and 	repeating the steps of generating and exchanging until a predetermined condition is met to generate the estimated values of the one or more parameters.
13. 	(Canceled) 

15. 	(Canceled)  

16. 	(Previously Presented)  The system of claim 11, further comprising: 	an operation node selector implemented by the at least one processor and configured for determining a number of the plurality of nodes in accordance with the request; and 	the training data locator implemented by the at least one processor and configured for determining a location of the set of data for estimating the one or more parameters associated with the machine learning model based on the request.
17. 	(Previously Presented)  The system of claim 11, further comprising: 	a training instruction generator implemented by the at least one processor and configured for instructing one of the plurality of nodes to store a snapshot of a state of the one or more parameters of the machine learning model in a storage outside the plurality of nodes.
18. 	(Previously Presented)  The system of claim 17, further comprising: 	a node failure detector implemented by the at least one processor and configured for detecting a failure with respect to estimating the one or more parameters; and 	the training instruction generator is further configured for: 		instructing the plurality of nodes to retrieve the snapshot of the state of the one or more parameters of the machine learning model from the storage, and 		instructing the plurality of nodes to continue estimating the one or more parameters based on the last state of the one or more parameters. 
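19. 	(Previously Presented)  The system of claim 11, wherein the plurality of nodes are synchronized via Message Passing Interface (MPI) AllReduce based Application Program Interface (API) using a network interface implemented on each of the plurality of nodes, wherein the network interface includes an Ethernet interface, an Infiniband interface, or the Ethernet interface and the Infiniband interface.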
20. 	(Currently Amended)  A machine-readable tangible and non-transitory medium having information for estimating one or more parameters of a machine learning model, wherein the information, when read by the machine, causes the machine to perform the following: 	  receiving a request for estimating one or more parameters in a vector associated with the machine learning model;
dividing a set of data into a plurality of sub-sets of data, each of which is allocated to corresponding one of a plurality of nodes; and  	receiving, from one of the plurality of nodes, estimated values of the one or more parameters, wherein each of the plurality of nodes is configured to 
	estimate, by a plurality of processing units in the node, values of the one or more parameters based on a sub-set of data allocated to the node, wherein one of the plurality of processing units in the node is preselected as a representative processing unit, wherein the preselected processing unit is a graphical processing unit (GPU),
	aggregate, by the representative processing unit of the node, values of the one or more parameters estimated by the plurality of processing units in the node to generate an estimated vector,
	divide, by the representative processing unit of the node, the estimated vector into a plurality of portions, each portion of the plurality of portions being designated to one of the plurality of nodes,
	collect, by the representative processing unit of the node, estimates of a portion of the vector designated to the node, from representative processing units of all other nodes of the plurality of nodes, to generate a first estimate of the portion of the vector, wherein the first estimate is generated based on the plurality of sub-sets of data,
	broadcast, by the representative processing unit of the node, the first estimate of the portion of the vector to the representative processing units of the all other nodes of the plurality of nodes,
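	receive, by the representative processing unit of the node, from the representative processing units of the all other nodes of the plurality of nodes, second estimates of corresponding portions of the vector designated to the all other nodes, and
	broadcast, by the representative processing unit of the node, the first estimate and the received second estimates, to all other processing units of the plurality of processing units included in the node.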

21. 	(Previously Presented)  The method of claim 1, wherein the machine learning model includes a cluster, the cluster being a single cluster that enables a deep machine learning to be performed along with a non-deep machine learning and other data processing in the single cluster, with a single program.

22. 	(Original)  The method of claim 21, wherein: 	some data processing is performed in the cluster for producing input datasets for the deep machine learning;  	the deep machine learning is performed in the cluster for extracting features; 	the non-deep machine learning is performed in the cluster to generate a classification model based on the features; and 	the other data processing in the cluster includes applying the classification model against a big data set.
23. 	(Previously Presented)  The method of claim 1, wherein: 	the cluster is an Apache Hadoop cluster on top of Apache Spark; and 	the plurality of nodes in the cluster communicate with each other via an Application Program Interface (API) similar to Message Passing Interface (MPI) AllReduce. 
24. 	(Canceled)  
25.	(Canceled) 
26.	(Previously Presented) The method of claim 1, further comprising:
selecting a number of processing units in the node for estimating the one or more parameters of the machine learning model based on a workload of each processing unit in the node.
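For illustration only, the iterative procedure recited in claims 2, 7, and 8 (generate updated estimates from each node's sub-set of data, exchange them into a shared state, repeat until a predetermined condition is met, keep a snapshot in storage outside the nodes, and resume from that snapshot after a failure) can be simulated in a single process as in the hypothetical Python sketch below. The names SnapshotStore, train, and mean_step, the use of a fixed step count as the predetermined condition, averaging as the exchange operation, and the toy mean-estimation update are all assumptions and are not taken from the application.

```python
from typing import Callable, List, Optional, Tuple

State = List[float]


class SnapshotStore:
    """Hypothetical storage outside the nodes that keeps the latest snapshot."""

    def __init__(self) -> None:
        self._saved: Optional[Tuple[int, State]] = None

    def save(self, step: int, state: State) -> None:
        self._saved = (step, list(state))

    def load(self) -> Tuple[int, State]:
        if self._saved is None:
            raise RuntimeError("no snapshot has been stored")
        step, state = self._saved
        return step, list(state)


def train(shards: List[List[float]],
          local_update: Callable[[State, List[float]], State],
          max_steps: int,
          snapshot_every: int,
          store: SnapshotStore,
          fail_at_step: Optional[int] = None) -> State:
    state: State = [0.0]
    step = 0
    store.save(step, state)
    failed = False
    # Predetermined condition (assumed here): a fixed number of exchange rounds.
    while step < max_steps:
        if step == fail_at_step and not failed:
            # Simulated failure: every node retrieves the snapshot and resumes from it.
            failed = True
            step, state = store.load()
            continue
        # Generate: each node updates the shared state using only its own sub-set of data.
        updates = [local_update(state, shard) for shard in shards]
        # Exchange: average the per-node estimates into a state shared by all nodes.
        state = [sum(values) / len(updates) for values in zip(*updates)]
        step += 1
        if step % snapshot_every == 0:
            store.save(step, state)
    return state


def mean_step(state: State, shard: List[float]) -> State:
    """Toy update: move the single parameter halfway toward the shard mean."""
    w = state[0]
    return [w - 0.5 * (w - sum(shard) / len(shard))]


shards = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
clean = train(shards, mean_step, max_steps=20, snapshot_every=2, store=SnapshotStore())
recovered = train(shards, mean_step, max_steps=20, snapshot_every=2,
                  store=SnapshotStore(), fail_at_step=3)
print(clean == recovered)  # True
```

Because each round in this sketch is deterministic, replaying from the last snapshot yields the same final state as an uninterrupted run; a real system would also need to handle lost in-flight messages and non-deterministic updates.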

Allowable Subject Matter
Claims 1-2, 4, 6-12, 14, 16-23, and 26 are allowed.

REASONS FOR ALLOWANCE
The following is an examiner’s statement of reasons for allowance: 

The prior art of record, alone or in any hypothetical combination, fails to disclose the claims as amended by the above examiner’s amendment.

	Machine learning, especially deep machine learning, is an emerging field of technology. Machine learning models, while useful for tasks such as prediction and classification, are notoriously hard to train. Because these models require large amounts of data and have thousands, if not millions, of parameters, training such a model can take an unacceptable amount of time.
	To address this problem, the sub-field of “distributed machine learning” emerged. In distributed machine learning, large networks of computer nodes (e.g., servers, data farms) work together to train a machine learning model more quickly.
	The instant invention introduces a novel and unique way to train a machine learning model more efficiently in such a distributed setting.
	At least one novel aspect of the instant invention is the concept of a “representative processing unit” (see at least Claim 1). In essence, a representative processing unit is a “leader” of a particular computer node, which, as claimed, comprises a plurality of processing units. This representative processing unit is responsible for the aggregating, dividing, collecting, and broadcasting operations recited in the claims, which it performs without help from a server or master node.
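For illustration only, the per-node flow recited in Claim 1 can be simulated in a single process with the hypothetical Python sketch below. It is not drawn from the application or from the prior art: the helper names, the choice of processing unit 0 as each node's representative, element-wise summation as the aggregation operation, and contiguous, nearly equal portions are all assumptions. Under these assumptions, the divide, collect, broadcast, and receive steps among the representatives amount to a reduce-scatter followed by an allgather, which is a common way of decomposing an AllReduce.

```python
from typing import List

Vector = List[float]


def split_portions(vec: Vector, num_parts: int) -> List[Vector]:
    """Divide vec into num_parts contiguous portions whose sizes differ by at most one."""
    base, extra = divmod(len(vec), num_parts)
    portions, start = [], 0
    for k in range(num_parts):
        end = start + base + (1 if k < extra else 0)
        portions.append(vec[start:end])
        start = end
    return portions


def elementwise_sum(vectors: List[Vector]) -> Vector:
    """Aggregate equal-length vectors by element-wise summation (assumed operator)."""
    return [sum(values) for values in zip(*vectors)]


def hierarchical_allreduce(nodes: List[List[Vector]]) -> List[List[Vector]]:
    """nodes[n][g] is the estimate computed by processing unit g of node n.

    Unit 0 of every node is treated as its preselected representative (hypothetical).
    Returns the vector that each unit of each node holds at the end.
    """
    num_nodes = len(nodes)
    # Aggregate: each representative combines the estimates of the units in its node.
    node_vectors = [elementwise_sum(units) for units in nodes]
    # Divide: each representative splits its estimated vector; portion k is designated to node k.
    portions = [split_portions(vec, num_nodes) for vec in node_vectors]
    # Collect: representative k reduces portion k from all nodes -> first estimate of portion k.
    first_estimates = [
        elementwise_sum([portions[n][k] for n in range(num_nodes)])
        for k in range(num_nodes)
    ]
    result = []
    for units in nodes:
        # Broadcast / receive: each representative keeps its own first estimate and
        # receives the second estimates of all other portions from the other representatives.
        assembled: Vector = []
        for k in range(num_nodes):
            assembled.extend(first_estimates[k])
        # Intra-node broadcast: the representative hands a copy to every unit in its node.
        result.append([list(assembled) for _ in units])
    return result


# Two nodes, each with two processing units and a three-element parameter vector.
nodes = [
    [[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]],
    [[0.5, 0.5, 0.5], [2.0, 0.0, 1.0]],
]
result = hierarchical_allreduce(nodes)
print(result[1][1])  # [4.5, 3.5, 5.5]
```

In this example every processing unit ends up holding the element-wise sum of all four local estimates, and no server or master node takes part in any step.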
	The closest prior art of record is Zhang et al. (“Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines”, NPL 2015). For clarity of record, the examiner notes that this NPL reference was included in an applicant-filed IDS dated 07/02/2017. Similar to the instant invention, Zhang outlines a distributed machine learning system using GPUs in a peer-to-peer network (i.e., node to node). However, Zhang differs from the instant invention in at least two key ways:
	1. As shown in at least Zhang Figure 2, at least part of the parameters are communicated to a server. Conversely, as recited in the claim language, at least the collect, broadcast, and receive steps are all carried out by the representative processing unit of a node; that is, these steps do not pass through a server. Clearly, then, Zhang cannot teach, at least, “by the representative processing unit of the node” as recited in at least Claim 1.
	2. As a consequence of the above, Zhang similarly cannot teach or render obvious that one of the processing units (i.e., a GPU) in a node is preselected as a representative processing unit. As correctly noted by the applicant in the remarks filed 01/19/2021, “…Zhang is silent with respect to [selecting], as a representative processing unit, one of a plurality of processing units of/in a node…and that [Emphasis added] the representative processing unit conducts aggregating, dividing, collecting, and broadcasting operations, without help from a server or master node [Emphasis Added].”

	1. Arguably, Figure 3 of Jacobsen shows a host CPU (central processing unit). Under the Broadest Reasonable Interpretation, a host CPU discloses a “representative processing unit.” However, the claims explicitly recite that the representative processing unit is a GPU (graphical processing unit). Thus, Jacobsen fails to teach, let alone fairly suggest, the recited claim language.
	
Therefore, at least the art of Zhang and Jacobsen, either alone or in combination, fails to teach or even fairly suggest the recited claim language. Thus, Claims 1, 11, and 20, and, by virtue of dependency, Claims 2, 4, 6-10, 12, 14, 16-19, 21-23, and 26, are allowable over the prior art of record.

Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FEN TAMULONIS whose telephone number is (571)272-0934.  The examiner can normally be reached on 7:30AM-5:30PM MON-FRI EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo can be reached on (571)-272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ANN J LO/Supervisory Patent Examiner, Art Unit 2126