DETAILED ACTION
This action is in response to claims filed 22 August, 2018 for application 16/078,983 filed 22 August, 2018. Currently claims 1-20 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 7, 8, 16 and 17 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
The term "equal approximately to 1" in claims 7 and 16 is a relative term which renders the claims indefinite.  The term "approximately" is not defined by the claims, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention.
The terms "relatively small" and "small" in claims 8 and 17 are relative terms which render the claims indefinite.  The terms "relatively small" and "small" are not defined by the claims, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-3, 5, 9-14 and 18-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Moritz et al. (SPARKNET: TRAINING DEEP NETWORKS IN SPARK).

Regarding claims 1, 11 and 19, Moritz discloses: A method for training a learning machine, comprising:
broadcasting an initial global model for a training cycle to a plurality of worker nodes (Fig 1, “Our implementation works well out-of-the box on a five-node EC2 cluster in which broadcasting and collecting model parameters (several hundred megabytes per worker) takes on the order of 20 seconds, and performing a single minibatch gradient computation requires about 2 seconds (for AlexNet).” P2 ¶2); 
(“Our implementation works well out-of-the box on a five-node EC2 cluster in which broadcasting and collecting model parameters (several hundred megabytes per worker) takes on the order of 20 seconds, and performing a single minibatch gradient computation requires about 2 seconds (for AlexNet). We achieve this by providing a simple algorithm for parallelizing SGD that involves minimal communication and lends itself to straightforward implementation in batch computational frameworks” p2 ¶2, “This figure depicts a parallel run of SGD on K = 4 machines under SparkNet’s parallelization scheme. At each step, each machine runs SGD with batch size b for τ iterations, after which the models are aggregated, averaged, and broadcast to the workers. The quantity Ma(b, K, τ ) is the number of rounds (of τ iterations) required to obtain an accuracy of a. The total number of parallel iterations of SGD under SparkNet’s parallelization scheme required to obtain an accuracy of a is then τMa(b, K, τ ).” Figure 2.c, Workers are independent as shown in figure 1); 
aggregating the plurality of updated local models to obtain an aggregated model (“This figure depicts a parallel run of SGD on K = 4 machines under SparkNet’s parallelization scheme. At each step, each machine runs SGD with batch size b for τ iterations, after which the models are aggregated, averaged, and broadcast to the workers. The quantity Ma(b, K, τ ) is the number of rounds (of τ iterations) required to obtain an accuracy of a. The total number of parallel iterations of SGD under SparkNet’s parallelization scheme required to obtain an accuracy of a is then τMa(b, K, τ ).” Figure 2.c); and 
generating an updated global model for the training cycle based at least on the aggregated model and historical information which is obtained from a preceding training cycle (Fig 2c, note: under Broadest Reasonable Interpretation the gradients are based on historical information from the previous iteration and are aggregated to update the global model).
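
For orientation, the parallelization scheme quoted from Moritz's Figure 2 caption (each machine runs SGD for τ iterations, after which the models are averaged and broadcast) can be sketched as follows. This is a minimal illustration, not code from Moritz or from the application: the one-parameter least-squares model, the helper names (`local_sgd`, `train`, `make_shards`), and all numeric settings are assumptions chosen only to keep the example self-contained.

```python
import random

def local_sgd(w, data, lr, batch_size, tau, rng):
    """Run tau minibatch SGD steps on a toy 1-D model y ~ w * x (assumed model)."""
    for _ in range(tau):
        batch = rng.sample(data, batch_size)
        # Gradient of the mean squared error over the minibatch.
        grad = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad
    return w

def train(shards, w0, rounds, tau, lr=0.1, batch_size=4, seed=0):
    """Rounds of: broadcast the global model, run local SGD, average the results."""
    rng = random.Random(seed)
    w_global = w0
    for _ in range(rounds):
        # Broadcast: every worker starts the round from the same global model.
        local_models = [
            local_sgd(w_global, shard, lr, batch_size, tau, rng)
            for shard in shards
        ]
        # Aggregate: the average of the local models becomes the next global model.
        w_global = sum(local_models) / len(local_models)
    return w_global

def make_shards(k, n_per_shard, true_w, seed=1):
    """Build k noiseless data shards for the toy model (hypothetical data)."""
    rng = random.Random(seed)
    shards = []
    for _ in range(k):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n_per_shard)]
        shards.append([(x, true_w * x) for x in xs])
    return shards

shards = make_shards(k=4, n_per_shard=32, true_w=3.0)
w_final = train(shards, w0=0.0, rounds=50, tau=5)
```

Here `k=4` mirrors the K = 4 machines in the quoted caption; in this sketch the workers run sequentially in one process, whereas Figure 1 of Moritz shows them as separate nodes.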

Regarding claims 2 and 12, Moritz discloses: The method of claim 1, wherein generating an updated global model for the training cycle further comprises:
determining a first global model update based on the aggregated model and the initial global model for the training cycle (“This figure depicts a parallel run of SGD on K = 4 machines under SparkNet’s parallelization scheme. At each step, each machine runs SGD with batch size b for τ iterations, after which the models are aggregated, averaged, and broadcast to the workers. The quantity Ma(b, K, τ ) is the number of rounds (of τ iterations) required to obtain an accuracy of a. The total number of parallel iterations of SGD under SparkNet’s parallelization scheme required to obtain an accuracy of a is then τMa(b, K, τ ).” Figure 2.c, “We recommend initializing the network by running SGD for a small number of iterations on the master.” P4 ¶1); 
determining a second global model update based on the historical information from the preceding training cycle and the first global model update (“This figure depicts a parallel run of SGD on K = 4 machines under SparkNet’s parallelization scheme. At each step, each machine runs SGD with batch size b for τ iterations, after which the models are aggregated, averaged, and broadcast to the workers. The quantity Ma(b, K, τ ) is the number of rounds (of τ iterations) required to obtain an accuracy of a. The total number of parallel iterations of SGD under SparkNet’s parallelization scheme required to obtain an accuracy of a is then τMa(b, K, τ ).” Figure 2.c); 
generating the updated global model for the training cycle based on an updated global model for the preceding training cycle and the second global model update (Fig 2c, note: under Broadest Reasonable Interpretation the gradients are based on historical information from the previous iteration and are aggregated to update the global model).
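
The three determinations recited in claims 2 and 12 can be laid out as a short sketch. The claim language quoted above does not say how the historical information and the first update are combined, so the linear combination below and the coefficient names `eta` and `zeta` are assumptions made purely for illustration; they are taken neither from the claims nor from Moritz.

```python
def update_global_model(w_prev, w_init, w_agg, hist, eta, zeta):
    """One training cycle of the three-step update (illustrative form only).

    w_prev: updated global model from the preceding cycle
    w_init: initial global model broadcast for this cycle
    w_agg:  aggregated (e.g. averaged) local models
    hist:   historical information from the preceding cycle
    eta, zeta: hypothetical weighting coefficients (assumed, not from the source)
    """
    # First global model update: aggregated model relative to the initial model.
    first = [a - i for a, i in zip(w_agg, w_init)]
    # Second global model update: historical information combined with the first.
    second = [eta * h + zeta * g for h, g in zip(hist, first)]
    # Updated global model: preceding cycle's model plus the second update.
    w_new = [p + s for p, s in zip(w_prev, second)]
    # The second update is returned so it can serve as the next cycle's history.
    return w_new, second

w_new, hist = update_global_model(
    w_prev=[1.0, 2.0], w_init=[1.0, 2.0], w_agg=[1.5, 1.0],
    hist=[1.0, 1.0], eta=0.5, zeta=1.0,
)
```

Under these assumptions, setting `eta = 0` and `zeta = 1` with `w_init` equal to `w_prev` makes the updated global model equal to the aggregated model.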
 
Regarding claims 3 and 13, Moritz discloses: The method of claim 1, wherein the initial global model for the training cycle is an updated global model for the preceding training cycle (“This figure depicts a parallel run of SGD on K = 4 machines under SparkNet’s parallelization scheme. At each step, each machine runs SGD with batch size b for τ iterations, after which the models are aggregated, averaged, and broadcast to the workers. The quantity Ma(b, K, τ ) is the number of rounds (of τ iterations) required to obtain an accuracy of a. The total number of parallel iterations of SGD under SparkNet’s parallelization scheme required to obtain an accuracy of a is then τMa(b, K, τ ).” Figure 2.c); or the initial global model for the training cycle is determined based on an updated global 

Regarding claims 5 and 14, Moritz discloses: The method of claim 2, further comprising:
generating historical information from the training cycle based on the second global model update  (Fig 2c, note: under Broadest Reasonable Interpretation the gradients are based on historical information from the previous iteration and are aggregated to update the global model).

Regarding claims 9 and 18, Moritz discloses: The method of claim 1, wherein aggregating the plurality of updated local models further comprises: averaging the plurality of updated local models to obtain the aggregated model (“This figure depicts a parallel run of SGD on K = 4 machines under SparkNet’s parallelization scheme. At each step, each machine runs SGD with batch size b for τ iterations, after which the models are aggregated, averaged, and broadcast to the workers. The quantity Ma(b, K, τ ) is the number of rounds (of τ iterations) required to obtain an accuracy of a. The total number of parallel iterations of SGD under SparkNet’s parallelization scheme required to obtain an accuracy of a is then τMa(b, K, τ ).” Figure 2.c).

Regarding claim 10, Moritz discloses: The method of claim 1, further comprising:
(“The rightmost column of each heatmap corresponds to the case τ = 1, where we synchronize after every iteration of SGD. This is equivalent to running serial SGD with a batch size of Kb, where b is the batchsize on each worker (in these experiments we use b = 100). In this column, the speedup should increase sublinearly with K. We note that it is slightly surprising that the speedup does not increase monotonically from left to right as τ decreases. Intuitively, we might expect more synchronization to be strictly better (recall we are disregarding the overhead due to synchronization). However, our experiments suggest that modest delays between synchronizations can be beneficial.” P5 §3.1.2 ¶5).
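
The equivalence stated in the quoted passage for τ = 1 can be checked in one line. Assume every worker k starts from the same model w, applies a common learning rate η (a symbol introduced here, not in the quoted passage), and computes its gradient over a minibatch B_k of size b with per-example loss f_i. Averaging the K one-step results gives:

```latex
\frac{1}{K}\sum_{k=1}^{K}\Bigl(w - \frac{\eta}{b}\sum_{i\in B_k}\nabla f_i(w)\Bigr)
  \;=\; w - \frac{\eta}{Kb}\sum_{k=1}^{K}\sum_{i\in B_k}\nabla f_i(w)
```

The right-hand side is a single SGD step from w over the combined Kb examples, which is the serial SGD with batch size Kb described in the quotation.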

Regarding claim 20, Moritz discloses: A system, comprising:
one or more storage devices for storing training data for training a learning machine (Fig 1); 
a plurality of worker nodes (Fig 1); and 
a master node (Fig 1) for performing acts of claim 1 (please see rejection of claim 1).

Claim Rejections - 35 USC § 103
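In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.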
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Moritz in view of Povey et al. (PARALLEL TRAINING OF DNNS WITH NATURAL GRADIENT AND PARAMETER AVERAGING).

Regarding claim 4, Moritz discloses: The method of claim 1, wherein each updated local model is generated by one of the following algorithms: stochastic gradient descent (“This figure depicts a parallel run of SGD on K = 4 machines under SparkNet’s parallelization scheme. At each step, each machine runs SGD with batch size b for τ iterations, after which the models are aggregated, averaged, and broadcast to the workers. The quantity Ma(b, K, τ ) is the number of rounds (of τ iterations) required to obtain an accuracy of a. The total number of parallel iterations of SGD under SparkNet’s parallelization scheme required to obtain an accuracy of a is then τMa(b, K, τ ).” Figure 2.c).
However, Moritz does not explicitly disclose: wherein each updated local model is generated by one of the following algorithms: a one-sweep mini-batch stochastic gradient descent (SGD) with momentum trick, a natural gradient SGD, and an asynchronous SGD (ASGD).

Povey teaches:  a natural gradient SGD (“Parallel training of neural networks generally makes use of some combination of model parallelism and data parallelism (Dean et al., 2012), and the normal approach to data parallelism involves communication of model parameters for each minibatch. Here we describe our neural-net training framework which uses a different version of data parallelism: we have multiple SGD processes on separate machines, and only infrequently (every minute or so) average the model parameters and redistribute them to the individual machines. This is very effective for us for large-scale training of systems for speech recognition– but in our case it only works well when combined with an efficient implementation of natural gradient stochastic gradient descent (NG-SGD) that we have developed. We don’t attempt in this paper to develop a framework that explains why parameter averaging should work well despite non-convexity of DNNs, or why NG-SGD is so helpful. The point of this paper is to describe our methods and to establish that empirically they work well. The significance of this work is that we show that it is possible to get a linear speedup when increasing the number of GPUs, without requiring frequent data transfer (however, this only holds up to about 4 or 8 GPUs).” P1 §introduction ¶1).

Moritz and Povey are both in the same field of endeavor of distributed stochastic gradient descent (SGD) methods for training neural networks and are analogous art. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the SGD with master and worker nodes as disclosed by Moritz with the natural gradient SGD of Povey to improve the speed of the system. One would have been motivated to combine as Povey shows that using a natural gradient SGD empirically speeds up the process for multiple GPU systems (p1 §1 ¶1).

Allowable Subject Matter
Claims 6-8 and 15-17 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims and overcoming all other rejections. 
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Dean et al. (Large Scale Distributed Deep Networks) discloses exemplary methods for asynchronous SGD. Strom (US Patent 10,152,676) discloses distributed training of models with SGD.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC NILSSON whose telephone number is (571)272-5246.  The examiner can normally be reached on M-F: 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at (571)-272-3719.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  






/ERIC NILSSON/           Primary Examiner, Art Unit 2122