DETAILED ACTION
This action is in response to the Applicant Response filed 18 April 2022 for application 16/204,770 filed 29 November 2018.
Claims 1-3, 5, 7, 9-11, 13, 15-17, 19-23 are currently amended.
Claims 6, 8, 14 are cancelled.
Claims 1-5, 7, 9-13, 15-23 are pending.
Claims 1-5, 7, 9-13, 15-23 are rejected.

	
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 18 April 2022 has been entered.
 
Response to Arguments
Applicant’s arguments regarding the objections to the claims have been fully considered and, in light of the amendments to the claims, are persuasive.

Applicant’s arguments regarding the 35 U.S.C. 102 rejections of claims 1, 3, 5, 9, 11, 13, 15, 17, 19 and the 35 U.S.C. 103 rejections of claims 2, 4, 7, 10, 12, 16, 18, 20-23 have been fully considered but are not persuasive.
It is noted that, while the Examiner may appreciate differences between the applied art and features described in the originally filed specification, any such features must be explicitly recited in the claims themselves and/or definitively and comprehensively defined in the specification in order to be considered and to affect the broadest reasonable interpretation (BRI) of the metes and bounds of the claim terms. Applicant is respectfully reminded that during examination, the BRI of the claim terms consistent with the specification applies, and thus, the applicant is encouraged to amend the claims or point to portion(s) of the originally filed specification that preclude a BRI of the claim terms under which they correspond to the applied art (MPEP 2173.01).
Applicant argues that the cited references do not teach the amended limitations of claim 1 (similarly claims 9 and 15), particularly the following:
a pointer component that identifies one or more compressed gradient weights, from a first group of second learning entities of a distributed machine learning system, not present in a first concatenated compressed gradient weight for a first learning entity of the distributed machine learning system based on a second group of second learning entities of the distributed machine learning system, and that was previously sent to the first learning entity, wherein the first group of second learning entities is less than an entirety of second learning entities of the distributed machine learning system and is different from the second group of second learning entities; 
a compression component that computes a second concatenated compressed gradient weight for the first learning entity, based on a function that uses the one or more compressed gradient weights and does not use any compressed gradient weights employed to generate the first concatenated compressed gradient weight, to update a weight of the first learning entity; and 
a transmit component that transmits, via a network, to the first learning entity, the second concatenated compressed gradient weight to initiate the first learning entity to update the weight of the first learning entity using the second concatenated compressed gradient weight (emphasis by applicant).
Applicant first argues that Wen teaches updating the first learning entity using synchronous SGD, not asynchronous SGD. Examiner respectfully disagrees, in part, with applicant’s arguments. As noted by applicant, Wen teaches calculating a local gradient (which is ternarized [compressed]) for each of the worker computers and sending the local gradients to the parameter server (Wen, section 3.1). Further, although not specifically addressed by applicant, Wen teaches that this process occurs for every iteration t (Wen section 3.1). Because, for each iteration, a global gradient is calculated [second concatenated compressed gradient weight] using the local gradients for that iteration, and none of the local gradients used for a global gradient for a previous iteration [first concatenated compressed gradient weight] are used, Wen teaches generating a global gradient without using local gradients from previous iterations (Wen, section 3.1). Wen goes on to teach distributing the global gradient for any given iteration to each worker (Wen, section 3.1). As noted by applicant, Wen describes this process in a synchronous manner. However, Wen notes that this process can be implemented using asynchronous SGD but is silent as to the steps of asynchronous SGD (Wen, section 3.1).
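For illustration only, the per-iteration flow attributed to Wen above can be sketched as follows. This is a minimal sketch, assuming a stochastic ternarization in which each gradient component is mapped to -s, 0, or +s with s equal to the largest absolute component; the function names ternarize, server_average, and run_iteration are hypothetical and do not appear in Wen. The sketch is not Wen's implementation and is offered only to show that the global gradient for an iteration is formed solely from the local gradients of that iteration.

```python
import random

def ternarize(grad, rng):
    """Stochastically map each component to -s, 0, or +s, where s = max |g_i|.

    Assumption: a component is kept with probability |g_i| / s, otherwise zeroed.
    """
    s = max(abs(g) for g in grad)
    if s == 0.0:
        return [0.0] * len(grad)
    out = []
    for g in grad:
        keep = rng.random() < abs(g) / s
        sign = 1.0 if g > 0 else -1.0
        out.append(sign * s if keep else 0.0)
    return out

def server_average(ternary_grads):
    """Average the ternarized gradients received for a single iteration."""
    n = len(ternary_grads)
    return [sum(column) / n for column in zip(*ternary_grads)]

def run_iteration(local_grads, rng):
    """One iteration: every worker compresses, the server averages.

    Only the gradients passed in for this iteration are used; nothing is
    carried over from an earlier iteration.
    """
    compressed = [ternarize(g, rng) for g in local_grads]
    return server_average(compressed)
```

In this sketch the value returned by run_iteration plays the role of the averaged global gradient that is sent back to every worker.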
Additionally, applicant argues that Zhang does not cure the deficiencies of Wen. Examiner respectfully disagrees. While applicant asserts a failure to cure the deficiencies of Wen, applicant provides no evidence to support this assertion. However, as discussed in more detail below, Zhang teaches asynchronous SGD. Zhang teaches an n-softsync protocol, i.e., asynchronous SGD (Zhang, section 2.2). Zhang teaches, for each iteration, receiving local gradients from a group of learners (workers) that does not include the entirety of the workers, generating a global gradient based on the received local gradients and sending the global results to the learners [including the first learning entity] (Zhang, section 2.2). Zhang further teaches repeating this process for each iteration using a different group of local gradients for each iteration (Zhang, section 2.2), therefore, teaching asynchronous SGD.
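For illustration only, the group-wise update attributed to Zhang above can be sketched as follows. This is a minimal sketch, assuming the server applies an update once it has collected gradients from a fraction of the learners (here, the number of learners divided by n, rounded down) and then discards those gradients; the class name SoftsyncServer and its fields are hypothetical, and Zhang's staleness-dependent learning rate is deliberately omitted.

```python
class SoftsyncServer:
    """Hypothetical parameter server that updates after a partial group reports."""

    def __init__(self, weights, num_learners, n, lr):
        self.weights = list(weights)
        # Assumption: update after floor(num_learners / n) gradients arrive.
        self.threshold = max(1, num_learners // n)
        self.lr = lr
        self.timestamp = 0   # number of weight updates applied so far
        self.pending = []    # (learner_id, gradient) received since the last update

    def push(self, learner_id, grad):
        """Accept one gradient; return the contributing group if an update fired."""
        self.pending.append((learner_id, list(grad)))
        if len(self.pending) < self.threshold:
            return None
        contributors = [lid for lid, _ in self.pending]
        grads = [g for _, g in self.pending]
        avg = [sum(column) / len(grads) for column in zip(*grads)]
        self.weights = [w - self.lr * a for w, a in zip(self.weights, avg)]
        self.pending = []    # gradients are never reused for a later update
        self.timestamp += 1
        return contributors
```

With four learners and n = 2, the first update in this sketch is formed from whichever two learners report first and the second update from the next two, so each update draws on a different group that is smaller than the full set of learners.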
Lastly, applicant argues that Wen does not teach a concatenated compressed gradient weight, but instead teaches an averaged compressed gradient weight. While Examiner agrees that Wen teaches an averaged global gradient, Examiner respectfully disagrees that this does not teach a concatenated compressed gradient. Applicant points to paragraph 0091 of the instant application, which states that the concatenated vector can be a concatenation of the local gradients; however, this is simply exemplary language describing one method of creating the concatenated gradients. The specification of the instant application also states that the concatenated compressed gradients can be formed using formulas 202 (Fig. 2) and 404 (Fig. 4) (e.g., Specification, ¶¶0056, 0066, 0073, 0079-0080, 0090). Both of these equations compute the concatenated compressed gradients as the average of the local gradients. Therefore, Wen does in fact teach a concatenated compressed gradient weight, and the combination of Wen and Zhang teaches all of the limitations of claim 1.
Therefore, claim 1 is rejected under 35 U.S.C. 103 as unpatentable over Wen in view of Zhang. For similar reasons, claims 9, 15 are also rejected as unpatentable over Wen in view of Zhang. Moreover, the rejections of claims 1, 9, 15 apply to all claims which depend on claims 1, 9, 15, including claims 2-3, 5, 7, 10-11, 13, 16-17, 19-23, which are also unpatentable over Wen in view of Zhang, and claims 4, 12, 18, which are unpatentable over Wen in view of Zhang and further in view of Lim.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-3, 5, 7, 9-11, 13, 15-17, 19-23 are rejected under 35 U.S.C. 103 as being unpatentable over Wen et al. (TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, hereinafter referred to as "Wen") in view of Zhang et al. (Staleness-aware Async-SGD for Distributed Deep Learning, hereinafter referred to as "Zhang").

Regarding claim 1 (Currently Amended), Wen teaches a system, comprising: 
a memory that stores computer executable components (Wen, Appendix B – teaches CPU/GPU based deep learning systems with distributed TensorFlow on a cluster of 4 machines, each of which had 4 GTX 1080 GPUs); and 
a processor that executes the computer executable components stored in the memory (Wen, Appendix B – teaches CPU/GPU based deep learning systems with distributed TensorFlow on a cluster of 4 machines, each of which had 4 GTX 1080 GPUs), wherein the computer executable components comprise: 
a pointer component that identifies one or more compressed gradient weights, from a first group of second learning entities of a distributed machine learning system, not present in a first concatenated compressed gradient weight for a first learning entity of the distributed machine learning system based on a second group of second learning entities of the distributed machine learning system, and that was previously sent to the first learning entity (Wen, section 3.1 – teaches that at iteration t, each worker computer generates local gradients, ternarizes [compresses] the gradients, and sends them to the parameter server [ternarized local gradients interpreted as one or more compressed gradient weights], where the parameter server averages the gradients from all the workers and sends the averaged gradients back to the workers; see also Wen, Figure 1, Algorithm 1 [Because the local gradients are calculated during iteration t, they are not present in the previously calculated averaged gradients at t-1 (first concatenated compressed gradient weights).]), wherein the first group of second learning entities is less than an entirety of second learning entities of the distributed machine learning system and is different from the second group of second learning entities (Wen, section 3.1 – teaches that the gradient compression method can be implemented using asynchronous SGD); 
a compression component that computes a second concatenated compressed gradient weight for the first learning entity, based on a function that uses the one or more compressed gradient weights and does not use any compressed gradient weights employed to generate the first concatenated compressed gradient weight, to update a weight of the first learning entity (Wen, section 3.1 – teaches that at iteration t, each worker computes local ternarized gradients [compressed gradient weights] and sends them to the parameter server, where the parameter server averages [concatenates] the gradients from all the workers and sends the averaged gradients [second concatenated compressed gradient weight] back to the workers to update the workers [learning entities]; see also Wen, Figure 1, Algorithm 1 [Because the compressed gradient weights are computed at each iteration and used to compute concatenated compressed gradient weights for that iteration, no local compressed gradient weights are used for concatenated gradient weights of a different iteration]; see also Wen, section 3.1 – teaches that the gradient compression method can be implemented using asynchronous SGD); and 
a transmit component that transmits, via a network, to the first learning entity, the second concatenated compressed gradient weight to initiate the first learning entity to update the weight of the first learning entity using the second concatenated compressed gradient weight (Wen, section 3.1 – teaches transmitting the averaged gradients for iteration t [second concatenated compressed gradient weight] to the workers [including the first entity] for update; see also Wen, section 3.1 – teaches that the gradient compression method can be implemented using asynchronous SGD).
While Wen teaches the method for compressed gradient concatenation, and Wen notes that the method can be used with asynchronous SGD, Wen does not explicitly teach the steps of asynchronous SGD.
Zhang teaches
a pointer component that identifies one or more … gradient weights, from a first group of second learning entities of a distributed machine learning system, not present in a first concatenated … gradient weight for a first learning entity of the distributed machine learning system based on a second group of second learning entities of the distributed machine learning system, and that was previously sent to the first learning entity (Zhang, section 2.2 – teaches, at iteration t, generating local gradients [one or more gradient weights] from a group of learners [first group of distributed second learning entities] for a first learner [first learning entity] which were not present in aggregated gradient weights at iteration t-1 [first concatenated gradient weight] which was applied to a first learner [first learning entity], wherein the aggregated gradient weights at iteration t-1 were generated using local gradient weights for a different group of learners [second group of distributed second learning entities]; see also Zhang, section 2.4 [Because the compressed gradient weights are computed at each iteration and used to compute concatenated compressed gradient weights for that iteration, no local compressed gradient weights are used for concatenated gradient weights of a different iteration]), wherein the first group of second learning entities is less than an entirety of second learning entities of the distributed machine learning system and is different from the second group of second learning entities (Zhang, section 2.2 – teaches that the groups of learners are subsets of the entire group of learners and the subsets at each iteration are different; see also Zhang, section 2.4); 
a compression component that computes a second concatenated … gradient weight for the first learning entity, based on a function that uses the one or more … gradient weights and does not use any … gradient weights employed to generate the first concatenated compressed gradient weight, to update a weight of the first learning entity (Zhang, section 2.2 – teaches that at each iteration only the local gradients for that iteration are used to generate the concatenated gradients for that iteration and no local gradients for a given iteration are used to generate a concatenated gradient for a different iteration); and 
a transmit component that transmits, via a network, to the first learning entity, the second concatenated … gradient weight to initiate the first learning entity to update the weight of the first learning entity using the second concatenated … gradient weight (Zhang, section 2.2 – teaches updating the weights of the first learner using the concatenated weights at each iteration).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to modify Wen with the teachings of Zhang in order to accelerate training of large-scale deep networks in a distributed environment compared to SSGD and conventional ASGD algorithms in the field of distributed deep learning (Zhang, Abstract - "Deep neural networks have been shown to achieve state-of-the-art performance in several machine learning tasks. Stochastic Gradient Descent (SGD) is the preferred optimization algorithm for training these networks and asynchronous SGD (ASGD) has been widely adopted for accelerating the training of large-scale deep networks in a distributed computing environment. However, in practice it is quite challenging to tune the training hyperparameters (such as learning rate) when using ASGD so as achieve convergence and linear speedup, since the stability of the optimization algorithm is strongly influenced by the asynchronous nature of parameter updates. In this paper, we propose a variant of the ASGD algorithm in which the learning rate is modulated according to the gradient staleness and provide theoretical guarantees for convergence of this algorithm. Experimental verification is performed on commonly-used image classification benchmarks: CIFAR10 and Imagenet to demonstrate the superior effectiveness of the proposed approach, compared to SSGD (Synchronous SGD) and the conventional ASGD algorithm.").

Regarding claim 2 (Currently Amended), Wen in view of Zhang teaches all of the limitations of the system of claim 1 as noted above. Zhang further teaches wherein the pointer component identifies the one or more compressed gradient weights based on a first timestamp corresponding to the first concatenated compressed gradient weight and one or more second timestamps corresponding respectively to the one or more compressed gradient weights (Zhang, section 2.4 – teaches that the weights are updated once the parameter server has received a given number, e.g., 30 in the reference, of gradients from any of the learners [This demonstrates that the system has to have a timestamp of the last update (first timestamp) and a timestamp of the incoming gradients (second timestamps) to make sure the new gradients are identified after the last update. Further, while Zhang sends updated weights back to the learners, it would be obvious to a person having ordinary skill, especially in light of Wen, that the gradients could be sent to the learners.]).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Wen and Zhang in order to use timestamps for local compressed gradients to accelerate the training of large-scale deep networks in a distributed computing environment (Zhang, Abstract).
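For illustration only, the timestamp comparison described in the rejection of claim 2 can be sketched as follows. This is a minimal sketch, assuming integer timestamps and a simple record layout; the names StampedGradient and select_new are hypothetical and are not drawn from Wen, Zhang, or the instant application.

```python
from typing import List, NamedTuple, Tuple

class StampedGradient(NamedTuple):
    """Hypothetical record pairing a local gradient with its arrival timestamp."""
    learner_id: int
    timestamp: int
    values: Tuple[float, ...]

def select_new(gradients: List[StampedGradient],
               last_aggregate_timestamp: int) -> List[StampedGradient]:
    """Return the gradients stamped after the last aggregate was formed.

    Gradients stamped at or before last_aggregate_timestamp are treated as
    already reflected in the earlier aggregate and are excluded.
    """
    return [g for g in gradients if g.timestamp > last_aggregate_timestamp]
```

Under these assumptions, the first timestamp is last_aggregate_timestamp and the second timestamps are the timestamp fields of the incoming records.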

Regarding claim 3 (Currently Amended), Wen in view of Zhang teaches all of the limitations of the system of claim 1 as noted above. Wen further teaches wherein the compression component computes the first concatenated compressed gradient weight based on one or more second compressed gradient weights of the second group of second learning entities of the distributed machine learning system (Wen, section 3.1 – teaches that at iteration t, each worker computes local gradients and sends them to the parameter server, where the parameter server averages the gradients from all the workers and sends the averaged gradients back to the workers; see also Wen, Figure 1, Algorithm 1 [Because this happens at each iteration, the averaged gradients at t-1 (first concatenated compressed gradient weight) were based on local gradients at t-1 (second compressed gradient weights)]).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Wen and Zhang for the same reasons as disclosed in claim 1 above.

Regarding claim 5 (Currently Amended), Wen in view of Zhang teaches all of the limitations of the system of claim 1 as noted above. Wen further teaches wherein the second concatenated compressed gradient weight comprises a windowed concatenated compressed gradient weight having only the one or more compressed gradient weights (Wen, section 3.1 – teaches calculating average gradients for the iteration [second concatenated compressed gradient weight] using only the local gradients for each learner for that particular iteration [only the one or more compressed gradient weights]), thereby facilitating at least one of: 
improved processing efficiency associated with the processor (Wen, section 5 – teaches improved processing efficiency); or 
reduced storage consumption associated with the memory.
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Wen and Zhang for the same reasons as disclosed in claim 1 above.

Regarding claim 7 (Currently Amended), Wen in view of Zhang teaches all of the limitations of the system of claim 1 as noted above. Zhang further teaches wherein the distributed machine learning system comprises at least one of an asynchronous machine learning system or an asynchronous stochastic gradient descent system (Zhang, section 2.2 – teaches implementing an asynchronous SGD).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Wen and Zhang in order to use asynchronous stochastic gradient descent to accelerate the training of large-scale deep networks in a distributed computing environment (Zhang, Abstract).

Regarding claim 9 (Currently Amended), it is the computer-implemented method embodiment of claim 1 with similar limitations to claim 1 and is rejected using the same reasoning found in claim 1.

Regarding claim 10 (Currently Amended), the rejection of claim 9 is incorporated herein. Further, the limitations in this claim are taught by Wen in view of Zhang for the reasons set forth in the rejection of claim 2.

Regarding claim 11 (Currently Amended), the rejection of claim 9 is incorporated herein. Further, the limitations in this claim are taught by Wen in view of Zhang for the reasons set forth in the rejection of claim 3.

Regarding claim 13 (Currently Amended), Wen in view of Zhang teaches all of the limitations of the method of claim 9 as noted above. Wen further teaches wherein the second concatenated compressed gradient weight comprises a windowed concatenated compressed gradient weight having only the one or more compressed gradient weights (Wen, section 3.1 – teaches calculating average gradients for the iteration [second concatenated compressed gradient weight] using only the local gradients for each learner for that particular iteration [only the one or more compressed gradient weights]), thereby facilitating improved processing efficiency associated with the processor (Wen, section 5 – teaches improved processing efficiency).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Wen and Zhang for the same reasons as disclosed in claim 9 above.

Regarding claim 15 (Currently Amended), it is the computer program product embodiment of claim 1 with similar limitations to claim 1 and is rejected using the same reasoning found in claim 1.  Wen further teaches a computer program product facilitating a gradient weight compression process (Wen, section 3.1 – teaches ternary compression of gradients), the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor (Wen, Appendix B – teaches CPU/GPU based deep learning systems with distributed TensorFlow on a cluster of 4 machines, each of which had 4 GTX 1080 GPUs) ...
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Wen and Zhang for the same reasons as disclosed in claim 1 above.

Regarding claim 16 (Currently Amended), the rejection of claim 15 is incorporated herein. Further, the limitations in this claim are taught by Wen in view of Zhang for the reasons set forth in the rejection of claim 2.

Regarding claim 17 (Currently Amended), the rejection of claim 15 is incorporated herein. Further, the limitations in this claim are taught by Wen in view of Zhang for the reasons set forth in the rejection of claim 3.

Regarding claim 19 (Currently Amended), Wen in view of Zhang teaches all of the limitations of the computer program product of claim 15 as noted above. Wen further teaches wherein the second concatenated compressed gradient weight comprises a windowed concatenated compressed gradient weight having only the one or more compressed gradient weights (Wen, section 3.1 – teaches calculating average gradients for the iteration [second concatenated compressed gradient weight] using only the local gradients for each learner for that particular iteration [only the one or more compressed gradient weights]).

Regarding claim 20 (Currently Amended), the rejection of claim 15 is incorporated herein. Further, the limitations in this claim are taught by Wen in view of Zhang for the reasons set forth in the rejection of claim 7.

Regarding claim 21 (Currently Amended), Wen in view of Zhang teaches all of the limitations of the computer program product of claim 15 as noted above. Zhang further teaches encode, by the processor, a timestamp on the second concatenated compressed gradient weight (Zhang, section 2.1 – teaches encoding a timestep counter with each gradient/weight transfer).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to combine the teachings of Wen and Zhang in order to encode timestamps for concatenated compressed gradients to accelerate the training of large-scale deep networks in a distributed computing environment (Zhang, Abstract).

Regarding claim 22 (Currently Amended), the rejection of claim 9 is incorporated herein. Further, the limitations in this claim are taught by Wen in view of Zhang for the reasons set forth in the rejection of claim 7.

Regarding claim 23 (Currently Amended), the rejection of claim 9 is incorporated herein. Further, the limitations in this claim are taught by Wen in view of Zhang for the reasons set forth in the rejection of claim 21.

Claims 4, 12, 18 are rejected under 35 U.S.C. 103 as being unpatentable over Wen in view of Zhang and further in view of Lim et al. (US 2018/0336076 A1 – Parameter-Sharing Apparatus and Method, hereinafter referred to as “Lim”).

Regarding claim 4 (Previously Presented), Wen in view of Zhang teaches all of the limitations of the system of claim 3 as noted above. However, Wen in view of Zhang does not explicitly teach wherein the transmit component transmits to the respective learning entities of the distributed machine learning system respective sizes of the one or more second compressed gradient weights.
Lim teaches wherein the transmit component transmits to the respective learning entities of the distributed machine learning system respective sizes of the one or more second compressed gradient weights (Lim, ¶¶0087, 0114-0115 – teaches a distributed system [multiple learning entities] with a central parameter server receiving parameter information, including parameter size, such as size of memory needed to store the parameter, when transferring parameter values [gradient weights]).
It would have been obvious to one of ordinary skill in the art before the filing date of the claimed invention to modify Wen in view of Zhang with the teachings of Lim in order to accelerate training of distributed deep learning systems in the field of distributed deep learning (Lim, ¶0015 – “Due to these features, it is impossible to use the memory box using only a distributed deep-learning framework which uses the existing parameter server. When parameters are shared using the memory box, training of the deep learning may be accelerated owing to the high access speed of the memory box. However, in order to use the memory box, a distributed deep-learning framework must be modified such that parameters are shared through the memory box.”).

Regarding claim 12 (Previously Presented), the rejection of claim 11 is incorporated herein. Further, the limitations in this claim are taught by Wen in view of Zhang and further in view of Lim for the reasons set forth in the rejection of claim 4.

Regarding claim 18 (Previously Presented), the rejection of claim 17 is incorporated herein. Further, the limitations in this claim are taught by Wen in view of Zhang and further in view of Lim for the reasons set forth in the rejection of claim 4.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure:
Dean et al. (Large Scale Distributed Deep Networks) teaches asynchronous stochastic gradient descent.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARSHALL WERNER whose telephone number is (469) 295-9143. The examiner can normally be reached on Monday – Thursday 7:30 AM – 4:30 PM ET.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at (571) 272-7796. The fax number for the organization where this application or proceeding is assigned is (571) 273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/MARSHALL L WERNER/               Examiner, Art Unit 2125