DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. This action is responsive to the Application filed on 03/23/2020. Claims 1-20 are pending in the case. Claims 1, 5, and 13 are independent claims.

Specification
The disclosure is objected to because of the following informalities:
In paragraph 46, reference is made to figure 2, and in particular to reference number 202a. In accordance with figure 2, throughout the specification 202a references forward propagation. However, in paragraph 46, 202a is used to reference “backward propagation.” It is believed that this is a typographical error.
Similarly, in paragraph 46, reference 202b is used to reference “backward propagation,” whereas the figure and the remainder of the specification refer to forward propagation.
Appropriate correction is required.

Claim Rejections - 35 U.S.C. § 101
35 U.S.C. § 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more.

Independent claim 1 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1:
The claim is directed towards the statutory category of a process.
Step 2A Prong 1:
The claim recites a mental process. The mental process recited is: a method of exchanging compressed gradient data within a… system for training a… model, the method comprising: computing… a set of gradients using the neural network model and a set of weights associated with the neural network model; performing… a sparsity analysis on the set of gradients to determine a threshold; clipping… each of the set of gradients having a value less than the threshold, resulting in the set of gradients comprising non-clipped data elements and clipped data elements; generating… a mapping that indicates which of the set of gradients correspond to the non-clipped data elements and which of the set of gradients correspond to the clipped data elements; generating… compressed data comprising the non-clipped data elements from the set of gradients;… generating… decompressed data by combining the non-clipped data elements from the compressed data with the clipped data elements using the mapping, such that the decompressed data includes the set of gradients comprising the non-clipped data elements and the clipped data elements; and computing… a set of synchronized gradients based on the set of gradients and other gradients received….
	Under the broadest reasonable interpretation, these limitations are process steps that cover mental processes including an observation, evaluation, judgment or opinion that could be performed in the human mind or with the aid of pencil and paper but for the recitation of a generic computer component. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the "Mental Process" grouping of abstract ideas. A person would readily be able to perform this process either mentally or with the assistance of pen and paper. See MPEP § 2106.04(a)(2).
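For illustration only, the clipping, mapping, compression, and decompression steps recited above can be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from the application or the cited references: the function names compress and decompress are hypothetical, the comparison follows the literal claim language (a value less than the threshold), and clipped elements are assumed to be restored as zero.

```python
from typing import List, Tuple


def compress(gradients: List[float], threshold: float) -> Tuple[List[bool], List[float]]:
    """Clip gradients below the threshold; return (mapping, compressed data)."""
    # mapping[i] is True for a non-clipped element and False for a clipped one.
    mapping = [g >= threshold for g in gradients]
    # The compressed data holds only the non-clipped elements, in order.
    compressed = [g for g, keep in zip(gradients, mapping) if keep]
    return mapping, compressed


def decompress(mapping: List[bool], compressed: List[float],
               clipped_value: float = 0.0) -> List[float]:
    """Rebuild the full-length gradient list from the mapping and compressed data."""
    values = iter(compressed)
    # Assumption: clipped positions are filled with clipped_value (zero here).
    return [next(values) if keep else clipped_value for keep in mapping]


if __name__ == "__main__":
    mapping, compressed = compress([0.5, 0.01, 0.9, 0.02], threshold=0.1)
    print(mapping, compressed, decompress(mapping, compressed))
```

The round trip recovers the non-clipped values at their original positions, which is the role the claimed mapping plays.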
Step 2A Prong 2: 
The claimed invention does not recite any additional elements that integrate the judicial exception into a practical application. Refer to MPEP § 2106.04(d).
The following limitations are merely reciting the words "apply it" (or an equivalent) with the judicial exception, or merely including instructions to implement an abstract idea on a computer, or merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f): at a transmitting worker node of the distributed system; and at the receiving worker node.
The following limitations are adding insignificant extra-solution activity to the judicial exception, as discussed in MPEP § 2106.05(g): transmitting the mapping and the compressed data from the transmitting worker node to a receiving worker node of the distributed system.
The following limitations are generally linking the use of a judicial exception to a particular technological environment or field of use, as discussed in MPEP § 2106.05(h): distributed system; and neural network model.
A claim that integrates a judicial exception into a practical application will apply, rely on, or use the judicial exception in a manner that imposes a meaningful limit on the judicial exception, such that the claim is more than a drafting effort designed to monopolize the judicial exception. See MPEP § 2106.04(d). 
Step 2B:
The claimed invention does not recite any additional elements/limitations that amount to significantly more. 
The following limitations are merely reciting the words "apply it" (or an equivalent) with the judicial exception, or merely including instructions to implement an abstract idea on a computer, or merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f): at a transmitting worker node of the distributed system; and at the receiving worker node.
The following limitations are adding insignificant extra-solution activity to the judicial exception, as discussed in MPEP § 2106.05(g): transmitting the mapping and the compressed data from the transmitting worker node to a receiving worker node of the distributed system. The court decisions cited in MPEP § 2106.05(d)(II) indicate that merely “receiving and transmitting data over a network” is a well‐understood, routine, conventional function when it is claimed in a merely generic manner (as it is in the present claim).
The following limitations are generally linking the use of a judicial exception to a particular technological environment or field of use, as discussed in MPEP § 2106.05(h): distributed system; and neural network model.
The claimed invention recites an abstract idea without significantly more.

Dependent claim 2 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more.
The claim recites a mental process. The mental process recited is: forming… a header comprising the mapping and an original length of the set of gradients, the original length corresponding to a number of the non-clipped data elements and the clipped data elements.
The claimed invention does not recite any additional elements that integrate the judicial exception into a practical application. Refer to MPEP § 2106.04(d). The following limitations are generally linking the use of a judicial exception to a particular technological environment or field of use, as discussed in MPEP § 2106.05(h): at the transmitting worker node.
The claimed invention does not recite any additional elements/limitations that amount to significantly more. The following limitations are generally linking the use of a judicial exception to a particular technological environment or field of use, as discussed in MPEP § 2106.05(h): at the transmitting worker node.
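For illustration only, a header of the kind recited in claim 2 (a mapping together with an original length) can be sketched as follows. This is a minimal sketch under stated assumptions: the byte layout (a 4-byte big-endian length followed by one byte per mapping entry) and the function names form_header and parse_header are hypothetical and are not taken from the application or the cited references.

```python
import struct
from typing import List, Tuple


def form_header(mapping: List[bool]) -> bytes:
    """Build a header holding the original length followed by the mapping."""
    # The original length counts clipped and non-clipped elements together.
    original_length = len(mapping)
    # Assumed layout: 4-byte unsigned big-endian length, then one byte per entry.
    return struct.pack(">I", original_length) + bytes(int(b) for b in mapping)


def parse_header(header: bytes) -> Tuple[int, List[bool]]:
    """Recover the original length and the mapping from a header."""
    (original_length,) = struct.unpack(">I", header[:4])
    mapping = [bool(b) for b in header[4:4 + original_length]]
    return original_length, mapping


if __name__ == "__main__":
    header = form_header([True, False, True, False])
    print(header, parse_header(header))
```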

Dependent claim 3 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more.
The claim recites a mental process. The mental process recited is: the mapping includes a bitmap with binary values indicating locations of the non-clipped data elements and the clipped data elements.
The claimed invention does not recite any additional elements that integrate the judicial exception into a practical application. Refer to MPEP § 2106.04(d). The claimed invention does not recite any additional elements/limitations that amount to significantly more.
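For illustration only, a bitmap with binary values indicating the locations of non-clipped and clipped elements, as recited in claim 3, can be sketched as follows. This is a minimal sketch under stated assumptions: the bit order (most significant bit first, 1 for non-clipped) and the function names to_bitmap and from_bitmap are hypothetical.

```python
from typing import List


def to_bitmap(mapping: List[bool]) -> bytes:
    """Pack a boolean mapping into bytes, one bit per gradient position."""
    bitmap = bytearray((len(mapping) + 7) // 8)
    for i, keep in enumerate(mapping):
        if keep:
            # Assumption: bit 1 marks a non-clipped element, MSB-first per byte.
            bitmap[i // 8] |= 1 << (7 - i % 8)
    return bytes(bitmap)


def from_bitmap(bitmap: bytes, length: int) -> List[bool]:
    """Unpack the first `length` bits of a bitmap into a boolean mapping."""
    return [bool(bitmap[i // 8] & (1 << (7 - i % 8))) for i in range(length)]


if __name__ == "__main__":
    packed = to_bitmap([True, False, True, False])
    print(packed.hex(), from_bitmap(packed, 4))
```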

Dependent claim 4 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more.
The claim recites a mental process. The mental process recited is: clipping each of the set of gradients includes setting the value equal to zero such that the clipped data elements are zero data elements and the non-clipped data elements are non-zero data elements.
The claimed invention does not recite any additional elements that integrate the judicial exception into a practical application. Refer to MPEP § 2106.04(d). The claimed invention does not recite any additional elements/limitations that amount to significantly more.

Dependent claim 12 is rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more.
The claim recites a mental process. The mental process recited is: performing the sparsity analysis includes: calculating an average for the set of gradients; calculating a standard deviation for the set of gradients; and determining the threshold based on the average and the standard deviation.
The claimed invention does not recite any additional elements that integrate the judicial exception into a practical application. Refer to MPEP § 2106.04(d). The claimed invention does not recite any additional elements/limitations that amount to significantly more.
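For illustration only, the sparsity analysis recited in claim 12 (an average, a standard deviation, and a threshold based on both) can be sketched as follows. This is a minimal sketch under stated assumptions: the claim states only that the threshold is "based on" the two statistics, so the combination average + k × standard deviation, the parameter k, and the function name threshold_from_stats are hypothetical.

```python
import statistics
from typing import List


def threshold_from_stats(gradients: List[float], k: float = 1.0) -> float:
    """Derive a clipping threshold from the average and standard deviation."""
    average = statistics.fmean(gradients)
    # Population standard deviation over the whole set of gradients.
    std_dev = statistics.pstdev(gradients)
    # Assumption: the threshold is the average plus k standard deviations.
    return average + k * std_dev


if __name__ == "__main__":
    print(threshold_from_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```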

The remaining claims 5-11 and 13-20 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to an abstract idea without significantly more, for at least the same reasons as those given above with respect to claims 1-4 and 12, with only the addition of generic computer components. Under step 2A prong 1 and the broadest reasonable interpretation, these limitations are process steps that cover mental processes including an observation, evaluation, judgment or opinion that could be performed in the human mind or with the aid of pencil and paper but for the recitation of a generic computer component. If a claim, under its broadest reasonable interpretation, covers a mental process but for the recitation of generic computer components, then it falls within the "Mental Process" grouping of abstract ideas. A person would readily be able to perform this process either mentally or with the assistance of pen and paper. See MPEP § 2106.04(a)(2). The additional limitations merely recite the words "apply it" (or an equivalent) with the judicial exception, or merely include instructions to implement an abstract idea on a computer, or merely use a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f). These additional elements do not integrate the judicial exception into a practical application under step 2A prong 2. Refer to MPEP § 2106.04(d). Moreover, because the limitations merely recite the words "apply it" (or an equivalent) or merely use a computer as a tool, as discussed in MPEP § 2106.05(f), these additional elements do not amount to significantly more. Accordingly, the claimed invention recites an abstract idea without significantly more.

Claim Rejections - 35 U.S.C. § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. §§ 102 and 103 (or as subject to pre-AIA 35 U.S.C. §§ 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 C.F.R. § 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. § 102(b)(2)(C) for any potential 35 U.S.C. § 102(a)(2) prior art against the later invention.

Claims 1, 5, 6, 11, 13, 14, and 19 are rejected under 35 U.S.C. § 103 as being unpatentable over Chetlur et al. (U.S. Pat. App. Pub. No. 2021/0133583, hereinafter Chetlur) in view of Fang et al. (Fang, Jiarui, Haohuan Fu, Guangwen Yang, and Cho-Jui Hsieh. "RedSync: reducing synchronization bandwidth for distributed deep learning training system." Journal of Parallel and Distributed Computing 133 (2019): 30-39, hereinafter Fang).

As to independent claim 1, Chetlur teaches:
A method of exchanging compressed gradient data within a distributed system for training a neural network model, the method comprising (Title and paragraph 52):
computing, at a transmitting worker node of the distributed system, a set of gradients using the neural network model and a set of weights associated with the neural network model (Figure 7, blocks 704, 706, and 710. Paragraph 83, in at least one embodiment, workers can be executed on separate threads, processors, processor cores, or processes on a computer system. In at least one embodiment, gradients are produced at least in part in parallel by workers. Paragraph 83, weight updates distributed to individual workers based on available processing power);… and
computing, at the receiving worker node, a set of synchronized gradients based on the set of gradients and other gradients received at the receiving worker node (Figure 7, blocks 712, 714, and 716. Paragraph 83, in at least one embodiment, workers can be executed on separate threads, processors, processor cores, or processes on a computer system. In at least one embodiment, gradients are produced at least in part in parallel by workers. Paragraph 83, weight updates distributed to individual workers based on available processing power).
Chetlur does not appear to expressly teach performing, at the transmitting worker node, a sparsity analysis on the set of gradients to determine a threshold; clipping, at the transmitting worker node, each of the set of gradients having a value less than the threshold, resulting in the set of gradients comprising non-clipped data elements and clipped data elements; generating, at the transmitting worker node, a mapping that indicates which of the set of gradients correspond to the non-clipped data elements and which of the set of gradients correspond to the clipped data elements; generating, at the transmitting worker node, compressed data comprising the non-clipped data elements from the set of gradients; transmitting the mapping and the compressed data from the transmitting worker node to a receiving worker node of the distributed system; and generating, at the receiving worker node, decompressed data by combining the non-clipped data elements from the compressed data with the clipped data elements using the mapping, such that the decompressed data includes the set of gradients comprising the non-clipped data elements and the clipped data elements.
Fang teaches performing, at the transmitting worker node, a sparsity analysis on the set of gradients to determine a threshold (Page 31, section 2.1 Parallel-friendly residual sparsification, a relative large threshold value is chosen according to mean and maximum value, for example, 0.8×(max−mean)+mean. Operation count_nonzero gets the number of elements whose absolute values are greater than the threshold. If the number is smaller than k (the number of top-0.1% elements), we dynamically decrease the threshold until we find the number of parameters whose absolute value above the threshold is larger than k); clipping, at the transmitting worker node, each of the set of gradients having a value less than the threshold, resulting in the set of gradients comprising non-clipped data elements and clipped data elements (Page 32, we trim all elements that are less than the threshold and perform a top-k selection operation using radixSelect on the remaining elements. The trimming reads on the claimed clipping); generating, at the transmitting worker node, a mapping that indicates which of the set of gradients correspond to the non-clipped data elements and which of the set of gradients correspond to the clipped data elements (Page 33, section 2.3 sparse synchronization and decompression, the message representing compressed residuals of each node should include the information of indices and values of elements in communication-set. The information of indices and values of elements in communication-set read on the claimed mapping); generating, at the transmitting worker node, compressed data comprising the non-clipped data elements from the set of gradients (Page 33, section 2.3 sparse synchronization and decompression, the message representing compressed residuals of each node should include the information of indices and values of elements in communication-set. When using threshold binary search selection, the length of each node’s message is different. As a result, the packaged message should also include an initial element, which indicates the length of the compressed elements. Instead of using two allgather operations for indices and values message separately, we package the indices and values into a single message to reduce latency. The indices and values read on the claimed compressed data); transmitting the mapping and the compressed data from the transmitting worker node to a receiving worker node of the distributed system (Page 33, section 2.3 sparse synchronization and decompression, the message representing compressed residuals of each node. The message reads on the claimed transmitting); and generating, at the receiving worker node, decompressed data by combining the non-clipped data elements from the compressed data with the clipped data elements using the mapping, such that the decompressed data includes the set of gradients comprising the non-clipped data elements and the clipped data elements (Page 37, a decompression operation on collected data is required to add the sparse residual array to dense weight array. The sparse residual array and dense weight array read on the claimed clipped and non-clipped data elements. See also Page 33, section 2.3 sparse synchronization and decompression).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the distributed neural network training of Chetlur to include the synchronization techniques of Fang in order to reduce communication bandwidth while introducing limited overhead (see Fang at abstract and introduction).
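For illustration only, the packaged message described in the cited passage of Fang (an initial element giving the length of the compressed elements, followed by the indices and the values in a single message) and the decompression that adds the sparse residuals to a dense array can be sketched as follows. This is a minimal sketch under stated assumptions: the flat-list layout and the function names package_message and unpack_and_add are hypothetical, and the sketch does not reproduce Fang's actual implementation or its allgather communication.

```python
from typing import List, Sequence


def package_message(indices: Sequence[int], values: Sequence[float]) -> List[float]:
    """Pack [length, indices..., values...] into one message."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    # The initial element records how many compressed elements follow.
    return [len(indices)] + list(indices) + list(values)


def unpack_and_add(message: Sequence[float], dense: Sequence[float]) -> List[float]:
    """Add the sparse residuals carried by a message to a dense array."""
    n = int(message[0])
    indices = message[1:1 + n]
    values = message[1 + n:1 + 2 * n]
    result = list(dense)
    for i, v in zip(indices, values):
        result[int(i)] += v
    return result


if __name__ == "__main__":
    msg = package_message([0, 2], [0.5, 0.25])
    print(msg, unpack_and_add(msg, [1.0, 1.0, 1.0, 1.0]))
```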

As to independent claim 5, Chetlur teaches:
A method comprising (Title and paragraph 52):
computing, at a first worker node of a distributed system, a set of gradients using a neural network model and a set of weights associated with the neural network model (Figure 7, blocks 704, 706, and 710. Paragraph 83, in at least one embodiment, workers can be executed on separate threads, processors, processor cores, or processes on a computer system. In at least one embodiment, gradients are produced at least in part in parallel by workers. Paragraph 83, weight updates distributed to individual workers based on available processing power)….
Chetlur does not appear to expressly teach clipping each of the set of gradients having a value less than a threshold, resulting in the set of gradients comprising non-clipped data elements and clipped data elements; generating a mapping that indicates which of the set of gradients correspond to the non-clipped data elements and which of the set of gradients correspond to the clipped data elements; generating compressed data based on the non-clipped data elements from the set of gradients; and transmitting the mapping and the compressed data from the first worker node to a second worker node of the distributed system.
Fang teaches clipping each of the set of gradients having a value less than a threshold, resulting in the set of gradients comprising non-clipped data elements and clipped data elements (Page 32, we trim all elements that are less than the threshold and perform a top-k selection operation using radixSelect on the remaining elements. The trimming reads on the claimed clipping); generating a mapping that indicates which of the set of gradients correspond to the non-clipped data elements and which of the set of gradients correspond to the clipped data elements (Page 33, section 2.3 sparse synchronization and decompression, the message representing compressed residuals of each node should include the information of indices and values of elements in communication-set. The information of indices and values of elements in communication-set read on the claimed mapping); generating compressed data based on the non-clipped data elements from the set of gradients (Page 33, section 2.3 sparse synchronization and decompression, the message representing compressed residuals of each node should include the information of indices and values of elements in communication-set. When using threshold binary search selection, the length of each node’s message is different. As a result, the packaged message should also include an initial element, which indicates the length of the compressed elements. Instead of using two allgather operations for indices and values message separately, we package the indices and values into a single message to reduce latency. The indices and values read on the claimed compressed data); and transmitting the mapping and the compressed data from the first worker node to a second worker node of the distributed system (Page 33, section 2.3 sparse synchronization and decompression, the message representing compressed residuals of each node. The message reads on the claimed transmitting).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the distributed neural network training of Chetlur to include the synchronization techniques of Fang in order to reduce communication bandwidth while introducing limited overhead (see Fang at abstract and introduction).

As to dependent claim 6, Fang further teaches generating, at the second worker node, decompressed data by combining the non-clipped data elements from the compressed data with the clipped data elements using the mapping to obtain the set of gradients comprising the non-clipped data elements and the clipped data elements (Page 37, a decompression operation on collected data is required to add the sparse residual array to dense weight array. The sparse residual array and dense weight array read on the claimed clipped and non-clipped data elements. See also Page 33, section 2.3 sparse synchronization and decompression).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the distributed neural network training of Chetlur to include the synchronization techniques of Fang in order to reduce communication bandwidth while introducing limited overhead (see Fang at abstract and introduction).

As to dependent claim 11, Fang further teaches performing a sparsity analysis on the set of gradients to determine the threshold (Page 31, section 2.1 Parallel-friendly residual sparsification, a relative large threshold value is chosen according to mean and maximum value, for example, 0.8×(max−mean)+mean. Operation count_nonzero gets the number of elements whose absolute values are greater than the threshold. If the number is smaller than k (the number of top-0.1% elements), we dynamically decrease the threshold until we find the number of parameters whose absolute value above the threshold is larger than k).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the distributed neural network training of Chetlur to include the synchronization techniques of Fang in order to reduce communication bandwidth while introducing limited overhead (see Fang at abstract and introduction).
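For illustration only, the threshold selection described in the cited passage of Fang (start from a relatively large threshold such as 0.8×(max−mean)+mean, count the elements whose absolute values exceed it, and lower the threshold while fewer than k elements qualify) can be sketched as follows. This is a minimal sketch under stated assumptions: the passage does not specify how the threshold is decreased, so the fixed schedule of ratios, the use of the mean and maximum of absolute values, the stopping condition of at least k elements, and the function name find_threshold are hypothetical.

```python
from typing import Sequence


def find_threshold(values: Sequence[float], k: int,
                   ratios: Sequence[float] = (0.8, 0.6, 0.4, 0.2, 0.0)) -> float:
    """Lower the threshold step by step until at least k magnitudes exceed it."""
    if not values:
        raise ValueError("values must not be empty")
    magnitudes = [abs(v) for v in values]
    mean = sum(magnitudes) / len(magnitudes)
    peak = max(magnitudes)
    threshold = peak
    # Assumption: the threshold is decreased by walking a fixed list of ratios.
    for ratio in ratios:
        threshold = ratio * (peak - mean) + mean
        count = sum(1 for m in magnitudes if m > threshold)
        if count >= k:
            break
    # If no ratio yields k elements, the last (lowest) threshold is returned.
    return threshold


if __name__ == "__main__":
    print(find_threshold([1.0, -2.0, -9.0, 10.0], k=2))
```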

As to independent claim 13, Chetlur teaches:
A non-transitory computer-readable medium having stored therein instructions that, when executed by one or more processors, cause the one or more processors to perform operations including (Title and paragraphs 52 and 474):
computing, at a first worker node of a distributed system, a set of gradients using a neural network model and a set of weights associated with the neural network model (Figure 7, blocks 704, 706, and 710. Paragraph 83, in at least one embodiment, workers can be executed on separate threads, processors, processor cores, or processes on a computer system. In at least one embodiment, gradients are produced at least in part in parallel by workers. Paragraph 83, weight updates distributed to individual workers based on available processing power)….
Chetlur does not appear to expressly teach clipping each of the set of gradients having a value less than a threshold, resulting in the set of gradients comprising non-clipped data elements and clipped data elements; generating a mapping that indicates which of the set of gradients correspond to the non-clipped data elements and which of the set of gradients correspond to the clipped data elements; generating compressed data based on the non-clipped data elements from the set of gradients; and transmitting the mapping and the compressed data from the first worker node to a second worker node of the distributed system.
Fang teaches clipping each of the set of gradients having a value less than a threshold, resulting in the set of gradients comprising non-clipped data elements and clipped data elements (Page 32, we trim all elements that are less than the threshold and perform a top-k selection operation using radixSelect on the remaining elements. The trimming reads on the claimed clipping); generating a mapping that indicates which of the set of gradients correspond to the non-clipped data elements and which of the set of gradients correspond to the clipped data elements (Page 33, section 2.3 sparse synchronization and decompression, the message representing compressed residuals of each node should include the information of indices and values of elements in communication-set. The information of indices and values of elements in communication-set read on the claimed mapping); generating compressed data based on the non-clipped data elements from the set of gradients (Page 33, section 2.3 sparse synchronization and decompression, the message representing compressed residuals of each node should include the information of indices and values of elements in communication-set. When using threshold binary search selection, the length of each node’s message is different. As a result, the packaged message should also include an initial element, which indicates the length of the compressed elements. Instead of using two allgather operations for indices and values message separately, we package the indices and values into a single message to reduce latency. The indices and values read on the claimed compressed data); and transmitting the mapping and the compressed data from the first worker node to a second worker node of the distributed system (Page 33, section 2.3 sparse synchronization and decompression, the message representing compressed residuals of each node. The message reads on the claimed transmitting).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the distributed neural network training of Chetlur to include the synchronization techniques of Fang in order to reduce communication bandwidth while introducing limited overhead (see Fang at abstract and introduction).

As to dependent claim 14, Fang further teaches generating, at the second worker node, decompressed data by combining the non-clipped data elements from the compressed data with the clipped data elements using the mapping to obtain the set of gradients comprising the non-clipped data elements and the clipped data elements (Page 37, a decompression operation on collected data is required to add the sparse residual array to dense weight array. The sparse residual array and dense weight array read on the claimed clipped and non-clipped data elements. See also Page 33, section 2.3 sparse synchronization and decompression).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the distributed neural network training of Chetlur to include the synchronization techniques of Fang in order to reduce communication bandwidth while introducing limited overhead (see Fang at abstract and introduction).

As to dependent claim 19, Fang further teaches performing a sparsity analysis on the set of gradients to determine the threshold (Page 31, section 2.1 Parallel-friendly residual sparsification, a relative large threshold value is chosen according to mean and maximum value, for example, 0.8×(max−mean)+mean. Operation count_nonzero gets the number of elements whose absolute values are greater than the threshold. If the number is smaller than k (the number of top-0.1% elements), we dynamically decrease the threshold until we find the number of parameters whose absolute value above the threshold is larger than k).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the distributed neural network training of Chetlur to include the synchronization techniques of Fang in order to reduce communication bandwidth while introducing limited overhead (see Fang at abstract and introduction).

Claims 2, 7, 8, 15, and 16 are rejected under 35 U.S.C. § 103 as being unpatentable over Chetlur in view of Fang and Alistarh et al. (U.S. Pat. App. Pub. No. 2018/0075347, hereinafter Alistarh).

As to dependent claim 2, the rejection of claim 1 is incorporated.
Chetlur as modified by Fang does not appear to expressly teach forming, at the transmitting worker node, a header comprising the mapping and an original length of the set of gradients, the original length corresponding to a number of the non-clipped data elements and the clipped data elements.
Alistarh teaches forming, at the transmitting worker node, a header comprising the mapping and an original length of the set of gradients, the original length corresponding to a number of the non-clipped data elements and the clipped data elements (Paragraph 52, the decoder reads off a fixed number of bits at a header of the encoded stochastic gradient vector to obtain the magnitude of the original stochastic gradient vector).
Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the distributed neural network training of Chetlur as modified by Fang to include the distributed neural network training techniques of Alistarh in order to compress data for transmission between the computation nodes during training (see Alistarh at paragraph 16).

As to dependent claim 7, the rejection of claim 5 is incorporated.
Chetlur as modified by Fang does not appear to expressly teach forming a header comprising the mapping, wherein the header and the compressed data are transmitted from the first worker node to the second worker node.
Alistarh teaches forming a header comprising the mapping, wherein the header and the compressed data are transmitted from the first worker node to the second worker node (Paragraph 52, the decoder reads off a fixed number of bits at a header of the encoded stochastic gradient vector to obtain the magnitude of the original stochastic gradient vector).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the distributed neural network training of Chetlur to include the distributed neural network training techniques of Alistarh to compress data for transmission between the computation nodes during training (see Alistarh at paragraph 16).

As to dependent claim 8, Alistarh further teaches the header further comprises an original length of the set of gradients, the original length corresponding to a number of the non-clipped data elements and the clipped data elements (Paragraph 52, the decoder reads off a fixed number of bits at a header of the encoded stochastic gradient vector to obtain the magnitude of the original stochastic gradient vector).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the distributed neural network training of Chetlur to include the distributed neural network training techniques of Alistarh to compress data for transmission between the computation nodes during training (see Alistarh at paragraph 16).

As to dependent claim 15, the rejection of claim 13 is incorporated.
Chetlur as modified by Fang does not appear to expressly teach the operations further comprise:
forming a header comprising the mapping, wherein the header and the compressed data are transmitted from the first worker node to the second worker node.
Alistarh teaches the operations further comprise:
forming a header comprising the mapping, wherein the header and the compressed data are transmitted from the first worker node to the second worker node (Paragraph 52, the decoder reads off a fixed number of bits at a header of the encoded stochastic gradient vector to obtain the magnitude of the original stochastic gradient vector).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the distributed neural network training of Chetlur to include the distributed neural network training techniques of Alistarh to compress data for transmission between the computation nodes during training (see Alistarh at paragraph 16).

As to dependent claim 16, Alistarh further teaches the header further comprises an original length of the set of gradients, the original length corresponding to a number of the non-clipped data elements and the clipped data elements (Paragraph 52, the decoder reads off a fixed number of bits at a header of the encoded stochastic gradient vector to obtain the magnitude of the original stochastic gradient vector).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the distributed neural network training of Chetlur to include the distributed neural network training techniques of Alistarh to compress data for transmission between the computation nodes during training (see Alistarh at paragraph 16).

Claims 3, 4, 9, 10, 17, and 18 are rejected under 35 U.S.C. § 103 as being unpatentable over Chetlur in view of Fang and Wang et al. (Wang, Linnan, Wei Wu, Junyu Zhang, Hang Liu, George Bosilca, Maurice Herlihy, and Rodrigo Fonseca. "SuperNeurons: FFT-based Gradient Sparsification in the Distributed Training of Deep Neural Networks." arXiv preprint arXiv:1811.08596 (2018), hereinafter Wang).

As to dependent claim 3, the rejection of claim 1 is incorporated.
While Fang discusses using an extra bitmap to record the sign of each value element (see Page 33), Fang does not appear to expressly teach the mapping includes a bitmap with binary values indicating locations of the non-clipped data elements and the clipped data elements.
Wang teaches the mapping includes a bitmap with binary values indicating locations of the non-clipped data elements and the clipped data elements (Page 5, the status vector is a bitmap that tracks the location of non-zero elements, and its length in bits is the same as the gradient vector).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the distributed neural network training of Chetlur to include the gradient sparsification for distributed training techniques of Wang to strike a balance between compression ratio, accuracy, and computational overhead (see Wang at abstract and introduction).
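For illustration only, a status-vector style bitmap of the kind described in the cited passage of Wang may be sketched as follows. This is a minimal sketch and not Wang's implementation: the function names compress and decompress are hypothetical, and the bitmap is held as a list of 0/1 integers rather than packed bits.

```python
def compress(grads):
    """Return (bitmap, values): one bit per gradient marking the non-zero
    locations, plus the non-zero values in their original order."""
    bitmap = [1 if g != 0.0 else 0 for g in grads]
    values = [g for g in grads if g != 0.0]
    return bitmap, values

def decompress(bitmap, values):
    """Rebuild the full-length vector by placing each value at the next
    set bit and a zero at every cleared bit."""
    it = iter(values)
    return [next(it) if bit else 0.0 for bit in bitmap]
```

The bitmap has the same length as the gradient vector, so decompression restores every element to its original position.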

As to dependent claim 4, the rejection of claim 1 is incorporated.
While Fang teaches sparsification with non-zero data (see Page 32 algorithms using nonzero counts and indices), Fang does not appear to expressly teach clipping each of the set of gradients includes setting the value equal to zero such that the clipped data elements are zero data elements and the non-clipped data elements are non-zero data elements.
Wang teaches clipping each of the set of gradients includes setting the value equal to zero such that the clipped data elements are zero data elements and the non-clipped data elements are non-zero data elements (Page 3, where small gradients are treated as zero and not transmitted. Page 5, the status vector is a bitmap that tracks the location of non-zero elements, and its length in bits is the same as the gradient vector).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the distributed neural network training of Chetlur to include the gradient sparsification for distributed training techniques of Wang to strike a balance between compression ratio, accuracy, and computational overhead (see Wang at abstract and introduction).
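For illustration only, clipping by setting small gradients to zero may be sketched as follows. This is a minimal sketch under stated assumptions: the function name clip_to_zero is hypothetical, and the comparison is made on absolute value, which is one reading of "a value less than the threshold."

```python
def clip_to_zero(grads, threshold):
    """Set every gradient whose magnitude is below the threshold to zero.

    Elements that survive are the non-clipped (non-zero) data elements;
    comparing on abs(g) is an assumption, not a quoted requirement.
    """
    return [g if abs(g) >= threshold else 0.0 for g in grads]
```

After clipping, only the non-zero elements would need to be transmitted.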

As to dependent claim 9, the rejection of claim 5 is incorporated.
While Fang discusses using an extra bitmap to record the sign of each value element (see Page 33), Fang does not appear to expressly teach the mapping includes a bitmap with binary values indicating locations of the non-clipped data elements and the clipped data elements.
Wang teaches the mapping includes a bitmap with binary values indicating locations of the non-clipped data elements and the clipped data elements (Page 5, the status vector is a bitmap that tracks the location of non-zero elements, and its length in bits is the same as the gradient vector).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the distributed neural network training of Chetlur to include the gradient sparsification for distributed training techniques of Wang to strike a balance between compression ratio, accuracy, and computational overhead (see Wang at abstract and introduction).

As to dependent claim 10, the rejection of claim 5 is incorporated.
While Fang teaches sparsification with non-zero data (see Page 32 algorithms using nonzero counts and indices), Fang does not appear to expressly teach clipping each of the set of gradients includes setting the value equal to zero such that the clipped data elements are zero data elements and the non-clipped data elements are non-zero data elements.
Wang teaches clipping each of the set of gradients includes setting the value equal to zero such that the clipped data elements are zero data elements and the non-clipped data elements are non-zero data elements (Page 3, where small gradients are treated as zero and not transmitted. Page 5, the status vector is a bitmap that tracks the location of non-zero elements, and its length in bits is the same as the gradient vector).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the distributed neural network training of Chetlur to include the gradient sparsification for distributed training techniques of Wang to strike a balance between compression ratio, accuracy, and computational overhead (see Wang at abstract and introduction).

As to dependent claim 17, the rejection of claim 13 is incorporated.
While Fang discusses using an extra bitmap to record the sign of each value element (see Page 33), Fang does not appear to expressly teach the mapping includes a bitmap with binary values indicating locations of the non-clipped data elements and the clipped data elements.
Wang teaches the mapping includes a bitmap with binary values indicating locations of the non-clipped data elements and the clipped data elements (Page 5, the status vector is a bitmap that tracks the location of non-zero elements, and its length in bits is the same as the gradient vector).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the distributed neural network training of Chetlur to include the gradient sparsification for distributed training techniques of Wang to strike a balance between compression ratio, accuracy, and computational overhead (see Wang at abstract and introduction).

As to dependent claim 18, the rejection of claim 13 is incorporated.
While Fang teaches sparsification with non-zero data (see Page 32 algorithms using nonzero counts and indices), Fang does not appear to expressly teach clipping each of the set of gradients includes setting the value equal to zero such that the clipped data elements are zero data elements and the non-clipped data elements are non-zero data elements.
Wang teaches clipping each of the set of gradients includes setting the value equal to zero such that the clipped data elements are zero data elements and the non-clipped data elements are non-zero data elements (Page 3, where small gradients are treated as zero and not transmitted. Page 5, the status vector is a bitmap that tracks the location of non-zero elements, and its length in bits is the same as the gradient vector).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the distributed neural network training of Chetlur to include the gradient sparsification for distributed training techniques of Wang to strike a balance between compression ratio, accuracy, and computational overhead (see Wang at abstract and introduction).

Claims 12 and 20 are rejected under 35 U.S.C. § 103 as being unpatentable over Chetlur in view of Fang and Xu et al. (U.S. Pat. App. Pub. No. 2019/0362235, hereinafter Xu).

As to dependent claim 12, the rejection of claim 11 is incorporated.
Fang further teaches performing the sparsity analysis includes: calculating an average for the set of gradients (Page 31, section 2.1 Parallel-friendly residual sparsification, a relatively large threshold value is chosen according to the mean and maximum value).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the distributed neural network training of Chetlur to include the synchronization techniques of Fang to reduce communication bandwidth while introducing limited overhead (see Fang at abstract and introduction).
Chetlur as modified by Fang does not appear to expressly teach calculating a standard deviation for the set of gradients; and determining the threshold based on the average and the standard deviation.
Xu teaches calculating a standard deviation for the set of gradients (Paragraph 56, computed from the mean and standard deviation of the weights); and determining the threshold based on the average and the standard deviation (Paragraph 56, threshold that is computed from the mean and standard deviation).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the distributed neural network training of Chetlur to include the sparsity analysis techniques of Xu to increase performance over other sparsified neural networks (see Xu at paragraph 56). 
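For illustration only, determining a threshold from an average and a standard deviation may be sketched as follows. This is a minimal sketch and not Xu's method: the form mean + alpha × standard deviation, the parameter alpha, the use of gradient magnitudes, and the function name stat_threshold are all assumptions.

```python
import statistics

def stat_threshold(grads, alpha=1.0):
    """Threshold from the mean and population standard deviation of |g|.

    alpha is a hypothetical scale factor; the cited paragraph only states
    that the threshold is computed from the mean and standard deviation.
    """
    mags = [abs(g) for g in grads]
    mean = statistics.fmean(mags)
    std = statistics.pstdev(mags)
    return mean + alpha * std
```

A larger alpha in this sketch raises the threshold and therefore clips more of the gradients.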

As to dependent claim 20, the rejection of claim 19 is incorporated.
Fang further teaches performing the sparsity analysis includes: calculating an average for the set of gradients (Page 31, section 2.1 Parallel-friendly residual sparsification, a relatively large threshold value is chosen according to the mean and maximum value).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the distributed neural network training of Chetlur to include the synchronization techniques of Fang to reduce communication bandwidth while introducing limited overhead (see Fang at abstract and introduction).
Chetlur as modified by Fang does not appear to expressly teach calculating a standard deviation for the set of gradients; and determining the threshold based on the average and the standard deviation.
Xu teaches calculating a standard deviation for the set of gradients (Paragraph 56, computed from the mean and standard deviation of the weights); and determining the threshold based on the average and the standard deviation (Paragraph 56, threshold that is computed from the mean and standard deviation).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, having the distributed neural network training of Chetlur to include the sparsity analysis techniques of Xu to increase performance over other sparsified neural networks (see Xu at paragraph 56). 

Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicant's disclosure. Applicant is required under 37 C.F.R. § 1.111(c) to consider these references fully when responding to this action.
It is noted that any citation to specific pages, columns, lines, or figures in the prior art references and any interpretation of the references should not be considered to be limiting in any way. A reference is relevant for all it contains and may be relied upon for all that it would have reasonably suggested to one having ordinary skill in the art. In re Heck, 699 F.2d 1331, 1332-33, 216 U.S.P.Q. 1038, 1039 (Fed. Cir. 1983) (quoting In re Lemelson, 397 F.2d 1006, 1009, 158 U.S.P.Q. 275, 277 (C.C.P.A. 1968)).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Casey R. Garner whose telephone number is 571-272-2467. The examiner can normally be reached on Monday to Friday, 8am to 5pm, Eastern Time.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov, can be reached on 571-270-3428. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Casey R. Garner/
Examiner, Art Unit 2123