DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The present application was filed on 09/22/2019.
This office action is responsive to the amendments and/or remarks filed on 08/31/2022. In the current amendments, claims 1-4, 10-13 and 15 have been amended and claims 16-20 have been added. Claims 1-20 are currently pending and have been examined. 
In response to arguments and/or remarks filed on 08/31/2022, the 35 U.S.C. 101 software per se rejections made in the previous Office Action have been withdrawn. 
In response to arguments and/or remarks filed on 08/31/2022, the 35 U.S.C. 102 rejections made in the previous Office Action have been withdrawn. 


Claim Rejections - 35 USC § 101 – Abstract Idea
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 for containing an abstract idea without significantly more. 

Regarding Claim 1:
Step 1 – Is the claim to a process, machine, manufacture or composition of matter?
Yes, the claim is a process.
Step 2A – Prong 1 – Does the claim recite an abstract idea, law of nature, or natural phenomenon?
Yes, the claim recites an abstract idea.
“determining differences between the previous values and current values of the set of parameters;” – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III. C.).
“calculating a zero-order term and a first-order term of a series expansion based on the feedback data and the differences; and updating the current values based on the zero-order term and the first-order term to obtain updated values of the set of the parameters.” – This limitation, under its broadest reasonable interpretation, recites mathematical calculations, which a human could perform using pen and paper. 
Step 2A – Prong 2 – Does the claim recite additional elements that integrate the judicial exception into a practical application?
No, there are no additional elements that integrate the judicial exception into a practical application. The additional elements:
“A computer-implemented method, comprising:” – This limitation is directed to merely using a generic computer as a tool (see MPEP 2106.04(d)).
“receiving, from a worker, feedback data generated by training a machine learning model, the feedback data being associated with previous values of a set of parameters of the machine learning model at the worker;” – This limitation is directed to insignificant extra-solution activity (see MPEP 2106.05(g)). 
“processor,” - 
“and updating the current values based on the feedback data and the differences to obtain updated values of the set of the parameters.” – This limitation is directed to mere data gathering which has been recognized by the courts (as per Ultramercial, 772 F.3d at 715, 112 USPQ2d at 1754) as insignificant extra-solution activity (see MPEP 2106.05(g)). 
Step 2B - Does the claim recite additional elements that amount to significantly more than the judicial exception?
	No, there are no additional elements that amount to significantly more than the judicial exception.
“receiving, from a worker, feedback data generated by training a machine learning model, the feedback data being associated with previous values of a set of parameters of the machine learning model at the worker;” – This limitation is directed to receiving or transmitting data over a network. The courts (as per Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362) have recognized receiving or transmitting data over a network as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity (see MPEP 2106.05(d) II.).
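For illustration only, the computation recited in claim 1 can be sketched as follows. The sketch assumes that the series expansion is a Taylor expansion of the gradient about the previous parameter values, so that the zero-order term is the gradient reported by the worker and the first-order term is a coefficient matrix applied to the differences. Neither the claim language quoted above nor this action fixes that form, and the names `update_parameters`, `coeffs`, and `lr` are hypothetical.

```python
def update_parameters(current, previous, feedback, coeffs, lr=0.1):
    """Hypothetical sketch: update `current` using stale feedback from a worker."""
    n = len(current)
    # Differences between the previous values (held by the worker) and the current values.
    diffs = [current[i] - previous[i] for i in range(n)]
    # Zero-order term (assumed): the feedback (gradient) evaluated at the previous values.
    zero_order = list(feedback)
    # First-order term (assumed): a coefficient matrix applied to the differences.
    first_order = [sum(coeffs[i][j] * diffs[j] for j in range(n)) for i in range(n)]
    # Updated values: step against the sum of the two terms.
    return [current[i] - lr * (zero_order[i] + first_order[i]) for i in range(n)]


previous = [1.0, 2.0]              # values the worker trained against
current = [1.5, 2.0]               # values now held by the server
feedback = [0.2, -0.4]             # gradient reported by the worker
coeffs = [[1.0, 0.0], [0.0, 1.0]]  # identity, chosen only for this example
updated = update_parameters(current, previous, feedback, coeffs)
```

Each step in the sketch is elementary arithmetic on short lists of numbers.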

Regarding Claim 2:
Claim 2 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 1 which included an abstract idea (see rejection for claim 1). This claim merely recites a further limitation on the receiving limitation which was directed to well-understood, routine, conventional activity. The additional limitation “wherein the feedback data indicate significant trends of change of an optimization objective of the machine learning model with respect to the previous values of the set of parameters” is directed to a field of use (see MPEP 2106.05(h)) as it merely limits the field of the feedback data. Thus, the judicial exception is not integrated into a practical application (see MPEP 2106.04(d) I.), failing step 2A Prong 2. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception under step 2B. 

Regarding Claim 3:
Claim 3 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 1 which included an abstract idea (see rejection for claim 1). This claim merely recites a further limitation on the updating limitation from claim 1 which was directed to insignificant extra-solution activity. The claim recites additional abstract ideas:
“determining coefficients of a transformation based on the significant trends of change;” – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III. C.).
“and determining differential amounts between the current values and the updated values by applying the transformation on the differences.” – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III. C.).
Thus, the judicial exception is not integrated into a practical application (see MPEP 2106.04(d) I.), failing step 2A Prong 2. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception under step 2B. 

Regarding Claim 4:
Claim 4 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 1 which included an abstract idea (see rejection for claim 1). This claim merely recites a further limitation on the determining coefficients of a transformation limitation from claim 3 which was directed to the abstract idea of a mental process. The additional limitation:
“wherein the transformation is a linear transformation, the coefficients are linear rates of change, and the significant trends of change are represented by a gradient of the optimization objective with respect to the previous values of the set of parameters.” – This limitation is directed to the field of use (see MPEP 2106.05(h)) as it merely limits the fields of the transformation, coefficients, and trends of change. 
Thus, the judicial exception is not integrated into a practical application (see MPEP 2106.04(d) I.), failing step 2A Prong 2. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception under step 2B. 

Regarding Claim 5:
Claim 5 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 1 which included an abstract idea (see rejection for claim 1). This claim merely recites a further limitation on the determining coefficients of a transformation limitation from claim 3 which was directed to the abstract idea of a mental process. The additional limitation:
“computing a tensor product of the gradient as unbiased estimates of the linear rates of change.” – This limitation is directed to the abstract idea of mathematical concepts (see MPEP 2106.04(a)(2)) as it involves calculations of tensor products. 
Thus, the judicial exception is not integrated into a practical application (see MPEP 2106.04(d) I.), failing step 2A Prong 2. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception under step 2B. 
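As a minimal sketch of the calculation recited in claim 5, the following assumes that “a tensor product of the gradient” means the outer product of the gradient vector with itself, giving one coefficient per pair of parameters. The claim language quoted above does not specify this, and the function name `outer_product` is hypothetical.

```python
def outer_product(gradient):
    """Return the matrix g g^T as a list of rows (assumed reading of 'tensor product')."""
    return [[gi * gj for gj in gradient] for gi in gradient]


gradient = [0.2, -0.4]           # example gradient reported by a worker
rates = outer_product(gradient)  # assumed estimates of the linear rates of change
```

The result is a symmetric matrix whose entries are products of pairs of gradient components.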

Regarding Claim 6:
Claim 6 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 1 which included an abstract idea (see rejection for claim 1). This claim merely recites a further limitation on the determining coefficients of a transformation limitation from claim 3 which was directed to the abstract idea of a mental process. The additional limitations:
“determining, based on the gradient, magnitudes of rates of change of the optimization objective with respect to respective parameters in the set of parameters;” – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III. C.).
“and determining the linear rates of change based on the magnitudes of the rates of change.” – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III. C.).
Thus, the judicial exception is not integrated into a practical application (see MPEP 2106.04(d) I.), failing step 2A Prong 2. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception under step 2B. 

Regarding Claim 7:
Claim 7 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 1 which included an abstract idea (see rejection for claim 1). This claim merely recites a further limitation on the determining linear rates of change limitation from claim 6 which was directed to the abstract idea of a mental process. The additional limitations:
“computing squares of the magnitudes of the rates of change;” – This limitation is directed to the abstract idea of mathematical concepts (see MPEP 2106.04(a)(2)) as it involves computing squares. 
“and determining the linear rates of change based on the squares of the magnitudes of the rates of change.” – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III. C.).
Thus, the judicial exception is not integrated into a practical application (see MPEP 2106.04(d) I.), failing step 2A Prong 2. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception under step 2B. 
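The limitations of claims 6 and 7 can be illustrated in the same hedged way. The sketch below assumes that each linear rate of change is the square of the magnitude of the corresponding gradient component, optionally scaled, so that the coefficients form a diagonal. The scaling factor `scale` and the function names are hypothetical and do not appear in the claims.

```python
def diagonal_rates(gradient, scale=1.0):
    """Hypothetical sketch: per-parameter rates from squared gradient magnitudes."""
    # Magnitudes of the rates of change with respect to each parameter.
    magnitudes = [abs(g) for g in gradient]
    # Squares of the magnitudes.
    squares = [m * m for m in magnitudes]
    # Linear rates of change based on the squares (assumed: a simple scaling).
    return [scale * s for s in squares]


def apply_diagonal(rates, diffs):
    """Apply the diagonal coefficients to the parameter differences."""
    return [r * d for r, d in zip(rates, diffs)]


rates = diagonal_rates([3.0, -4.0])
first_order = apply_diagonal(rates, [0.5, 0.25])
```

With the example values, the rates are 9.0 and 16.0, and applying them to the differences gives 4.5 and 4.0.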

Regarding Claim 8:
Claim 8 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 1 which included an abstract idea (see rejection for claim 1). The additional limitations:
“receiving a request for the set of parameters from the worker;” – This limitation is directed to receiving or transmitting data over a network. The courts (as per Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362) have recognized receiving or transmitting data over a network as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity (see MPEP 2106.05(d) II.).
“and in response to the request, transmitting the updated values of the set of parameters to the worker.” – This limitation is directed to receiving or transmitting data over a network. The courts (as per Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362) have recognized receiving or transmitting data over a network as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity (see MPEP 2106.05(d) II.).
Thus, the judicial exception is not integrated into a practical application (see MPEP 2106.04(d) I.), failing step 2A Prong 2. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception under step 2B. 
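For illustration, the request and transmit steps of claim 8 can be sketched with an in-process stand-in for the server. The class `ParameterServer`, its methods, and the per-worker bookkeeping of transmitted values are assumptions made for this example; the claim recites only receiving a request and transmitting the updated values in response.

```python
class ParameterServer:
    """Hypothetical in-process stand-in for the server side of claim 8."""

    def __init__(self, values):
        self._values = list(values)
        self._sent = {}  # assumed bookkeeping: last values transmitted to each worker

    def handle_request(self, worker_id):
        # Receiving a request for the set of parameters from the worker.
        snapshot = list(self._values)
        # Remember what this worker was given; these become its previous values.
        self._sent[worker_id] = snapshot
        # In response to the request, transmit (here: return) the updated values.
        return list(snapshot)

    def previous_values(self, worker_id):
        # Return a copy of the values last transmitted to this worker.
        return list(self._sent[worker_id])


server = ParameterServer([1.43, 2.04])
received = server.handle_request("worker-0")
```

A real deployment would exchange these values over a network; the sketch replaces that exchange with a method call.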

Regarding Claim 9:
Claim 9 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 1 which included an abstract idea (see rejection for claim 1). This claim merely recites a further limitation on the machine learning model of the receiving limitation from claim 1 which was directed to well-understood, routine, conventional activity. The additional limitation:
“wherein the machine learning model includes a neural network model and the optimization objective is represented by a cross entropy loss function.” – This limitation is directed to the field of use (see MPEP 2106.05(h)) as it merely limits the fields of the machine learning model and optimization objective. 
Thus, the judicial exception is not integrated into a practical application (see MPEP 2106.04(d) I.), failing step 2A Prong 2. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception under step 2B. 

Regarding Claim 10:
Step 1 – Is the claim to a process, machine, manufacture or composition of matter?
Yes, the claim is a product.
Step 2A – Prong 1 – Does the claim recite an abstract idea, law of nature, or natural phenomenon?
Yes, the claim recites an abstract idea.
“determining differences between the previous values and current values of the set of parameters;” – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III. C.).
Step 2A – Prong 2 – Does the claim recite additional elements that integrate the judicial exception into a practical application?
No, there are no additional elements that integrate the judicial exception into a practical application. The additional elements:
“An electronic device, comprising: a processing unit; a memory coupled to the processing unit and storing instructions for execution by the processing unit, the instructions, when executed by the processing unit, causing the electronic device to perform acts comprising:” – This limitation is directed to merely using a generic computer as a tool (see MPEP 2106.04(d)).
“receiving, from a worker, feedback data generated by training a machine learning model, the feedback data being associated with previous values of a set of parameters of the machine learning model at the worker;” – This limitation is directed to insignificant extra-solution activity (see MPEP 2106.05(g)). 
“and updating the current values based on the feedback data and the differences to obtain updated values of the set of the parameters.” – This limitation is directed to mere data gathering which has been recognized by the courts (as per Ultramercial, 772 F.3d at 715, 112 USPQ2d at 1754) as insignificant extra-solution activity (see MPEP 2106.05(g)). 
Step 2B - Does the claim recite additional elements that amount to significantly more than the judicial exception?
	No, there are no additional elements that amount to significantly more than the judicial exception.
“receiving, from a worker, feedback data generated by training a machine learning model, the feedback data being associated with previous values of a set of parameters of the machine learning model at the worker;” – This limitation is directed to receiving or transmitting data over a network. The courts (as per Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362) have recognized receiving or transmitting data over a network as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity (see MPEP 2106.05(d) II.).

Regarding Claim 11:
Claim 11 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 10 which included an abstract idea (see rejection for claim 10). This claim merely recites a further limitation on the receiving limitation which was directed to well-understood, routine, conventional activity. The additional limitation “wherein the feedback data indicate significant trends of change of an optimization objective of the machine learning model with respect to the previous values of the set of parameters” is directed to a field of use (see MPEP 2106.05(h)) as it merely limits the field of the feedback data. Thus, the judicial exception is not integrated into a practical application (see MPEP 2106.04(d) I.), failing step 2A Prong 2. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception under step 2B. 

Regarding Claim 12:
Claim 12 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 10 which included an abstract idea (see rejection for claim 10). This claim merely recites a further limitation on the updating limitation from claim 10 which was directed to insignificant extra-solution activity. The claim recites additional abstract ideas:
“determining coefficients of a transformation based on the significant trends of change;” – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III. C.).
“and determining differential amounts between the current values and the updated values by applying the transformation on the differences.” – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III. C.).
Thus, the judicial exception is not integrated into a practical application (see MPEP 2106.04(d) I.), failing step 2A Prong 2. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception under step 2B. 

Regarding Claim 13:
Claim 13 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 10 which included an abstract idea (see rejection for claim 10). This claim merely recites a further limitation on the determining coefficients of a transformation limitation from claim 12 which was directed to the abstract idea of a mental process. The additional limitation:
“wherein the transformation is a linear transformation, the coefficients are linear rates of change, and the significant trends of change are represented by a gradient of the optimization objective with respect to the previous values of the set of parameters.” – This limitation is directed to the field of use (see MPEP 2106.05(h)) as it merely limits the fields of the transformation, coefficients, and trends of change. 
Thus, the judicial exception is not integrated into a practical application (see MPEP 2106.04(d) I.), failing step 2A Prong 2. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception under step 2B. 

Regarding Claim 14:
Claim 14 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claim is dependent on claim 10 which included an abstract idea (see rejection for claim 10). This claim merely recites a further limitation on the determining coefficients of a transformation limitation from claim 12 which was directed to the abstract idea of a mental process. The additional limitation:
“computing a tensor product of the gradient as unbiased estimates of the linear rates of change.” – This limitation is directed to the abstract idea of mathematical concepts (see MPEP 2106.04(a)(2)) as it involves calculations of tensor products. 
Thus, the judicial exception is not integrated into a practical application (see MPEP 2106.04(d) I.), failing step 2A Prong 2. The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception under step 2B. 

Regarding Claim 15:
Step 1 – Is the claim to a process, machine, manufacture or composition of matter?
Yes, the claim is a product.
Step 2A – Prong 1 – Does the claim recite an abstract idea, law of nature, or natural phenomenon?
Yes, the claim recites an abstract idea.
“determine differences between the previous values and current values of the set of parameters;” – This limitation is directed to the abstract idea of a mental process (including an observation, evaluation, judgment, opinion) which can be performed in the human mind, or by a human using pen and paper (see MPEP 2106.04(a)(2) III. C.).
Step 2A – Prong 2 – Does the claim recite additional elements that integrate the judicial exception into a practical application?
No, there are no additional elements that integrate the judicial exception into a practical application. The additional elements:
“A computer program product stored in a computer storage medium and comprising machine executable instructions which, when executed in a device, cause the device to:” – This limitation is directed to merely using a generic computer as a tool (see MPEP 2106.04(d)).
“receive, from the worker, feedback data generated by training a machine learning model, the feedback data being associated with previous values of the set of parameters of the machine learning model at the worker;” – This limitation is directed to insignificant extra-solution activity (see MPEP 2106.05(g)). 
“and update the current values based on the feedback data and the differences to obtain updated values of the set of the parameters.” – This limitation is directed to mere data gathering which has been recognized by the courts (as per Ultramercial, 772 F.3d at 715, 112 USPQ2d at 1754) as insignificant extra-solution activity (see MPEP 2106.05(g)). 
Step 2B - Does the claim recite additional elements that amount to significantly more than the judicial exception?
	No, there are no additional elements that amount to significantly more than the judicial exception.
“receiving, from a worker, feedback data generated by training a machine learning model, the feedback data being associated with previous values of a set of parameters of the machine learning model at the worker;” – This limitation is directed to receiving or transmitting data over a network. The courts (as per Symantec, 838 F.3d at 1321, 120 USPQ2d at 1362) have recognized receiving or transmitting data over a network as well‐understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity (see MPEP 2106.05(d) II.).




Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-5, 9-11 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over De et al. (“Scaling Up Distributed Stochastic Gradient Descent Using Variance Reduction”) (hereinafter De) in view of Pedrycz et al. (“Cluster-Centric Fuzzy Modeling”) and further in view of Strom (US Pat No. 10152676 B1). 
Regarding Claim 1 (Currently Amended)
	De teaches a computer-implemented method, (De discloses methods for stochastic gradient descent implemented on distributed systems in sec. 4 ¶1-2, “We now consider the distributed setting, with a single central server and p local client servers, each of which contains a portion of the data set. […] Our goal is to derive stochastic algorithms in this distributed setting that scale linearly to high p, while remaining stable even under low communication frequencies between local and central nodes.”)
comprising: receiving, from a worker …feedback data generated by training a machine learning model, the feedback data being associated with previous values of a set of parameters of the machine learning model at the worker…; (De discloses receiving data from workers in Algorithm 3. The data received, Δx, is associated with previous values of a set of parameters.)
determining differences between the previous values and current values of the set of parameters; (De discloses calculating differences between previous and current parameters in Algorithm 3 line 13) 
De does not teach …processor …calculating a zero-order term and a first-order term of a series expansion based on the feedback data and the differences; 
and updating the current values based on the zero-order term and the first-order term to obtain updated values of the set of the parameters.
Pedrycz teaches calculating a zero-order term (pg. 1587 “We briefly review a standard least-squares error (LSE) technique along with fuzzy clustering to construct a zero-order T–S fuzzy model. Next, we propose a cluster-based fuzzy model, and finally expand the constructed zero-order models to the first-order models.”) and a first-order term of a series expansion based on the feedback data and the differences; (pg. 1590 right col “As shown in Table II, expanding the zero-order model to the first-order reduces the values of the training and testing errors. When comparing the first and the second strategy that are used to design first-order models, in most cases, the differences between the results are negligible. In comparing the complexities of the two strategies (refer to Fig. 1), one may realize that in the first method, the parameters of the first-order model are calculated only once, but for the second strategy, these parameters should be estimated for all values of λ.”)
and updating the current values based on the zero-order term and the first-order term to obtain updated values of the set of the parameters. (Pg. 1589 left col “In the first approach, Fig. 1(a), we construct the zero-order model and at this stage, the value of the parameter λ is optimized (see the feedback loop shown in the figure). In the sequel, the first-order fuzzy model is formed. In contrast, as illustrated in Fig. 1(b), the zero-order model is formed, afterward refined by building the first-order model, and at this stage, the value of λ is optimized; note a feedback loop shown in Fig. 1(b).”)
De and Pedrycz are analogous art because they are both directed to Machine Learning. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the distributed stochastic gradient descent method disclosed by De to include updating the current values based on the zero-order term and the first-order term to obtain updated values of the set of the parameters, as taught by Pedrycz, in order to reduce training and testing errors as disclosed by Pedrycz (pg. 1590 right col “As shown in Table II, expanding the zero-order model to the first-order reduces the values of the training and testing errors. When comparing the first and the second strategy that are used to design first-order models, in most cases, the differences between the results are negligible. In comparing the complexities of the two strategies (refer to Fig. 1), one may realize that in the first method, the parameters of the first-order model are calculated only once, but for the second strategy, these parameters should be estimated for all values of λ.”). 
De in view of Pedrycz does not teach a processor.
Strom teaches a processor (col 12 lines 61-65 “A processor device can also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors”).
De, Pedrycz and Strom, and the instant application are analogous art because they are all directed to training of machine learning models in distributed environments.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the distributed stochastic gradient descent method disclosed by De in view of Pedrycz to include a machine learning model, including a neural network model, implemented on a processor as taught by Strom. 
One would be motivated to do so to efficiently train machine learning models and reduce bandwidth, as suggested by Strom (Strom col 3 lines 3-42: “Aspects of this disclosure relate to efficiently distributing the training of models across multiple computing nodes (e.g., two or more separate computing devices). […] In order to reduce the bandwidth required to continuously or periodically exchange such update data among the multiple computing devices, only those updates which are expected to provide a substantive change to the model may be applied and exchanged. […] This can improve the efficiency of the distributed training process by substantially reducing the volume of data that is transmitted and the number of times a given parameter is updated.”).



Regarding Claim 2 (Currently Amended)
De in view of Pedrycz with Strom teaches “The method of claim 1” as seen above.

De further teaches wherein the feedback data indicate trends of change of an optimization objective of the machine learning model with respect to the previous values of the set of parameters. (Examiner notes that “significant” is not defined and is interpreted as any trend of change. Examiner further notes that a trend of change is equivalent to a gradient, as a gradient is a rate of change. De discloses that the feedback data received from workers is used to show the trend of change (i.e., gradient) of the objective function, f(x), in sec. 4 ¶5, “Thus, when the central server receives parameters from a local node s, the updates it performs have the form [equation image not reproduced] where $\hat{x}_s$ and $\hat{g}_s$ are given”)
 
Regarding claim 11
Claim 11 recites analogous limitations to claim 2 and therefore is rejected on the same ground as claim 2. 

Regarding Claim 3 (Currently Amended)
De in view of Pedrycz with Strom teaches “The method of claim 2” as seen above.  
De further teaches wherein updating the current values comprises: (De discloses updating in Algorithm 3, as shown on line 21). 
determining coefficients of a transformation based on the significant trends of change; (Examiner notes that “transformation” is not well defined and is interpreted to be anything that results in a change of values. De discloses a transformation in which the coefficients Δx and Δg are based on the significant trends of change, as shown on line 20 of Algorithm 3.)
and determining differential amounts between the current values and the updated values by applying the transformation on the differences. (As per ¶38 lines 7-8 of the instant specification, “differential amounts” is interpreted to be the update amounts of model parameters. De discloses determining differential amounts in sec. 4 ¶6, “Sending the change in the local parameter values, rather than the local parameters themselves, ensures that when updating the central parameter, the previous contribution to the average from that local worker is just replaced by the new value.” De further discloses that the transformation (shown in Algorithm 3 line 20) is applied to the differences (shown in Algorithm 3 line 13).)
Regarding claim 12
Claim 12 recites analogous limitations to claim 3 and therefore is rejected on the same ground as claim 3. 
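
For illustration only, the replacement behavior described in the passage of De sec. 4 ¶6 quoted for claim 3 may be sketched as follows. This is a minimal sketch under stated assumptions: the central parameter is assumed to be the plain average of the workers' local parameters, and the names `apply_worker_delta`, `central`, `delta` and `num_workers` are hypothetical and do not appear in De or in the instant specification.

```python
def apply_worker_delta(central, delta, num_workers):
    """Return the central average after one worker reports delta = new_local - old_local.

    Assumption (not from De): the central value is the plain mean of all local values.
    """
    return [c + d / num_workers for c, d in zip(central, delta)]


# Hypothetical example: three workers, two parameters each.
local = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
central = [sum(col) / len(local) for col in zip(*local)]

# Worker 0 moves to new local values and sends only the change.
new_local_0 = [4.0, 8.0]
delta = [n - o for n, o in zip(new_local_0, local[0])]
updated = apply_worker_delta(central, delta, len(local))
```

Under these assumptions, `updated` equals the average that would be obtained by recomputing the mean with worker 0's old values replaced by its new values, which is the effect the quoted passage describes.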

Regarding Claim 4 (Currently Amended)
De in view of Pedrycz with Strom teaches “The method of claim 3” as seen above.  
De further teaches wherein the transformation is a linear transformation, the coefficients are linear rates of change, (De discloses the transformations in Algorithm 3, on line 20. It can be seen that the transformations are linear.)

[Image: media_image1.png]
and the trends of change are represented by a gradient of the optimization objective with respect to the previous values of the set of parameters. (De discloses that the significant trends of change are gradients in sec. 4 ¶5, “Thus, when the central server receives parameters from a local node s, the updates it performs have the form […] where $\hat{x}_s$ and $\hat{g}_s$ are now given by
[Image: media_image2.png]
”)
Regarding claim 13
Claim 13 recites analogous limitations to claim 4 and therefore is rejected on the same ground as claim 4. 

Regarding Claim 5:
De in view of Pedrycz with Strom teaches “The method of claim 4” as seen above.  
De further teaches wherein determining the coefficients of the transformation comprises: (Examiner notes that “transformation” is not well defined and is interpreted to be anything that results in a change of values. De discloses a transformation in which the coefficients Δx and Δg are determined based on the significant trends of change, shown on Algorithm 3 line 20.)
computing a tensor product of the gradient as unbiased estimates of the linear rates of change. (De discloses unbiased estimates in sec. 2 ¶4, “Second, in one epoch, we traverse over the dataset using a random permutation over the indices (i.e., indices are chosen without replacement), instead of a random access (with replacement, as in SVRG or SAGA). This ensures that the average gradient we accumulate over one epoch is unbiased, and thus is a good estimate of the true gradient.” De further discloses computing a tensor product (i.e. multiplication of a scalar and vector which are types of tensors) for the unbiased estimate in the highlighted section of Algorithm 3, shown below. 

[Image: media_image3.png])


Regarding claim 14
Claim 14 recites analogous limitations to claim 5 and therefore is rejected on the same ground as claim 5. 
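
For illustration only, the interpretation stated for claim 5, under which multiplication of a scalar and a vector is a tensor product of an order-0 tensor and an order-1 tensor, may be sketched with a general outer-product routine. The function name `tensor_product` and the nested-list representation are assumptions made for this sketch; they are not taken from De or from the instant specification.

```python
def tensor_product(a, b):
    """Tensor (outer) product of tensors stored as nested lists.

    A plain number is treated as an order-0 tensor.
    """
    if isinstance(a, list):
        return [tensor_product(x, b) for x in a]
    if isinstance(b, list):
        return [tensor_product(a, y) for y in b]
    return a * b


# Order 0 with order 1: reduces to scalar-vector multiplication.
scaled = tensor_product(0.5, [2.0, 4.0])

# Order 1 with order 1: yields a matrix of pairwise products.
outer = tensor_product([1.0, 2.0], [3.0, 4.0])
```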

Regarding Claim 9
De in view of Pedrycz with Strom teaches “The method of claim 1” as seen above.  
Strom further teaches wherein the machine learning model includes a neural network model (Strom teaches in col 4 lines 29-32, “aspects of the embodiments described in the disclosure will focus, for the purpose of illustration, on distributed execution of stochastic gradient descent to train neural network-based models”)
and the optimization objective is represented by a cross entropy loss function. (Strom discloses in col 9 lines 49-53, “As shown in FIG. 3, the gradient computation module may use an objective function 308 to determine the error 310 for the output vector 306 in comparison with the known correct output for the particular input vector 302. For example, L2-norm or cross entropy may be used.”) 
De, Pedrycz and Strom, and the instant application are analogous art because they are all directed to training of machine learning models in distributed environments.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the distributed stochastic gradient descent method disclosed by De in view of Pedrycz to include the limitation “wherein the machine learning model includes a neural network model and the optimization objective is represented by a cross entropy loss function” as taught by Strom. One would be motivated to do so to efficiently train machine learning models and reduce bandwidth, as suggested by Strom (Strom col 3 lines 3-42: “Aspects of this disclosure relate to efficiently distributing the training of models across multiple computing nodes (e.g., two or more separate computing devices). […] In order to reduce the bandwidth required to continuously or periodically exchange such update data among the multiple computing devices, only those updates which are expected to provide a substantive change to the model may be applied and exchanged. […] This can improve the efficiency of the distributed training process by substantially reducing the volume of data that is transmitted and the number of times a given parameter is updated.”).
Regarding claim 19
Claim 19 recites analogous limitations to claim 9 and therefore is rejected on the same ground as claim 9. 
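
For illustration only, a cross entropy objective of the kind named in the Strom passage quoted for claim 9 may be sketched as follows. The sketch uses the standard definition of cross entropy between a target distribution and a predicted distribution; the function name `cross_entropy` and the example values are hypothetical and are not taken from Strom.

```python
import math


def cross_entropy(predicted, target):
    """Cross entropy -sum(t * log(p)) between a target and a predicted distribution.

    Terms with a zero target weight are skipped, since they contribute nothing.
    """
    return -sum(t * math.log(p) for p, t in zip(predicted, target) if t > 0)


# Hypothetical three-class output compared against a one-hot correct output.
loss = cross_entropy([0.7, 0.2, 0.1], [1.0, 0.0, 0.0])
```

With a one-hot target, the loss reduces to the negative logarithm of the probability assigned to the correct class, so a more confident correct prediction yields a smaller error.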

Regarding Claim 10:
Claim 10 is a product claim, corresponding to computer-implemented method claim 1. The only difference is that claim 10 recites an electronic device with a processor and memory. 
Strom teaches:
An electronic device, comprising: a processing unit; (col 12 lines 61-65 “A processor device can also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors”) 
a memory coupled to the processing unit and storing instructions for execution by the processing unit, the instructions, when executed by the processing unit, causing the electronic device to perform acts comprising: (col 13 lines 10-15 “The elements of a method, process, routine, or algorithm described in connection with the embodiments disclosed herein can be embodied directly in hardware, in a software module executed by a processor device, or in a combination of the two. A software module can reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of a non-transitory computer-readable storage medium. An exemplary storage medium can be coupled to the processor device such that the processor device can read information from, and write information to, the storage medium. In the alternative, the storage medium can be integral to the processor device. The processor device and the storage medium can reside in an ASIC. The ASIC can reside in a user terminal.”) 
The rest of the limitations of claim 10 are rejected for the same reasons as claim 1.
Regarding Claim 15
Claim 15 recites analogous limitations to independent claims 1 and 10 and therefore is rejected on the same ground as independent claim 1.

Regarding Claim 17 (New)
De in view of Pedrycz with Strom teaches “The method of claim 15” as seen above. 
Pedrycz further teaches wherein series expansion corresponds to Taylor expansion and the other order terms of the series expansion are not used to update the current values. (Pg. 1588 right col “Our objective is to refine the zero-order fuzzy model by raising its order. In the realization of the construct, we follow a generic idea of a Taylor expansion of a function around a given point specified in the input space.” also see pg. 1596 “the zero-order fuzzy models are more readable and interpretable, higher order models are more interesting for approximation purposes. Therefore, the developed zero-order fuzzy model was expanded to order one using a Taylor expansion technique leading to the increase of its efficiency. Two key design strategies are investigated”)
De, Pedrycz and Strom, and the instant application are analogous art because they are all directed to training of machine learning models in distributed environments.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the distributed stochastic gradient descent method disclosed by De in view of Strom to include a Taylor series expansion as taught by Pedrycz. One would be motivated to do so to obtain a more accurate approximation of a function, as suggested by Pedrycz (pg. 1596 “Although the zero-order fuzzy models are more readable and interpretable, higher order models are more interesting for approximation purposes. Therefore, the developed zero-order fuzzy model was expanded to order one using a Taylor expansion technique leading to the increase of its efficiency.”).
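
For illustration only, the zero-order and first-order terms of a Taylor expansion around a given point, with higher order terms not used, may be sketched as follows. The function name `first_order_expansion` and the example objective are hypothetical; they are not taken from Pedrycz, De, or the instant specification.

```python
def first_order_expansion(f, grad, x0, x):
    """Return (zero_order, first_order) terms of the Taylor expansion of f around x0.

    zero_order  = f(x0)
    first_order = grad(x0) . (x - x0)
    Higher order terms are deliberately not computed.
    """
    zero_order = f(x0)
    first_order = sum(g * (xi - x0i) for g, xi, x0i in zip(grad(x0), x, x0))
    return zero_order, first_order


# Hypothetical objective f(x) = x1^2 + 3*x2 with gradient (2*x1, 3).
f = lambda x: x[0] ** 2 + 3.0 * x[1]
grad = lambda x: [2.0 * x[0], 3.0]

zero, first = first_order_expansion(f, grad, [1.0, 1.0], [1.1, 0.9])
approx = zero + first  # first-order approximation of f at the new point
```

In this example the approximation differs from the exact value of the objective only by the discarded second-order term.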

Regarding Claim 18 (New)
De in view of Pedrycz with Strom teaches “The method of claim 15” as seen above. 
De further teaches wherein the first-order term reflects a rate of change of a gradient of an optimization objective. (Examiner notes that “significant” is not defined and is interpreted as any trend of change. Examiner further notes that a trend of change is equivalent to a gradient, as a gradient is a rate of change. De discloses that the feedback data received from workers is used to show the trend of change (i.e. gradient) of the objective function, f(x), in sec. 4 ¶5, “Thus, when the central server receives parameters from a local node s, the updates it performs have the form […] where $\hat{x}_s$ and $\hat{g}_s$ are given […]”)

Claims 6-7 are rejected under 35 U.S.C. 103 as being unpatentable over De et al. in view of Pedrycz et al. (“Cluster-Centric Fuzzy Modeling”) in view of Strom (US Pat No. 10152676 B1) and further in view of Hsu et al. (“Parallel Online Learning”) (hereinafter Hsu).
Regarding Claim 6
De in view of Pedrycz with Strom teaches “The method of claim 4” as seen above.  
De further teaches wherein determining the coefficients of the transformation comprises: (Examiner notes that “transformation” is not well defined and is interpreted to be anything that results in a change of values. De discloses a transformation in which the coefficients Δx and Δg are determined based on the significant trends of change, shown on Algorithm 3 line 20.)
De in view of Pedrycz with Strom does not teach “determining, based on the gradient, magnitudes of rates of change of the optimization objective with respect to respective parameters in the set of parameters; and determining the linear rates of change based on the magnitudes of the rates of change.”
Hsu teaches determining, based on the gradient, magnitudes of rates of change of the optimization objective with respect to respective parameters in the set of parameters; (Hsu discloses calculating the magnitude of the gradient (i.e. rates of change) of the optimization objective (referred to as l by Hsu) in sec. 0.6.5 ¶2, “Apart from the weight vector $w_t$, nonlinear CG maintains a direction vector $d_t$ and updates are performed in the following way:
$$d_t = -g_t + \beta_t d_{t-1}$$
$$w_{t+1} = w_t + \alpha_t d_t$$
where $g_t = \sum_{\tau \in m(t)} \nabla_w l(\langle w, x_t \rangle, y_t)\big|_{w = w_t}$ is the gradient computed on the t-th minibatch of examples, denoted by m(t). We set $\beta_t$ according to a widely used formula (Gilbert and Nocedal, 1992): $\beta_t = \max\left\{0, \frac{\langle g_t,\, g_t - g_{t-1} \rangle}{\| g_{t-1} \|^2}\right\}$”)
and determining the linear rates of change based on the magnitudes of the rates of change. (Hsu discloses determining linear rates of change in the form of βt in sec. 0.6.5 ¶2, “Apart from the weight vector wt, nonlinear CG maintains a direction vector dt and updates are performed in the following way: 
                
                    
                        
                            d
                        
                        
                            t
                        
                    
                    =
                     
                    -
                    
                        
                            g
                        
                        
                            t
                        
                    
                    +
                    
                        
                            β
                        
                        
                            t
                        
                    
                    
                        
                            d
                        
                        
                            t
                            -
                            1
                        
                    
                
            
                
                    
                        
                            w
                        
                        
                            t
                            +
                            1
                        
                    
                    =
                    
                        
                            w
                        
                        
                            t
                        
                    
                    +
                    
                        
                            α
                        
                        
                            t
                        
                    
                    
                        
                            d
                        
                        
                            t
                        
                    
                
            
where                         
                            
                                
                                    g
                                
                                
                                    t
                                
                            
                            =
                            
                                
                                    ∑
                                    
                                        τ
                                        ∈
                                        m
                                        (
                                        t
                                        )
                                    
                                
                                
                                    
                                        
                                            ∇
                                        
                                        
                                            w
                                        
                                    
                                    l
                                    (
                                    
                                        
                                            w
                                            ,
                                            
                                                
                                                    x
                                                
                                                
                                                    t
                                                
                                            
                                        
                                    
                                    ,
                                     
                                    
                                        
                                            y
                                        
                                        
                                            t
                                        
                                    
                                    )
                                    
                                        
                                            
                                                
                                                    ​
                                                
                                            
                                        
                                        
                                            w
                                            =
                                            
                                                
                                                    w
                                                
                                                
                                                    t
                                                
                                            
                                        
                                    
                                
                            
                        
is the gradient computed on the t-th minibatch of examples, denoted by m(t). We set $\beta_t$ according to a widely used formula (Gilbert and Nocedal, 1992): $\beta_t = \max\{0, \langle g_t, g_t - g_{t-1} \rangle / \lVert g_{t-1} \rVert^2\}$”)
De, Pedrycz, Strom and Hsu and the instant application are analogous art because they are all directed to training machine learning models.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the distributed stochastic gradient descent method disclosed by De in view of Pedrycz with Strom to include the “determining, based on the gradient, magnitudes of rates of change of the optimization objective with respect to respective parameters in the set of parameters; and determining the linear rates of change based on the magnitudes of the rates of change” taught by Hsu. One would be motivated to do so to reduce training time, as suggested by Hsu (Hsu sec. 0.6.5 ¶1: “An algorithm that is slightly more sophisticated than gradient descent is the nonlinear conjugate gradient (CG) method. Nonlinear CG can be thought as gradient descent with momentum where principled ways for setting the momentum and the step sizes are used. Empirically, CG can converge much faster than gradient descent when noise does not drive it too far astray.”).


Regarding Claim 7
De in view of Pedrycz with Strom and Hsu teaches “The method of claim 6” as seen above.  
Hsu further teaches wherein determining the linear rates of change based on the magnitudes of the rates of change comprises: (Hsu discloses determining linear rates of change in the form of $\beta_t$ in sec. 0.6.5 ¶2, “Apart from the weight vector $w_t$, nonlinear CG maintains a direction vector $d_t$ and updates are performed in the following way:
$d_t = -g_t + \beta_t d_{t-1}$
$w_{t+1} = w_t + \alpha_t d_t$
where $g_t = \sum_{\tau \in m(t)} \nabla_w l(w, x_t, y_t) \big|_{w = w_t}$ is the gradient computed on the t-th minibatch of examples, denoted by m(t). We set $\beta_t$ according to a widely used formula (Gilbert and Nocedal, 1992): $\beta_t = \max\{0, \langle g_t, g_t - g_{t-1} \rangle / \lVert g_{t-1} \rVert^2\}$”)
computing squares of the magnitudes of the rates of change; (Hsu discloses calculating the square of the magnitude of the gradient (i.e. rates of change) in sec. 0.6.5 ¶2, “Apart from the weight vector $w_t$, nonlinear CG maintains a direction vector $d_t$ and updates are performed in the following way:
$d_t = -g_t + \beta_t d_{t-1}$
$w_{t+1} = w_t + \alpha_t d_t$
where $g_t = \sum_{\tau \in m(t)} \nabla_w l(w, x_t, y_t) \big|_{w = w_t}$ is the gradient computed on the t-th minibatch of examples, denoted by m(t). We set $\beta_t$ according to a widely used formula (Gilbert and Nocedal, 1992): $\beta_t = \max\{0, \langle g_t, g_t - g_{t-1} \rangle / \lVert g_{t-1} \rVert^2\}$”)
and determining the linear rates of change based on the squares of the magnitudes of the rates of change. (Hsu discloses linear rates of change in the form of $\beta_t$ via squaring the magnitude of the gradient in sec. 0.6.5 ¶2, “Apart from the weight vector $w_t$, nonlinear CG maintains a direction vector $d_t$ and updates are performed in the following way:
$d_t = -g_t + \beta_t d_{t-1}$
$w_{t+1} = w_t + \alpha_t d_t$
where $g_t = \sum_{\tau \in m(t)} \nabla_w l(w, x_t, y_t) \big|_{w = w_t}$ is the gradient computed on the t-th minibatch of examples, denoted by m(t). We set $\beta_t$ according to a widely used formula (Gilbert and Nocedal, 1992): $\beta_t = \max\{0, \langle g_t, g_t - g_{t-1} \rangle / \lVert g_{t-1} \rVert^2\}$”)
It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to modify De in view of Pedrycz with Strom to include the teachings of Hsu for at least the same reasons as discussed above in claim 6.
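For illustration only, the nonlinear CG update quoted from Hsu sec. 0.6.5 ¶2 can be sketched as below. This is a minimal sketch and is not code from Hsu or from the instant application. The function names (`dot`, `pr_plus_beta`, `cg_step`), the use of plain Python lists as vectors, and the fallback to a coefficient of zero when the previous gradient is all zeros are assumptions made for the example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def pr_plus_beta(g_t, g_prev):
    """beta_t = max{0, <g_t, g_t - g_prev> / ||g_prev||^2}."""
    diff = [a - b for a, b in zip(g_t, g_prev)]
    sq_norm = dot(g_prev, g_prev)  # square of the magnitude of the previous gradient
    if sq_norm == 0.0:
        # Assumption for this sketch: with a zero previous gradient,
        # drop the momentum term and fall back to plain gradient descent.
        return 0.0
    return max(0.0, dot(g_t, diff) / sq_norm)


def cg_step(w_t, g_t, g_prev, d_prev, alpha_t):
    """One update: d_t = -g_t + beta_t * d_prev, w_next = w_t + alpha_t * d_t."""
    beta_t = pr_plus_beta(g_t, g_prev)
    d_t = [-g + beta_t * d for g, d in zip(g_t, d_prev)]
    w_next = [w + alpha_t * d for w, d in zip(w_t, d_t)]
    return w_next, d_t, beta_t
```

In this sketch the squared magnitude of the previous gradient appears only in the denominator of the coefficient, which is the quantity the rejection maps to the claimed "squares of the magnitudes of the rates of change".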

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over De et al. in view of Pedrycz et al. (“Cluster-Centric Fuzzy Modeling”) in view of Strom (US Pat No. 10152676 B1) and further in view of Corrado et al. (US9218573) (hereinafter Corrado).
Regarding Claim 8
De in view of Pedrycz with Strom teaches “The method of claim 1” as seen above. 
De further teaches [and in response to the request,] transmitting the updated values of the set of parameters to the worker. (De discloses transmitting the updated values in Algorithm 3, shown highlighted below.)

[Image: media_image4.png (De, Algorithm 3, with the transmitting step highlighted)]

De in view of Pedrycz with Strom does not explicitly teach “further comprising: receiving a request for the set of parameters from the worker; and in response to the request, transmitting the updated values of the set of parameters to the worker.”
Corrado teaches the method further comprising: receiving a request for the set of parameters from the worker; (Corrado discloses in col 3 lines 57-60, “The replica obtains the refreshed value of a parameter by submitting a request to the parameter server shard that maintains the values of the parameter.” The worker is referred to as the replica by Corrado.) 
and in response to the request, transmitting the updated values of the set of parameters to the worker. (Corrado discloses that the updated parameters are transmitted to the worker (i.e. the replica) in col 3 lines 55-57, “As part of the parameter updating aspect 210, the replica obtains refreshed parameter values (step 211) and overwrites current values of the parameters (data 212).”) 
De, Pedrycz, Strom and Corrado are analogous art because they are all directed to training machine learning models.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the distributed stochastic gradient descent method disclosed by De in view of Pedrycz with Strom to include the “receiving a request for the set of parameters from the worker; and in response to the request, transmitting the updated values of the set of parameters to the worker” taught by Corrado. One would be motivated to do so to efficiently train machine learning models, as suggested by Corrado (Corrado col. 1, line 58 through col. 2, line 3: “Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. Machine learning models with large numbers of parameters can be trained efficiently and effectively. […] Because model replicas operate asynchronously, problems caused by hardware failures and slow processing speeds are mitigated.”).
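
For illustration only, the request and response exchange relied upon from Corrado can be sketched as follows. This is a minimal sketch and not code from any cited reference. The class names `ParameterServer` and `Worker`, the method names, the learning rate, and the plain gradient step are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class ParameterServer:
    """Hypothetical in-memory parameter server holding the current values."""
    values: dict = field(default_factory=dict)

    def apply_update(self, gradients, lr=0.1):
        # Plain gradient step; the update rule is an assumption for illustration.
        for name, grad in gradients.items():
            self.values[name] -= lr * grad

    def handle_request(self, names):
        # In response to a worker's request, return the updated values.
        return {name: self.values[name] for name in names}


@dataclass
class Worker:
    """Hypothetical worker (the "replica" in Corrado's terminology)."""
    local: dict = field(default_factory=dict)

    def refresh(self, server, names):
        # Request refreshed values and overwrite the local copies.
        self.local.update(server.handle_request(names))


server = ParameterServer(values={"w": 1.0, "b": 0.0})
worker = Worker(local={"w": 1.0, "b": 0.0})
server.apply_update({"w": 2.0, "b": -1.0})
worker.refresh(server, ["w", "b"])
```

In this sketch the worker's local copy is overwritten only when it asks for the values, which mirrors the "obtains refreshed parameter values" and "overwrites current values" steps quoted from Corrado.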


Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over De et al. in view of Pedrycz et al. (“Cluster-Centric Fuzzy Modeling”) in view of Strom (US Pat No. 10152676 B1) and further in view of Higgins et al.
Regarding Claim 16 (New)
De in view of Pedrycz with Strom teaches “The method of claim 15” as seen above. 
De in view of Pedrycz with Strom does not teach wherein the machine learning model is trained by the worker using training data that is randomly sampled from a complete set of training data.
Higgins teaches wherein the machine learning model is trained by the worker using training data that is randomly sampled from a complete set of training data. (Pg. 3 section 3.1 “To ensure smooth affine object transforms, each two subsequent values for each factor vk were chosen to ensure minimal differences in pixel space given 64x64 pixel image resolution. We used randomly sampled batches of size 100 to train a fully connected VAE with m = 10 latent units and various β values until convergence (see Tbl. 1 in Appendix for details). After training, a VAE with β = 0.01 learnt a good (while not perfect) disentangled representation of the data generative factors, and its decoder learnt to act as a rendering engine (Fig. 2A).”)
De, Pedrycz, Strom and Higgins are analogous art because they are all directed to training machine learning models.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the distributed stochastic gradient descent method disclosed by De in view of Pedrycz with Strom to include “randomly sampled from a complete set of training data” as taught by Higgins. One would be motivated to do so in order to train on randomly sampled batches drawn from the complete set of training data, as disclosed by Higgins (pg. 3 section 3.1 “To ensure smooth affine object transforms, each two subsequent values for each factor vk were chosen to ensure minimal differences in pixel space given 64x64 pixel image resolution. We used randomly sampled batches of size 100 to train a fully connected VAE with m = 10 latent units and various β values until convergence (see Tbl. 1 in Appendix for details).”).
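
For illustration only, drawing a randomly sampled batch from a complete set of training data can be sketched as follows. This is a minimal sketch and not code from Higgins. The function name `sample_batch`, the seed, and the placeholder dataset of integers are assumptions introduced here; only the batch size of 100 follows the quoted passage.

```python
import random


def sample_batch(dataset, batch_size, rng):
    # Draw batch_size distinct examples uniformly at random, without replacement.
    return rng.sample(dataset, batch_size)


rng = random.Random(0)            # fixed seed so the sketch is repeatable
complete_set = list(range(1000))  # stand-in for the complete set of training data
batch = sample_batch(complete_set, 100, rng)
```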

Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over De et al. in view of Pedrycz et al. (“Cluster-Centric Fuzzy Modeling”) in view of Strom (US Pat No. 10152676 B1) and further in view of Agarwal et al. (“Distributed Delayed Stochastic Optimization”).
Regarding Claim 20 (New)
De in view of Pedrycz with Strom teaches “The method of claim 15” as seen above. 
De in view of Pedrycz with Strom does not teach wherein updating the current values provides compensation for delay between a plurality of workers providing respective feedback data generated by training the machine learning model, the plurality of workers comprising the worker.
Agarwal teaches wherein updating the current values provides compensation for delay between a plurality of workers providing respective feedback data generated by training the machine learning model, (FIG. 1 “Cyclic delayed update architecture. Workers compute gradients cyclically and in parallel, passing out-of-date information to master. Master responds with current parameters. Diagram shows parameters and gradients communicated between rounds t and t + n − 1.”)
the plurality of workers comprising the worker. (Pg. 8 second paragraph “We divide the N samples among n workers so that each worker has an N/n-sized subset of data. In streaming applications, the distribution P is the unknown distribution generating the data, and each worker receives a stream of independent data points”)
De, Pedrycz, Strom and Agarwal are analogous art because they are all directed to training machine learning models.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the distributed stochastic gradient descent method disclosed by De in view of Pedrycz with Strom to include distributed delayed stochastic optimization as taught by Agarwal. One would be motivated to do so in order to achieve order-optimal convergence on smooth stochastic problems despite delayed updates, as disclosed by Agarwal (abstract “We take motivation from statistical problems where the size of the data is so large that it cannot fit on one computer; with the advent of huge datasets in biology, astronomy, and the internet, such problems are now common. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible and we can achieve order-optimal convergence results”).
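
For illustration only, a master applying out-of-date gradients, as in the delayed update architecture quoted from Agarwal's FIG. 1, can be sketched as follows. This is a minimal single-process sketch and not Agarwal's algorithm as published. The function name `delayed_sgd`, the fixed delay, the constant step size, and the quadratic objective are assumptions introduced here.

```python
from collections import deque


def delayed_sgd(grad, x0, lr, delay, steps):
    """Gradient descent where each step uses a gradient evaluated at
    parameters that are `delay` rounds old (hypothetical fixed delay)."""
    x = x0
    # history[0] is the parameter value from `delay` rounds ago;
    # history[-1] is the current value.
    history = deque([x0] * (delay + 1), maxlen=delay + 1)
    for _ in range(steps):
        stale_x = history[0]
        x = x - lr * grad(stale_x)  # master applies the out-of-date gradient
        history.append(x)
    return x


# Assumed objective f(x) = 0.5 * x**2, whose gradient is x.
result = delayed_sgd(lambda x: x, x0=1.0, lr=0.1, delay=3, steps=200)
```

With a small step size relative to the delay, the iterate in this sketch still approaches the minimizer at zero, which is consistent with the quoted statement that delays can be asymptotically negligible for smooth problems.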


Prior Art of Record
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Reddi et al. (“On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants”) teaches various methods for asynchronous versions of optimization algorithms (Reddi Abstract: “We study optimization algorithms based on variance reduction for stochastic gradient descent (SGD). […] Subsequently, we propose an asynchronous algorithm grounded in our framework, and prove its fast convergence. An important consequence of our general approach is that it yields asynchronous versions of variance reduction algorithms such as SVRG and SAGA as a byproduct.”) 

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to VAN C MANG whose telephone number is (571)270-7598. The examiner can normally be reached Monday through Friday, 8:00 am to 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571)272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/V.M./Examiner, Art Unit 2126
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126