DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119 (a)-(d). The certified copy has been filed in parent Application No. CN201611073994, filed on 08/16/2019.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 09/05/2019 and 11/14/2019 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.
Claim Status
Examiner acknowledges the status of preliminary amendments to claims 1, 2, 4-8, and 10-15. Claims 1-15 are being considered by the examiner. 
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 

Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitations are: “the communicator” and “the calculator” in claims 7, 8, 11, and 12.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim 
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 7-12 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or, for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention.
The claim limitations “the communicator configured to” and “the calculator configured to” recited in claims 7, 8, 11, and 12 invoke 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. There is a lack of antecedent basis for a generic communicator and a generic calculator performing the respective functions in the recited claim limitations. Therefore, the claims are indefinite and are rejected under 35 U.S.C. 112(b) or pre-AIA 35 U.S.C. 112, second paragraph. As claims 9 and 10 depend on claim 7, they are rejected on the same basis.
Applicant may:

(a)        Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph; 
(b)        Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(c)        Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: 
(a)        Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(b)        Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.


Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.



Claims 1-15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Li et al. ("Communication Efficient Distributed Machine Learning with the Parameter Server").
Regarding claim 1, Li et al. teaches a method for training a neural network model, wherein the method is applicable to a training system comprising (Li et al. [Abstract] “This paper describes a third-generation parameter server framework for distributed machine learning”): 
a server module and N worker modules (Li et al. [Section 2: The Parameter Server Architecture, Paragraph 1] “An instance of the parameter server contains a server group and several worker groups, in which a group has several machines”, where the parameter server is a server module and a worker machine is a worker module), wherein the server module and the N worker modules are configured to train a model parameter within at least one training period, wherein each one of the at least one training period comprises K iterations (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “The model w is learned iteratively”, where the training period of the model is iterative, where the model w is the model parameter that is trained iteratively, and where the K iterations (t in algorithm 2) start from 1 as seen in Fig. 2), and wherein for an ith iteration of one of the N worker modules within each training period of the at least one training period, each worker module of the N worker modules performs in parallel: calculating a model parameter of an (i+1)th iteration based on a local gradient of the ith iteration and a model parameter of the ith iteration (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1; Section 3.1; Section 3.1 Figure and Figure 2] “As illustrated in Figure 2 and Algorithm 1, training data is partitioned and distributed among all the workers. The model w is learned iteratively. In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers. A worker needs the model w to compute the gradients.”, where the local gradient for the next iteration is calculated by workers based on a local gradient calculated within the preceding iteration using the shared parameter of the same previous iteration; “By default callees execute tasks in parallel for best performance…The diagram on the right illustrates the execution of three tasks. 
Tasks 10 and 11 are independent, but 12 depends on 11.”, where the shared parameter w is a model parameter and the callees are the workers performing the local gradient calculations independently in parallel, as seen in Tasks 10 and 11 in the figure located in Section 3.1), and wherein for each iteration where i is less than K, a local gradient of the (i+1)th iteration is calculated based on the model parameter of the (i+1)th iteration and a sample data of the (i+1)th iteration (Li et al. [Section 2: Distributed Subgradient Descent, Paragraphs 1-2] “In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w…A worker needs the model w to compute the gradients…For the motivating example introduced in (1), we can implement a standard distributed subgradient descent algorithm using the parameter server”, where the local gradient of an iteration is calculated based on the shared model parameter w of that iteration and the training data, i.e., sample data); and pulling a global gradient of an rth iteration from the server module and/or pushing a local gradient of an fth iteration to the server module (Li et al. [Section 3.1, Paragraph 1 and Figure 2] “We decompose the workloads in the parameter server into tasks that are issued by a caller to a remote callee. There is considerable flexibility in terms of what constitutes a task: for instance, a task can be a push or a pull that a worker issues to servers”, where the worker pulls a global parameter in an rth iteration and pushes a local gradient calculation to the server in an fth iteration, as seen in Fig. 2), where r and f each are a positive integer less than or equal to i, where N and K each are an integer greater than or equal to 1, and where i is an integer greater than or equal to 1 and less than or equal to K (Li et al. 
[Section 2: Distributed Subgradient Descent, Paragraph 1] “For the motivating example introduced in (1), we can implement a standard distributed subgradient descent algorithm [34] using the parameter server. As illustrated in Figure 2 and Algorithm 1, training data is partitioned and distributed among all the workers. The model w is learned iteratively… In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers…”, where the i iterations consist of calculating local gradients until the gradient change is applied by an agent within K iterations, where the pushing of the local gradient (the fth iteration) and the pulling of the global model (the rth iteration) do not always occur at every ith iteration, and where N represents the plurality of workers).
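For illustration only, the iterative scheme described in the passage quoted above from Li et al. (workers compute local gradients on their own data partitions, the server aggregates them to update the shared parameter w, and the workers then retrieve the updated weights) can be sketched as follows. This is a minimal sketch under stated assumptions, not Li et al.'s implementation: the names `Server`, `worker_gradient`, and `train` are hypothetical, a squared-error loss is assumed, and aggregation is assumed to be a plain average.

```python
# Illustrative sketch only. All names are hypothetical; the squared-error
# loss and the averaging of pushed gradients are assumptions.

def worker_gradient(w, shard):
    """Local gradient of the mean squared error over one worker's data shard."""
    grad = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for k, xk in enumerate(x):
            grad[k] += 2.0 * err * xk / len(shard)
    return grad


class Server:
    """Holds the globally shared parameter w and aggregates pushed gradients."""

    def __init__(self, dim, lr):
        self.w = [0.0] * dim
        self.lr = lr
        self.pending = []

    def push(self, grad):
        # A worker pushes its local gradient for the current iteration.
        self.pending.append(grad)

    def update(self):
        # Average every pushed gradient, then take one descent step on w.
        n = len(self.pending)
        for k in range(len(self.w)):
            avg = sum(g[k] for g in self.pending) / n
            self.w[k] -= self.lr * avg
        self.pending = []

    def pull(self):
        # A worker pulls a copy of the current shared parameter.
        return list(self.w)


def train(shards, dim, lr=0.1, iterations=500):
    server = Server(dim, lr)
    for _ in range(iterations):
        w = server.pull()                 # workers retrieve w
        for shard in shards:              # one shard per worker
            server.push(worker_gradient(w, shard))
        server.update()                   # update only after all pushes
    return server.pull()


# Toy data for y = 2x + 1, with a constant bias feature, split over two workers.
data = [([1.0, x], 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
shards = [data[:2], data[2:]]
w = train(shards, dim=2)
```

In this sketch the server updates w only after every worker has pushed, which mirrors the dependency described in the quoted passage; the workers are simulated sequentially rather than run in parallel.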
	Regarding claim 2, Li et al. teaches the method wherein the calculating a model parameter of an (i+1)th iteration based on a local gradient of the ith iteration and a model parameter of the ith iteration comprises (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1 and Figure 2] “As illustrated in Figure 2 and Algorithm 1, training data is partitioned and distributed among all the workers. The model w is learned iteratively. In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers. A worker needs the model w to compute the gradients.”, where the local gradient for the next iteration is calculated by workers based on a local gradient calculated within the preceding iteration using the shared parameter of the same previous iteration. Therefore the model parameter of the next iteration is based on the model and local gradient of the preceding iteration.): calculating, for a global gradient of a jth iteration that meets a first condition that has been pulled from the server module, the model parameter of the (i+1)th iteration based on the global gradient of the jth iteration, the local gradient of the ith iteration, and the model parameter of the ith iteration (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1 and Figure 2] “As illustrated in Figure 2 and Algorithm 1, training data is partitioned and distributed among all the workers. The model w is learned iteratively. In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers. 
A worker needs the model w to compute the gradients.”, where the local gradient for the next iteration is calculated by workers based on a local gradient calculated within the preceding iteration using the shared parameter of the same previous iteration. Therefore the model parameter of the next iteration is based on the model and local gradient of the preceding iteration.), where j is a positive integer less than or equal to i (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “The model w is learned iteratively”, where the training period of the model is iterative, comprising K iterations), and the first condition comprises: the global gradient of the jth iteration has not been used to calculate a model parameter in any iteration between a first iteration and the ith iteration (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers.”, where the updated shared parameter has not been used before, because the shared parameter was updated to reflect the changes of the local gradients calculated.); or calculating, for a global gradient of a jth iteration that meets a first condition that has not been pulled from the server module (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “The model w is learned iteratively. In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers.”), the model parameter of the (i+1)th iteration based on the local gradient of the ith iteration and the model parameter of the ith iteration (Li et al. [Figure 2 and Section 2: Distributed Subgradient Descent, Paragraph 1] “As illustrated in Figure 2 and Algorithm 1, training data is partitioned and distributed among all the workers. 
The model w is learned iteratively. In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers. A worker needs the model w to compute the gradients.”, where the local gradient for the next iteration is calculated by workers based on a local gradient calculated within the preceding iteration using the shared parameter of the same previous iteration. Therefore the model parameter of the next iteration is based on the model and local gradient of the preceding iteration.).
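For illustration only, the two alternatives recited in claim 2 can be sketched as below. The claim language quoted in this action does not state an update formula, so the gradient-descent form, the learning rate `lr`, and the weighting `global_weight` are assumptions, as are all function and variable names. Choosing the largest unused iteration number reflects the further condition recited in claim 3.

```python
# Illustrative sketch only. The update formula, lr, and global_weight are
# assumptions; the claims as quoted do not specify them.

def next_parameter(w_i, local_grad_i, pulled, used, lr=0.1, global_weight=0.5):
    """Return (model parameter of iteration i+1, index j consumed or None).

    w_i          -- model parameter of the ith iteration (list of floats)
    local_grad_i -- local gradient of the ith iteration
    pulled       -- dict mapping iteration j to a pulled global gradient
    used         -- set of iterations j whose global gradient was already used
    """
    # First condition: the global gradient has not been used in any earlier
    # iteration. Among the candidates, take the largest iteration number.
    candidates = [j for j in pulled if j not in used]
    if candidates:
        j = max(candidates)
        used.add(j)
        w_next = [
            w - lr * (lg + global_weight * gg)
            for w, lg, gg in zip(w_i, local_grad_i, pulled[j])
        ]
        return w_next, j
    # No qualifying global gradient has been pulled: use the local gradient only.
    w_next = [w - lr * lg for w, lg in zip(w_i, local_grad_i)]
    return w_next, None
```

A worker would call `next_parameter` once per iteration, so that a pulled global gradient is folded in at most once and the local computation never waits on the server.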
	Regarding claim 3, Li et al. teaches the method wherein the first condition further comprises: the global gradient of the jth iteration is a global gradient in an iteration with a largest iteration batch number in all global gradients that have been pulled from the server module (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers.”, where updated shared parameter has not been used before, because the shared parameter was updated to reflect the changes of the local gradients calculated.).
	Regarding claim 4, Li et al. teaches the method wherein the global gradient of the jth iteration is determined based on the following one or more local gradients of the jth iteration that are reported by M of the N worker modules, where M is an integer greater than or equal to 1 and less than or equal to N (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w”, where each worker, is in a group of workers, therefore each worker represents M up until the total number of workers in the group.).
	Regarding claim 5, Li et al. teaches the method wherein the pulling, a global gradient of an r iteration from the server module and/or the pushing, a local gradient of an f iteration to the server module comprises pulling the global gradient of the rth iteration and/or pushing, to the server module, either a local gradient of an (i-1)th iteration or the local gradient of the ith iteration (Li et al. [Section 3.1 and Figure 2] “We decompose the workloads in the parameter server into tasks that are issued by a caller to a remote callee. There is considerable flexibility in terms of what constitutes a task: for instance, a task can be a push or a pull that a worker issues to servers”, where the worker pulls a global parameter, and r iteration and pushes a local gradient calculation to the server calculated by an i iteration, an f iteration, seen in Fig. 2.).
	Regarding claim 6, Li et al. teaches the method wherein, where i is K, the method further comprises: pushing a model parameter of a (K+1)th iteration to the server module after calculating a local gradient of a Kth iteration and calculating the model parameter of the (K+1)th iteration based on the local gradient of the Kth iteration and a model parameter of the Kth iteration, wherein the model parameter of the (K+1)th iteration enables the server module to determine a model parameter of a first iteration within a next training period based on the iteration quantity K and the model parameter of the (K+1)th iteration that is pushed by each of the N worker modules to the server module (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1; Section 3.1 and Figure 2] “The model w is learned iteratively. In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers… For example, the aggregation logic at servers in Algorithm 1 can be implemented by having the updating task depend on the push tasks of all workers. In this way, the weight w is updated only after all worker gradients have been aggregated.”, where the shared parameter w used for the next iteration of K is pushed, based on the calculation of the local gradients and the shared parameter at that current iteration of K, where i is K when the shared parameter w is updated. The updated shared parameter w is the model parameter determined from the local gradient calculations that each worker pushes to the server, as seen in Figure 2).
	Regarding claim 7, Li et al. teaches a training apparatus for training a neural network model, the training apparatus comprising N worker modules, wherein the training apparatus is applicable to a training system comprising the training apparatus and a server module (Li et al. [Abstract] “This paper describes a third-generation parameter server framework for distributed machine learning”), wherein the server module and the N worker modules are configured to train a model parameter within at least one training period (Li et al. [Section 2: The Parameter Server Architecture, Paragraph 1] “An instance of the parameter server contains a server group and several worker groups, in which a group has several machines”, where the parameter server is a server module and a worker machine is a worker module), and wherein each training period of the at least one training period comprises K iterations (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “The model w is learned iteratively”, where the training period of the model is iterative, where the model w is the model parameter that is trained iteratively, and where the K iterations (t in algorithm 2) start from 1 as seen in Fig. 2), and wherein each of the N worker modules comprises a communicator and a calculator (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1 and Section 3.1] “each worker computes the local gradients…We decompose the workloads in the parameter server into tasks that are issued by a caller to a remote callee. There is considerable flexibility in terms of what constitutes a task: for instance, a task can be a push or a pull that a worker issues to servers”, where the worker communicates with the parameter server and computing the local gradients involves performing calculation); and wherein for an ith iteration of one of the N worker modules within each training period of the at least one training period (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “The model w is learned iteratively. 
In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w.”): the communicator and the calculator of each worker module run in parallel (Li et al. [Section 3.1; Section 3.1 Figure] “By default callees execute tasks in parallel for best performance…The diagram on the right illustrates the execution of three tasks. Tasks 10 and 11 are independent, but 12 depends on 11.”, where the shared parameter w is a model parameter and the callees are the workers performing the local gradient calculations independently in parallel, as seen in Tasks 10 and 11 in the figure located in Section 3.1), wherein the calculator is configured to: calculate a model parameter of an (i+1)th iteration based on a local gradient of the ith iteration and a model parameter of the ith iteration, and wherein for each iteration where i is less than K, a local gradient of the (i+1)th iteration is calculated based on the model parameter of the (i+1)th iteration and a sample data of the (i+1)th iteration (Li et al. [Section 2: Distributed Subgradient Descent, Paragraphs 1-2] “In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w…A worker needs the model w to compute the gradients…For the motivating example introduced in (1), we can implement a standard distributed subgradient descent algorithm using the parameter server”, where the local gradient of an iteration is calculated based on the shared model parameter w of that iteration and the training data, i.e., sample data); and the communicator is configured to: pull a global gradient of an rth iteration from the server module and/or push a local gradient of an fth iteration to the server module (Li et al. 
[Section 3.1 and Figure 2] “We decompose the workloads in the parameter server into tasks that are issued by a caller to a remote callee. There is considerable flexibility in terms of what constitutes a task: for instance, a task can be a push or a pull that a worker issues to servers”, where the worker pulls a global parameter in an rth iteration and pushes a local gradient calculation to the server in an fth iteration, as seen in Fig. 2), where r and f each are a positive integer less than or equal to i, where N and K each are an integer greater than or equal to 1, and where i is an integer greater than or equal to 1 and less than or equal to K (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “For the motivating example introduced in (1), we can implement a standard distributed subgradient descent algorithm [34] using the parameter server. As illustrated in Figure 2 and Algorithm 1, training data is partitioned and distributed among all the workers. The model w is learned iteratively… In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers…”, where the i iterations consist of calculating local gradients until the gradient change is applied by an agent within K iterations, where the pushing of the local gradient (the fth iteration) and the pulling of the global model (the rth iteration) do not always occur at every ith iteration, and where N represents the plurality of workers).
	Regarding claim 8, Li et al. teaches the apparatus wherein the calculator is configured to: calculate, for a global gradient of a jth iteration that meets a first condition that has been pulled from the server module, the model parameter of the (i+1)th iteration based on the global gradient of the jth iteration, the local gradient of the ith iteration, and the model parameter of the ith iteration where j is a positive integer less than or equal to i (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1 and Figure 2] “As illustrated in Figure 2 and Algorithm 1, training data is partitioned and distributed among all the workers. The model w is learned iteratively. In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers. A worker needs the model w to compute the gradients.”, where the calculation of model parameter used for the set of local gradient calculation in the next iteration is based on the current iteration local gradient and model parameter of the current iteration.) and the first condition comprises: the global gradient of the jth iteration has not been used to calculate a model parameter in any iteration between a first iteration and the ith iteration (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “servers aggregate these gradients to update the globally shared parameter w. 
Then the workers retrieve the updated weights from the servers.”, where updated shared parameter has not been used before, because the shared parameter was updated to reflect the changes of the local gradients calculated.); calculate, for a global gradient of a jth iteration that meets a first condition that has not been pulled from the server module, the model parameter of the (i+1)th iteration based on the local gradient of the ith iteration and the model parameter of the ith iteration (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1 and Figure 2] “As illustrated in Figure 2 and Algorithm 1, training data is partitioned and distributed among all the workers. The model w is learned iteratively. In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers. A worker needs the model w to compute the gradients.”, where the calculation of model parameter used for the set of local gradient calculation in the next iteration is based on the current iteration local gradient and model parameter of the current iteration.).
	Regarding claim 9, Li et al. teaches the apparatus wherein the first condition further comprises: the global gradient of the jth iteration is a global gradient in an iteration with a largest iteration batch number in all global gradients that have been pulled from the server module (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “The model w is learned iteratively”, where the subsequent model w after each iteration automatically has the largest iteration batch number).
	Regarding claim 10, Li et al. teaches the apparatus wherein the global gradient of the jth iteration is determined based on the following: one or more local gradients of the jth iteration that are reported by M of the N worker modules, where M is an integer greater than or equal to 1 and less than or equal to N (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w”, where each worker, is in a group of workers, therefore each worker represents M up until the total number of workers in the group.).
	Regarding claim 11, Li et al. teaches the apparatus wherein the communicator is configured to perform pulling, from the server module, the global gradient of the rth iteration and/or pushing, to the server module, either a local gradient of the (i-1)th iteration or the local gradient of the ith iteration (Li et al. [Section 3.1 and Figure 2] “We decompose the workloads in the parameter server into tasks that are issued by a caller to a remote callee. There is considerable flexibility in terms of what constitutes a task: for instance, a task can be a push or a pull that a worker issues to servers”, where the worker pulls a global parameter, and r iteration and pushes a local gradient calculation to the server calculated by an i iteration, an f iteration, seen in Fig. 2.).
	Regarding claim 12, Li et al. teaches the apparatus wherein, where i is K, the communicator is further configured to: push a model parameter of a (K+1)th iteration to the server module after calculating, by the calculation module, a local gradient of a Kth iteration and calculating the model parameter of the (K+1)th iteration based on the local gradient of the Kth iteration and a model parameter of the Kth iteration, wherein the model parameter of the (K+1)th iteration is used to enable the server module to determine a model parameter of a first iteration within a next training period based on the iteration quantity K and the model parameter of the (K+1)th iteration that is pushed by each of the N worker modules to the server module (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1; Section 3.1 and Figure 2] “The model w is learned iteratively. In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers… For example, the aggregation logic at servers in Algorithm 1 can be implemented by having the updating task depend on the push tasks of all workers. In this way, the weight w is updated only after all worker gradients have been aggregated.”, where the shared parameter w used for the next iteration of K is pushed, based on the calculation of the local gradients and the shared parameter at that current iteration of K, where i is K when the shared parameter w is updated. The updated shared parameter w is determined by the pushing of each of the local gradient calculations to the server, as seen in Figure 2.).
(Li et al. [Section 5] “We ran the parameter server on 1000 machines, each with 16 CPU cores, 192GB DRAM, and connected by 10Gb Ethernet”); wherein the training apparatus is applicable to a training system that comprises (Li et al. [Abstract] “This paper describes a third-generation parameter server framework for distributed machine learning”) a server module and the apparatus, wherein the server module and the N processor cores are configured to train a model parameter within at least one training period, and wherein each training period of the at least one training period comprises K iterations (Li et al. [Section 2: The Parameter Server Architecture, Paragraph 1 and Figure 2] “An instance of the parameter server contains a server group and several worker groups, in which a group has several machines”, where the parameter server is a server module and a worker machine is a worker module; “The model w is learned iteratively”, where the model w is the model parameter that is trained iteratively, where K iterations (t in algorithm 2) start from 1, as seen in Fig. 2) and wherein the memory is configured to store an instruction; wherein the processor is configured to: execute the instruction stored in the memory, and control the transceiver to transmit data to the server module; and wherein executing, by the processor, the instruction stored in the memory causes each of the N processor cores comprising: (Li et al. [Section 2: The Parameter Server Architecture; Section 5: Setup] “An instance of the parameter server [4] contains a server group and several worker groups, in which a group has several machines. Each machine in the server group maintains a portion of the global parameters, and all servers communicate with each other to replicate and/or migrate parameters for reliability and scaling… We ran the parameter server on 1000 machines, each with 16 CPU cores, 192GB DRAM, and connected by 10Gb Ethernet”); calculating a model parameter of an (i+1)th iteration based on a local gradient of the ith iteration and a model parameter of the ith iteration, and wherein for each iteration where i is less than K, a local gradient of the (i+1)th iteration is calculated based on the model parameter of the (i+1)th iteration and a sample data of the (i+1)th iteration (Li et al. [Section 2: Distributed Subgradient Descent, Paragraphs 1-2] “In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w…A worker needs the model w to compute the gradients…For the motivating example introduced in (1), we can implement a standard distributed subgradient descent algorithm using the parameter server”, where the local gradient of the same iteration is calculated based on the shared parameter w, the model parameter, of that iteration and training data, a sample data.); and pulling a global gradient of an rth iteration from the server module and/or pushing a local gradient of an fth iteration to the server module (Li et al. [Section 3.1 and Figure 2] “We decompose the workloads in the parameter server into tasks that are issued by a caller to a remote callee. There is considerable flexibility in terms of what constitutes a task: for instance, a task can be a push or a pull that a worker issues to servers”, where the worker pulls a global parameter, an r iteration, and pushes a local gradient calculation to the server, an f iteration, as seen in Fig. 2.), where r and f each are a positive integer less than or equal to i, where N and K each are an integer greater than or equal to 1, and where i is an integer greater than or equal to 1 and less than or equal to K (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “For the motivating example introduced in (1), we can implement a standard distributed subgradient descent algorithm [34] using the parameter server. As illustrated in Figure 2 and Algorithm 1, training data is partitioned and distributed among all the workers. The model w is learned iteratively… In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers…”, where i iterations consist of calculating the local gradient until the gradient change is applied by an agent within K iterations, where the pushing of the local gradient, the f iteration, and the pulling of the global model, the r iteration, do not always occur at every i iteration, and where N represents the plurality of workers.).
	Regarding claim 14, Li et al. teaches a chip for training a neural network model, wherein the chip is applicable to a training system (Li et al. [Abstract] “This paper describes a third-generation parameter server framework for distributed machine learning”) that comprises N chips and a server module (Li et al. [Section 2: The Parameter Server Architecture, Paragraph 1] “An instance of the parameter server contains a server group and several worker groups, in which a group has several machines”, where the parameter server is a server module and a worker is a chip.), the server module and the N chips are configured to train a model parameter within at least one training period, and each of the at least one training period comprises K iterations and the chip is configured to perform a method comprising (Li et al. [Section 2] “An instance of the parameter server contains a server group and several worker groups, in which a group has several machines”, where the parameter server is a server module and a worker machine is a worker module; “The model w is learned iteratively”, where the training period of the model is iterative, where the model w is the model parameter that is trained iteratively, where K iterations (t in algorithm 2) start from 1, as seen in Fig. 2): calculating a model parameter of an (i+1)th iteration based on a local gradient of the ith iteration and a model parameter of the ith iteration; and wherein for each iteration where i is less than K, a local gradient of the (i+1)th iteration is calculated based on the model parameter of the (i+1)th iteration and a sample data of the (i+1)th iteration (Li et al. [Section 2: Distributed Subgradient Descent, Paragraphs 1-2] “In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w…A worker needs the model w to compute the gradients…For the motivating example introduced in (1), we can implement a standard distributed subgradient descent algorithm using the parameter server”, where the local gradient of the same iteration is calculated based on the shared model parameter w, the model parameter, of that iteration and training data, a sample data.) and pulling a global gradient of an rth iteration from the server module and/or pushing a local gradient of an fth iteration to the server module (Li et al. [Section 3.1 and Figure 2] “We decompose the workloads in the parameter server into tasks that are issued by a caller to a remote callee. There is considerable flexibility in terms of what constitutes a task: for instance, a task can be a push or a pull that a worker issues to servers”, where the worker pulls a global parameter, an r iteration, and pushes a local gradient calculation to the server, an f iteration, as seen in Fig. 2.), where r and f each are a positive integer less than or equal to i, where N and K each are an integer greater than or equal to 1, and where i is an integer greater than or equal to 1 and less than or equal to K (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “For the motivating example introduced in (1), we can implement a standard distributed subgradient descent algorithm [34] using the parameter server. As illustrated in Figure 2 and Algorithm 1, training data is partitioned and distributed among all the workers. The model w is learned iteratively… In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers…”, where i iterations consist of calculating the local gradient until the gradient change is applied by an agent within K iterations, where the pushing of the local gradient, the f iteration, and the pulling of the global model, the r iteration, do not always occur at every i iteration, and where N represents the plurality of workers.).
	Regarding claim 15, Li et al. teaches a non-transitory computer storage medium, wherein the computer storage medium stores a computer executable instruction, and when being called by a training system (Li et al. [Abstract] “This paper describes a third-generation parameter server framework for distributed machine learning”) comprising a server module and N worker modules (Li et al. [Section 2: The Parameter Server Architecture, Paragraph 1] “An instance of the parameter server contains a server group and several worker groups, in which a group has several machines”, where the parameter server is a server module and a worker is a worker module.), wherein the server module and the N worker modules are configured to train a model parameter within at least one training period, each of the at least one training period comprises K iterations (Li et al. [Section 2: Distributed Subgradient Descent, Paragraph 1] “The model w is learned iteratively”, where the training period of the model is iterative, where the model w is the model parameter that is trained iteratively, where K iterations (t in algorithm 2) start from 1, as seen in Fig. 2) and wherein for each iteration where i is less than K, a local gradient of the (i+1)th iteration is calculated based on the model parameter of the (i+1)th iteration and a sample data of the (i+1)th iteration (Li et al. [Section 2: Distributed Subgradient Descent, Paragraphs 1-2] “In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w…A worker needs the model w to compute the gradients…For the motivating example introduced in (1), we can implement a standard distributed subgradient descent algorithm using the parameter server”, where the local gradient of the same iteration is calculated based on the shared model parameter w, the model parameter, of that iteration and training data, a sample data.), where r and f each are a positive integer less than or equal to i, where N and K each are an integer greater than or equal to 1, and where i is an integer greater than or equal to 1 and less than or equal to K (Li et al. [Section 2] “For the motivating example introduced in (1), we can implement a standard distributed subgradient descent algorithm [34] using the parameter server. As illustrated in Figure 2 and Algorithm 1, training data is partitioned and distributed among all the workers. The model w is learned iteratively… In each iteration, each worker computes the local gradients using its own training data, and the servers aggregate these gradients to update the globally shared parameter w. Then the workers retrieve the updated weights from the servers…”, where i iterations consist of calculating the local gradient until the gradient change is applied by an agent within K iterations, where the pushing of the local gradient, the f iteration, and the pulling of the global model, the r iteration, do not always occur at every i iteration, and where N represents the plurality of workers.).
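For illustration only, the iterative procedure quoted above from Li et al. (each worker computes a local gradient on its own training data, the servers aggregate the pushed gradients to update the shared parameter w, and the workers then retrieve the updated w) can be sketched as follows. This sketch is not taken from Li et al. or from the application: the function names, the one-parameter least-squares loss, the averaging of gradients, the learning rate, and the data shards are all hypothetical assumptions chosen only to make the example runnable.

```python
# Illustrative sketch only. All names, the loss function, the learning rate,
# and the data are assumptions; none of this is code from Li et al.
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair for a one-parameter model y ~ w * x


def local_gradient(w: float, shard: List[Sample]) -> float:
    """Worker step: gradient of the mean squared error on this worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def server_update(w: float, pushed: List[float], lr: float) -> float:
    """Server step: aggregate the pushed local gradients and update shared w."""
    global_gradient = sum(pushed) / len(pushed)  # assumed aggregation: mean
    return w - lr * global_gradient


def train(shards: List[List[Sample]], iterations: int,
          lr: float = 0.1, w: float = 0.0) -> float:
    """Synchronous loop: every worker pushes before the server updates w."""
    for _ in range(iterations):
        # Each worker computes a local gradient from the current w and "pushes" it.
        pushed = [local_gradient(w, shard) for shard in shards]
        # The server aggregates; the workers "pull" the new w for the next iteration.
        w = server_update(w, pushed, lr)
    return w


# Hypothetical training data partitioned across three workers (true w is 2).
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)], [(1.5, 3.0), (0.5, 1.0)]]
w_final = train(shards, iterations=200)
```

In this sketch the server waits for all workers in every iteration, which corresponds to the aggregation logic quoted from Section 3.1 of Li et al.; it does not model pushes or pulls that lag behind the current iteration.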





Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Gemello (US 20160267380 A1) teaches a method, apparatus, and training system for training a neural network using a plurality of agents performing pipelined gradient analysis.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IAN K ALLEYNE whose telephone number is (571)272-1327. The examiner can normally be reached 8:30 - 5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Kawsar, can be reached at 571-270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) 





/IAN K ALLEYNE/Examiner, Art Unit 2127

/ABDULLAH AL KAWSAR/Supervisory Patent Examiner, Art Unit 2127