DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are presented for examination.


Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
 
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-5 and 8-13 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Sridharan et al. (U.S. 2019/0205745).
Sridharan was cited on the IDS filed 23 December 2020.


With respect to claim 1, Sridharan teaches a system, comprising: a group of target devices, the group of target devices comprising one or more target devices, each target device of the group of target devices (multiple sets of worker nodes 2216A-2216B, 2236A-2236B – see Sridharan, page 23, paragraph 223) being communicatively connected to a parameter server (the nodes are directly communicating with the parameter server – see Sridharan, Fig. 22, element 2220; pages 22-23, paragraph 220) that stores a master copy of an artificial intelligence (AI) model (in data parallelism 1904, the different nodes of the distributed network have a complete instance of the model and each node receives a different portion of the data. The results from the different nodes are then combined. Data parallel training approaches all require a technique of combining results and synchronizing the model parameters between each node. Exemplary approaches to combining data include parameter averaging and update based data parallelism. Parameter averaging trains each node on a subset of the training data and sets the global parameters (e.g., weights, biases) to the average of the parameters from each node. Parameter averaging uses a central parameter server that maintains the parameter data. Update based data parallelism is similar to parameter averaging except that instead of transferring parameters from the nodes to the parameter server, the updates to the model are transferred – see Sridharan, pages 18-19, paragraph 190), the group of target devices being configured to run an instance of the AI model (the different nodes of the distributed network have a complete instance of the model and each node receives a different portion of the data – see Sridharan, Fig. 19, element 1904; pages 18-19, paragraph 190), each target device comprises: a downloader configured to download a portion of the AI model (each layer of a neural network can be trained by a different processing node of the distributed system – see Sridharan, page 18, paragraph 189) from the parameter server (in data parallelism 1904, the different nodes of the distributed network have a complete instance of the model and each node receives a different portion of the data. The results from the different nodes are then combined. Data parallel training approaches all require a technique of combining results and synchronizing the model parameters between each node. Exemplary 

With respect to claim 2, Sridharan teaches the invention described in claim 1, including the system wherein the executer is further configured to execute the set of microbatches of the dataset on the second subportion using the downloaded weights for the second subportion; the downloader is further configured to download weights for a third subportion of the downloaded portion of the AI model into the memory of the target device from the parameter server; wherein the executing the set of microbatches of the dataset on the second subportion and the downloading weights for the third subportion are performed contemporaneously (Sridharan, page 20, paragraph 202).

With respect to claim 3, Sridharan teaches the invention described in claim 1, including the system wherein the executer is further configured to execute the set of microbatches on the second subportion using the downloaded weights for the second subportion; and the downloader is further configured to download weights for a third subportion of the downloaded portion of the AI model from the parameter server; wherein the executing the set of microbatches of the dataset on the second subportion and the downloading weights for the third subportion are performed serially (Sridharan, page 20, paragraph 202).
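
For illustration of the distinction between the "contemporaneously" limitation of claim 2 and the "serially" limitation of claim 3, the following minimal Python sketch contrasts the two schedules. It is not taken from Sridharan or from the application. The names download_weights, execute, run_serial, and run_overlapped, and the use of a plain list as the parameter server, are assumptions made only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def download_weights(server, k):
    """Hypothetical stand-in for fetching the weights of subportion k."""
    return server[k]


def execute(microbatches, weights):
    """Hypothetical stand-in for executing microbatches on one subportion."""
    return [x * weights for x in microbatches]


def run_serial(server, microbatches):
    """Download the weights for a subportion, then execute on it, one after the other."""
    acts = microbatches
    for k in range(len(server)):
        weights = download_weights(server, k)
        acts = execute(acts, weights)
    return acts


def run_overlapped(server, microbatches):
    """Execute on subportion k while the weights for subportion k+1 download."""
    acts = microbatches
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(download_weights, server, 0)
        for k in range(len(server)):
            weights = pending.result()  # wait for the weights of subportion k
            if k + 1 < len(server):
                # Start the next download before executing the current subportion.
                pending = pool.submit(download_weights, server, k + 1)
            acts = execute(acts, weights)
    return acts
```

Under these assumptions both schedules produce the same activations; they differ only in whether the download for the next subportion is in flight during execution of the current one.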

With respect to claim 4, Sridharan teaches the invention described in claim 1, including the system wherein the set of microbatches comprises a plurality of microbatches that are configured to be executed in sequential order (Sridharan, page 20, paragraph 202), the set of microbatches forming a minibatch that comprises a number of samples per update for training (a subset of the samples in the mini-batch – see Sridharan, page 20, paragraph 202) or a number of samples served in every inference cycle for inference.
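
For illustration of the relationship recited in claim 4 between microbatches and a minibatch, the following minimal Python sketch splits one minibatch into microbatches, executes them in sequential order, and produces a single averaged value per minibatch. The function names, the microbatch_size parameter, and the per-sample grad_fn callback are hypothetical and are not drawn from Sridharan or from the application.

```python
def split_into_microbatches(minibatch, microbatch_size):
    """Split a minibatch (a list of samples) into consecutive microbatches."""
    return [
        minibatch[i:i + microbatch_size]
        for i in range(0, len(minibatch), microbatch_size)
    ]


def accumulate_over_minibatch(minibatch, microbatch_size, grad_fn):
    """Execute microbatches in sequential order and average over the minibatch.

    grad_fn is a hypothetical per-sample gradient function returning a float.
    """
    total = 0.0
    for microbatch in split_into_microbatches(minibatch, microbatch_size):
        total += sum(grad_fn(sample) for sample in microbatch)
    # One update per minibatch: the minibatch size is the number of samples per update.
    return total / len(minibatch)
```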

With respect to claim 5, Sridharan teaches the invention described in claim 4, including the system wherein each of the target device further comprises an output manager configured to: send the activations for the first subportion to the parameter server or save the activations on the target device for a forward pass during training of the AI model; and restore the activations for a backward pass during the training of the AI model (Sridharan, page 20, paragraph 202).

With respect to claim 8, Sridharan teaches the invention described in claim 1, including the system wherein each target device comprises at least one of an application-specific integrated circuit, a graphics processing unit (Sridharan, page 24, paragraph 230) or an edge device.

With respect to claim 9, Sridharan teaches a method implemented in a target device (multiple sets of worker nodes 2216A-2216B, 2236A-2236B – see Sridharan, page 23, paragraph 223), comprising: downloading a portion of an artificial intelligence (AI) model (each layer of a neural network can be trained by a different processing node of the distributed system – see Sridharan, page 18, paragraph 189) from a parameter server (in data parallelism 1904, the different nodes of the distributed network have a complete instance of the model and each node receives a different portion of the data. The results from the different nodes are then combined. Data parallel training approaches all require a technique of combining results and synchronizing the model parameters between each node. Exemplary approaches to combining data include parameter averaging and update based data parallelism. Parameter averaging trains each node on a subset of the training data and sets the global parameters (e.g., weights, biases) to the average of the parameters from each node. Parameter averaging uses a central parameter server that maintains the parameter data. Update based data parallelism is similar to parameter averaging except that instead of transferring parameters from the nodes to the parameter server, the updates to the model are transferred – see Sridharan, pages 18-19, paragraph 190); storing a set of microbatches (a subset of 

Claims 10-13 do not teach or define any new limitations above claims 2-5 and therefore are rejected for similar reasons.
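
For illustration of the two approaches to combining results characterized in the passage quoted above from Sridharan, paragraph 190, the following minimal Python sketch shows parameter averaging and update based data parallelism on flat lists of floats. The function names, the learning_rate parameter, and the choice to average the updates before applying them are assumptions made only for this example and are not asserted to be Sridharan's implementation.

```python
def average_parameters(node_params):
    """Parameter averaging: set each global parameter to the mean across nodes.

    node_params is a list with one parameter list per node.
    """
    n = len(node_params)
    return [sum(values) / n for values in zip(*node_params)]


def apply_updates(global_params, node_updates, learning_rate):
    """Update based data parallelism: nodes transfer updates, not parameters.

    Assumption for this sketch: the parameter server averages the updates
    and takes one gradient-descent style step.
    """
    mean_update = average_parameters(node_updates)
    return [p - learning_rate * u for p, u in zip(global_params, mean_update)]
```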


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
 
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 6, 14, and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Sridharan in view of de Vangel et al. (U.S. 2020/0226458).

With respect to claim 6, Sridharan teaches the invention described in claim 4, including a system, comprising: a group of target devices, the group of target devices comprising one or more target devices, each target device of the group of target devices (multiple sets of worker nodes 2216A-2216B, 2236A-2236B – see Sridharan, page 23, paragraph 223) being communicatively connected to a parameter server (the nodes are directly communicating with the parameter server – see Sridharan, Fig. 22, element 2220; pages 22-23, paragraph 220) that stores a master copy of an artificial intelligence (AI) model (in data parallelism 1904, the different nodes of the distributed network have a complete instance of the model and each node receives a different portion of the data. The results from the different nodes are then combined. Data parallel training approaches all require a technique of combining results and 
Sridharan does not explicitly teach the system wherein each batch of the set of batches has a batch size selected based on a rate of execution of the plurality of batches and a rate of communication.
However, de Vangel teaches the system wherein each batch of the set of batches has a batch size selected based on a rate of execution of the plurality of batches and a rate of communication (de Vangel, Fig. 7; page 5, paragraph 68).


With respect to claim 16, Sridharan teaches a system, comprising: a parameter server (the nodes are directly communicating with the parameter server – see Sridharan, Fig. 22, element 2220; pages 22-23, paragraph 220) communicatively connected to a group of target devices, the group of target devices comprising one or more target devices, the group of target devices (multiple sets of worker nodes 2216A-2216B, 2236A-2236B – see Sridharan, page 23, paragraph 223) being configured to run an instance of an artificial intelligence (AI) model (the different nodes of the distributed network have a complete instance of the model and each node receives a different portion of the data – see Sridharan, Fig. 19, element 1904; pages 18-19, paragraph 190), the parameter server (the nodes are directly communicating with the parameter server – see Sridharan, Fig. 22, element 2220; pages 22-23, paragraph 220) comprises: a data manager configured to store a master copy of the AI model (in data parallelism 1904, the different nodes of the distributed network have a complete instance of the model and each node receives a different portion of the data. The results from the different nodes are then combined. Data parallel training approaches all require a technique of combining results and synchronizing the model parameters between each node. Exemplary approaches to combining data include parameter averaging and update based data parallelism. Parameter averaging trains each node on a subset of the training data and sets the global parameters (e.g., weights, biases) to the average of the parameters from each node. Parameter averaging uses a central parameter server that maintains the parameter data. Update based data parallelism is similar to 
Sridharan does not explicitly teach determine a batch size suitable for each target device of the group of target devices.
However, de Vangel teaches determine a batch size suitable for each target device of the group of target devices (the blocks 740, 750, and 760 can be iterated to find a configuration of batch sizes of layers of the ANN corresponding to an optimal performance measure. The performance measure can be tuned by a user. The user may indicate whether the performance measure is to correspond to a minimum latency of the ANN or a maximum throughput of the ANN. The batch size for the ANN can be then selected as a maximum of batch sizes of the layers of the ANN – see de Vangel, Fig. 7; page 5, paragraph 68).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sridharan in view of de Vangel in order to determine a batch size suitable for each target device of the group of target devices. One would be motivated to do so in order to optimize artificial neural network (ANN) computations based on automatic determination of a batch size (de Vangel, page 1, paragraph 1).	
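
For illustration of the batch size determination characterized above from de Vangel, paragraph 68, the following minimal Python sketch picks, for each layer, the candidate batch size with the best value of a performance measure and then selects the batch size for the network as the maximum of the per-layer batch sizes. The function name, the candidate list, and the measure callback (lower is better, e.g., latency) are hypothetical; the iteration over blocks 740, 750, and 760 of de Vangel is not reproduced here.

```python
def select_batch_sizes(layers, candidates, measure):
    """Choose a batch size per layer, then one batch size for the whole network.

    measure(layer, batch_size) is a hypothetical performance measure where a
    lower value is better (for example, latency).
    """
    per_layer = {
        layer: min(candidates, key=lambda b: measure(layer, b))
        for layer in layers
    }
    # The batch size for the network is the maximum of the per-layer batch sizes.
    return per_layer, max(per_layer.values())
```

A user-tunable measure, such as one favoring minimum latency or maximum throughput, would be supplied through the measure callback in this sketch.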
	
With respect to claim 17, the combination of Sridharan and de Vangel teaches the invention described in claim 16, including the system wherein the microbatch (Sridharan, page 20, paragraph 202) size is based on a rate of executing a set of (de Vangel, Fig. 7; page 5, paragraph 68) microbatches (Sridharan, page 20, paragraph 202) at each target device (Sridharan, page 23, paragraph 223) and a rate of communication (de Vangel, Fig. 7; page 5, paragraph 68) between the target device (Sridharan, page 23, paragraph 223) and the parameter server (Sridharan, Fig. 22, element 2220; pages 22-23, paragraph 220).
The combination of references is made under the same rationale as claim 16 above.

With respect to claim 18, the combination of Sridharan and de Vangel teaches the invention described in claim 16, including the system wherein the parameter server further comprises an output data manager configured to: receive activations from each target device after each minibatch is executed; and generate output activations for a subportion of the downloaded portion of the AI model based on the received activations (Sridharan, page 20, paragraphs 204-205).
The combination of references is made under the same rationale as claim 16 above.

With respect to claim 19, the combination of Sridharan and de Vangel teaches the invention described in claim 16, including the system wherein the parameter server further comprises a weight updater configured to: update weights of the AI model based on gradients received from each target device (Sridharan, page 20, paragraph 202).
The combination of references is made under the same rationale as claim 16 above.

With respect to claim 20, the combination of Sridharan and de Vangel teaches the invention described in claim 16, including the system wherein the parameter server (Sridharan, Fig. 22, element 2220; pages 22-23, paragraph 220) comprises a central processing unit (Sridharan, page 28, paragraph 256), a field programmable gate array, or an application-specific integrated circuit.
The combination of references is made under the same rationale as claim 16 above.

Claim 14 does not teach or define any new limitations above claim 6 and therefore is rejected for similar reasons.


Allowable Subject Matter
Claims 7 and 15 are objected to as being dependent upon rejected base claims, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Alicia Baturay whose telephone number is (571) 272-3981. The examiner can normally be reached at 7am – 4pm, Mondays – Thursdays, Eastern Time.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Wing Chan, can be reached at (571) 272-7493. The fax number for the organization where this application or proceeding is assigned is (571) 273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).




/Alicia Baturay/
Primary Examiner, Art Unit 2441

March 29, 2021