DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
2.	Applicant’s arguments with respect to claims 1-29 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Applicant argues that Ravi et al. do not suggest the claimed limitation wherein the student (slave) outputs second configuration data from which the teacher (master) learns (Amendment, pages 11, 12).

Claim Rejections - 35 USC § 103
3.	The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
4.	Claims 1-29 are rejected under 35 U.S.C. 103 as being unpatentable over Ravi et al. (US PAP 2020/0125956) in view of Xu et al. (US PAP 2021/0150407).
As per claims 1, 23, 28, 29, Ravi et al. teach a distributed system for training a neural network model, the system comprising: 
a master device including a first version of the neural network model (“the teacher or trainer model parameters”; paragraph 111); and 
a slave device communicatively coupled to a first data source and the master device, the first data source being inaccessible by the master device (“the trainer model can also be jointly trained with multiple student models of different sizes.”; paragraph 112), 
wherein the slave device is remote from the master device (“server computing devices”; paragraphs 85, 92), 
wherein the master device is configured to output first configuration data for the neural network model based on the first version of the neural network model (“The trainer or teacher model can be any type of model, including, as examples, feed forward neural networks, recurrent neural networks (e.g., long short-term memory networks), quasi-RNNs, convolutional neural networks, ProjectionNets (e.g., dense and sparse versions), BiLSTM (bi-directional LSTMs), depth-separable ConvNets, MobileNets, ProjectionCNN, NASNets, Inception (e.g., Inception v3), ResNet, and/or other types of machine-learned models.”; paragraph 51), 
wherein the slave device is configured to use the first configuration data to instantiate a second version of the neural network model (“Example student models include feed forward neural networks, recurrent neural networks (e.g., long short-term memory networks), quasi-RNNs, convolutional neural networks, ProjectionNets (e.g., dense and sparse versions), BiLSTM (bi-directional LSTMs), depth-separable ConvNets, MobileNets, ProjectionCNN, NASNets, Inception (e.g., Inception v3), ResNet, and/or other types of machine-learned models.”; paragraph 51), 
wherein the slave device is configured to train the second version of the neural network model using data from the first data source (“a teacher-student setup where the knowledge of the trainer model is utilized to learn an equivalent compact student model with minimal loss in accuracy.”; paragraphs 109, 111),
wherein the slave device outputs second configuration data for the neural network model (paragraphs 109, 111).
However, Ravi et al. do not specifically teach that the master device is configured to learn from the second configuration data to update parameters for the first version of the neural network model.
Xu et al. disclose that based on the state information generated by the student model, the teacher model updates its teaching actions so as to refine the machine learning problem of the student model. The student model then performs its learning process based on the inputs from the teacher model and provides reward signals (e.g., the accuracy on the training data) back to the teacher model afterwards. The teacher model then utilizes such rewards to update its parameters via policy gradient methods, which are a type of a reinforcement learning technique (paragraph 31).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have the teacher model learn from the student model, as taught by Xu et al., in the system of Ravi et al., because doing so would help improve prediction accuracy (paragraph 27).
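
For illustration only, the feedback loop that Xu et al. describe in paragraph 31 can be sketched as follows. This is a minimal sketch and not the implementation of either reference: the two hypothetical teaching actions, the `student_reward` stand-in for the student's reported training accuracy, the learning rate, and the running baseline are all assumptions chosen to keep the example small. The update shown is a plain REINFORCE-style rule, which is one member of the family of policy gradient methods named in the cited paragraph.

```python
import math
import random


def softmax(logits):
    """Convert teacher logits into a probability over teaching actions."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def student_reward(action, rng):
    """Hypothetical stand-in for: the student trains under the chosen
    teaching action and reports its training accuracy as a reward."""
    base_accuracy = [0.2, 0.8][action]  # assumed values, illustration only
    return base_accuracy + rng.uniform(-0.05, 0.05)


def train_teacher(steps=500, lr=0.1, seed=0):
    """Update the teacher's parameters from student rewards (REINFORCE)."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # teacher parameters, one per teaching action
    baseline = 0.0       # running average reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        action = 0 if rng.random() < probs[0] else 1
        reward = student_reward(action, rng)
        advantage = reward - baseline
        # Gradient of log pi(action) w.r.t. logit k is 1[k == action] - probs[k].
        for k in range(len(logits)):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log_pi
        baseline += 0.1 * (reward - baseline)
    return softmax(logits)
```

Calling `train_teacher()` returns the teacher's final action probabilities; under the assumed rewards, the teacher comes to prefer the action that yields the higher student accuracy.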

As per claim 2, Ravi et al. in view of Xu et al. further disclose the slave device is configured to use the first configuration data to instantiate the first version of the neural network model as a teacher model and to instantiate the second version of the neural network model as a student model, the teacher model being used to train the student model (“a teacher-student setup where the knowledge of the trainer model is utilized to learn an equivalent compact student model with minimal loss in accuracy.”; Ravi et al., paragraphs 109, 111).

As per claims 3, 24, Ravi et al. in view of Xu et al. further disclose the master device is configured to use the second configuration data to instantiate the second version of the neural network model as a teacher model, and to instantiate the first version of the neural network model as a student model, the teacher model being used to train the student model and update the parameters for the first version of the neural network model (“a teacher-student setup where the knowledge of the trainer model is utilized to learn an equivalent compact student model with minimal loss in accuracy… During training, the teacher or trainer model parameters can be held fixed (e.g., as in distillation) or jointly optimized to improve both models simultaneously”; Ravi et al., paragraphs 109, 111, 112).

As per claim 4, Ravi et al. in view of Xu et al. further disclose the first configuration data includes parameters for the first version of the neural network model (Ravi et al., paragraphs 111, 112).

As per claim 5, Ravi et al. in view of Xu et al. further disclose the second configuration data includes parameters for the second version of the neural network model (Ravi et al., paragraphs 111, 112).

As per claim 6, Ravi et al. in view of Xu et al. further disclose the second configuration data includes gradient data (“the model (e.g., network) learns to optimize the weights and activations in the quantized space using gradients computed via backpropagation.”; Ravi et al., paragraph 110).

As per claim 7, Ravi et al. in view of Xu et al. further disclose the master device is communicatively coupled to a first network and the slave device is communicatively coupled to a second network, the first and second networks being heterogeneous and communicatively coupled by one or more untrusted devices (“different versions of machine-learned models can be distributed to different types of devices. In one example, a larger, more complex model can be distributed to devices with more advanced computing capabilities (e.g., larger memory size, faster processor, etc.) while a smaller, less complex model can be distributed to devices with less advanced computing capabilities. In other examples, different model versions can be downloaded to different devices depending on the geolocation of the device”; Ravi et al., paragraph 75).

As per claim 8, Ravi et al. in view of Xu et al. further disclose a plurality of slave devices each in communication with the master device, wherein the master device is configured to use second configuration data output by each of the plurality of slave devices to update parameters for the first version of the neural network model (“the teacher or trainer model parameters can be held fixed (e.g., as in distillation) or jointly optimized to improve both models simultaneously”; Ravi et al., paragraphs 75, 111).

As per claims 9, 25, Ravi et al. in view of Xu et al. further disclose the master device is configured to use second configuration data output by each of the plurality of slave devices to instantiate an ensemble of second versions of the neural network model and to use the ensemble to train the first version of the neural network model (“a teacher-student setup where the knowledge of the trainer model is utilized to learn an equivalent compact student model with minimal loss in accuracy… During training, the teacher or trainer model parameters can be held fixed (e.g., as in distillation) or jointly optimized to improve both models simultaneously”; Ravi et al., paragraphs 109, 111, 112).

As per claim 10, Ravi et al. in view of Xu et al. further disclose the master device is configured to use aggregate data derived from the second configuration data output by each of the plurality of slave devices to update parameters for the first version of the neural network model (“the trainer model can also be jointly trained with multiple student models of different sizes. As an example, FIG. 16B depicts a graphical diagram of the example joint training scheme used to train multiple student models according to example aspects of the present disclosure”; Ravi et al., paragraph 169).

As per claim 11, Ravi et al. in view of Xu et al. further disclose the master device and the plurality of slave devices are communicatively coupled according to a defined graph model (“(e.g., create_graph) of each model, 3)”; Ravi et al., paragraphs 129, 169).

As per claims 12, 26, Ravi et al. in view of Xu et al. further disclose the second configuration data includes gradient data from each of the plurality of slave devices and the master device is configured to compare the gradient data from each of the plurality of slave devices to selectively update the parameters for the first version of the neural network model based on the comparison (“the model (e.g., network) learns to optimize the weights and activations in the quantized space using gradients computed via backpropagation.”; Ravi et al., paragraph 110).

As per claims 13, 27, Ravi et al. in view of Xu et al. further disclose the master device is communicatively coupled to a second data source that is inaccessible by the slave device and the master device is configured to train the first version of the neural network model using data from the second data source (“The developer can use the pre-trained model as-is or can retrain the model on additional training data.”; Ravi et al., paragraph 31).

As per claim 14, Ravi et al. in view of Xu et al. further disclose the slave device includes at least one processor to execute a binary executable stored in memory and the executed binary executable is configured to load the first configuration data and instantiate a second version of the neural network model independently of the master device (“The binary representation can be significant since this results in a significantly compact representation for the projection network parameters that in turn reduces the model size considerably compared to the trainer network.”; Ravi et al., paragraph 161).

As per claim 15, Ravi et al. in view of Xu et al. further disclose the executed binary executable is configured to output the second configuration data and to control transmission to the master device (Ravi et al., paragraph 161).

As per claim 16, Ravi et al. in view of Xu et al. further disclose the neural network model forms part of a speech recognition pipeline and the first data source stores audio data (Ravi et al., paragraph 68).

As per claim 17, Ravi et al. in view of Xu et al. further disclose the slave device is configured to augment the audio data from the first data source with audio noise (“introducing/or transformations of existing samples (e.g., add noises, rotations, perturbations, etc.)”; Ravi et al., paragraph 189).

As per claim 18, Ravi et al. in view of Xu et al. further disclose the first configuration data includes hyperparameters for the neural network model and parameters for the first version of the neural network model (“the full trainer model (e.g., using existing architectures like Feed-forward NNs or LSTM RNNs) combined with a simpler student model.”; Ravi et al., paragraph 133).

As per claim 19, Ravi et al. in view of Xu et al. further disclose the hyperparameters include one or more of: an architecture definition for the neural network model; a number of nodes for one or more layers in the neural network model; a set of node definitions including at least one of a node type and a node connectivity; a set of activation function definitions; and at least one cost function definition (“the full trainer model (e.g., using existing architectures like Feed-forward NNs or LSTM RNNs) combined with a simpler student model.”; Ravi et al., paragraph 133).

As per claim 20, Ravi et al. in view of Xu et al. further disclose the parameters include at least one of: weight values for at least one connection between nodes of the neural network model; weight values for at least one input to the neural network model; weight values for at least one recurrent path in the neural network model; and bias values for at least one node of the neural network model (“the model (e.g., network) learns to optimize the weights and activations in the quantized space using gradients computed via backpropagation. This can be more effective than applying this method post training (e.g., quantizing pre-trained weights just for inference)”; Ravi et al., paragraph 110).

As per claim 21, Ravi et al. in view of Xu et al. further disclose the second configuration data includes gradient data and the master device is configured to weight the gradient data based on an age of the second configuration data (“the model (e.g., network) learns to optimize the weights and activations in the quantized space using gradients computed via backpropagation.”; Ravi et al., paragraph 110).
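
For illustration only, the age-based weighting of gradient data recited in claim 21 could take a form such as the following. This is a hypothetical sketch: neither the claim language nor the cited paragraph specifies a weighting function, and the exponential half-life decay, the function name `age_weighted_gradient`, and the (gradient, timestamp) representation are assumptions made solely for this example.

```python
def age_weighted_gradient(updates, now, half_life=10.0):
    """Combine gradients from several devices, discounting older ones.

    updates:   list of (gradient_vector, timestamp) pairs (assumed format)
    now:       current time, in the same units as the timestamps
    half_life: age at which an update's weight drops to one half (assumed)
    """
    if not updates:
        raise ValueError("at least one update is required")
    dim = len(updates[0][0])
    weighted_sum = [0.0] * dim
    weight_total = 0.0
    for gradient, timestamp in updates:
        if len(gradient) != dim:
            raise ValueError("all gradients must have the same length")
        age = max(0.0, now - timestamp)
        weight = 0.5 ** (age / half_life)  # exponential decay with age
        for i, g in enumerate(gradient):
            weighted_sum[i] += weight * g
        weight_total += weight
    # Normalize so the result is a weighted average of the inputs.
    return [s / weight_total for s in weighted_sum]
```

With a half-life of 10 time units, an update that is 10 units old contributes half as much as a fresh one, so stale second configuration data has a smaller effect on the master's parameters.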

As per claim 22, Ravi et al. in view of Xu et al. further disclose the second configuration data includes gradient data and the master device is configured to compare the gradient data from the second configuration data with gradient data determined using the first version of the neural network model and to selectively update the parameters for the first version of the neural network model based on the comparison (“the model (e.g., network) learns to optimize the weights and activations in the quantized space using gradients computed via backpropagation.”; Ravi et al., paragraph 110).

Conclusion
5.	Applicant’s amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.

6.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD SAINT-CYR whose telephone number is (571)272-4247. The examiner can normally be reached Monday through Friday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached at (571)272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/LEONARD SAINT-CYR/
Primary Examiner, Art Unit 2658