DETAILED ACTION
This action is written in response to the application filed 5/20/20. The present application is being examined under the pre-AIA first to invent provisions.

Information Disclosure Statements
The IDSs dated 5/20/20 and 9/22/21 have been considered. The Examiner notes that copies of the NPL references cited in the former IDS can be found in the file wrapper for parent application 16/393063.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 21-40 are rejected on the ground of nonstatutory double patenting as being unpatentable over the corresponding claims of U.S. Patent No. 10,719,761. Although the claims at issue are not identical, they are not patentably distinct for the reasons outlined in the tables below.
16/879187 – this application 
US 10,719,761 B2 – issued patent
21. A system comprising:
1. A system comprising:
a main neural network implemented by one or more computers, the main neural network comprising a Mixture of Experts (MoE) subnetwork between a first neural network layer and a second neural network layer in the main neural network, wherein the MoE subnetwork comprises:
a main neural network implemented by one or more computers, the main neural network comprising a Mixture of Experts (MoE) subnetwork between a first neural network layer and a second neural network layer in the main neural network, wherein the MoE subnetwork comprises:
a plurality of expert neural networks, wherein each expert neural network is configured to process a first layer output generated by the first neural network layer in accordance with a respective set of expert parameters of the expert neural network to generate a respective expert output, and
a plurality of expert neural networks, wherein each expert neural network is configured to process a first layer output generated by the first neural network layer in accordance with a respective set of expert parameters of the expert neural network to generate a respective expert output, and
a gating subsystem configured to:
a gating subsystem configured to:
generate a modified first layer output by applying a set of gating parameters to the first layer output,
generate an initial gating output by applying a set of gating parameters to the first layer output,
apply a set of trainable noise parameters to the first layer output to generate an initial noise output, 

apply a sparsifying function to the initial gating output to generate a sparsified initial gating output,

apply a softmax function to the sparsified initial gating output to generate a weight vector that includes a respective weight for each of the plurality of expert neural networks,
generate a final noise output from the initial noise output and a vector of noise values sampled from a distribution,
[From claim 7] generate a final noise output; and
add the final noise output to the modified first layer output to generate an initial gating output,
[From claim 7] adding the final noise output to the modified first layer output.
select, based on the initial gating output, one or more of the expert neural networks and determine a respective weight for each selected expert neural network, 
select, based on the weights in the weight vector, one or more of the expert neural networks and determine a respective weight for each selected expert neural network,
provide the first layer output as input to each of the selected expert neural networks,
provide the first layer output as input to each of the selected expert neural networks,
combine the expert outputs generated by the selected expert neural networks in accordance with the weights for the selected expert neural networks to generate an MoE output, and
combine the expert outputs generated by the selected expert neural networks in accordance with the weights for the selected expert neural networks to generate an MoE output, and
provide the MoE output as input to the second neural network layer.
provide the MoE output as input to the second neural network layer.
As illustrated in the table above, every limitation of claim 21 of this application has a corresponding equivalent or broader limitation in claim 1 of the ‘761 patent. Thus, claim 1 of the ‘761 patent anticipates claim 21 of this application.
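For illustration only, the noise-related gating steps recited in claim 21 of this application (modified first layer output, initial noise output, final noise output, initial gating output) can be sketched as follows. This is a minimal sketch and is not taken from the application or the ‘761 patent: the function names, the matrix-vector form of "applying" parameters, the element-wise product (which claim 29 recites), and the standard normal distribution are all assumptions made for the example.

```python
import random

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def noisy_gating_output(x, gating_params, noise_params, rng):
    """Return an initial gating output for a first layer output x (hypothetical helper)."""
    # "modified first layer output": gating parameters applied to x
    modified = matvec(gating_params, x)
    # "initial noise output": trainable noise parameters applied to x
    initial_noise = matvec(noise_params, x)
    # vector of noise values sampled from a distribution (assumed standard normal)
    samples = [rng.gauss(0.0, 1.0) for _ in initial_noise]
    # "final noise output": assumed element-wise product, as in claim 29
    final_noise = [n * s for n, s in zip(initial_noise, samples)]
    # "initial gating output": modified first layer output plus final noise output
    return [m + f for m, f in zip(modified, final_noise)]

x = [1.0, 2.0]
gating_params = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]  # one row per expert
noise_params = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # zero noise scale for a deterministic demo
print(noisy_gating_output(x, gating_params, noise_params, random.Random(0)))
```

With the noise parameters set to zero the result reduces to the modified first layer output; non-zero noise parameters perturb each entry by an input-dependent amount.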





16/879187 – this application 
US 10,719,761 B2 – issued patent
22. The system of claim 21, wherein the expert neural networks have the same or similar architectures but different parameter values.
2. The system of claim 1, wherein the expert neural networks have the same or similar architectures but different parameter values.
23. The system of claim 21, wherein combining the expert outputs generated by the selected expert neural network comprises:

weighting the expert output generated by each of the selected expert neural networks by the weight for the selected expert neural network to generate a weighted expert output, and summing the weighted expert outputs to generate the MoE output.
3. The system of claim 1, wherein combining the expert outputs generated by the selected expert neural network comprises:

weighting the expert output generated by each of the selected expert neural networks by the weight for the selected expert neural network to generate a weighted expert output, and
summing the weighted expert outputs to generate the MoE output.
24. The system of claim 21, wherein the gating subsystem comprises a gating subnetwork, and wherein the gating subnetwork is configured to:
process the first layer output to generate a weight vector that includes a respective weight for each of the plurality of expert neural networks in accordance with a set of gating parameters, and
select one or more of the expert neural networks based on the weights in the weight vector.
6. The system of claim 1, wherein generating the initial gating output comprises:
applying the set of gating parameters to the first layer output to generate a modified first layer output; and
adding tunable Gaussian noise to the modified first layer output to generate the initial gating output.
[from claim 1] select, based on the weights in the weight vector, one or more of the expert neural networks and determine a respective weight for each selected expert neural network,
25. The system of claim 24, wherein the weight vector is a sparse vector that includes non-zero weights for only a few of the expert neural networks.
4. The system of claim 1, wherein the weight vector is a sparse vector that includes non-zero weights for only a few of the expert neural networks.
26. The system of claim 24, wherein selecting one or more of the expert neural networks comprises:
selecting only expert neural networks that have non-zero weights in the weight vector.
5. The system of claim 1, wherein selecting one or more of the expert neural networks comprises:
selecting only expert neural networks that have non-zero weights in the weight vector.
27. The system of claim 24, wherein processing the first layer output to generate a weight vector that includes a respective weight for each of the plurality of expert neural networks in accordance with a set of gating parameters comprises:

applying a sparsifying function to the initial gating output to generate a sparsified initial gating output; and
applying a softmax function to the sparsified initial gating output to generate the weight vector.

[from claim 1] apply a sparsifying function to the initial gating output to generate a sparsified initial gating output,
apply a softmax function to the sparsified initial gating output to generate a weight vector that includes a respective weight for each of the plurality of expert neural networks,
28. The system of claim 27, wherein the sparsifying function sets all values in the initial gating output other than the k highest values to a value that is mapped to zero by the softmax function.
8. The system of claim 1, wherein the sparsifying function sets all values in the initial gating output other than the k highest values to a value that is mapped to zero by the softmax function.
29. The system of claim 21, wherein generating the final noise output from the initial noise output and the vector of noise values sampled from a distribution comprises:


element-wise multiplying the initial noise output by a vector of noise values sampled from a normal distribution to generate the final noise output.

7. The system of claim 6, wherein adding tunable Gaussian noise to the modified first layer output to generate the initial gating output comprises:
applying a set of trainable noise parameters to the first layer output to generate an initial noise output;

element-wise multiplying the initial noise output by a vector of noise values sampled from a normal distribution to generate a final noise output; and
adding the final noise output to the modified first layer output.
30. The system of claim 21, wherein the gating subsystem comprises a parent gating subnetwork and a plurality of child gating subnetworks, and wherein each of the child gating subnetworks manages a disjoint subset of the plurality of expert neural networks from each other child gating subnetwork.
9. The system of claim 1, wherein the gating subsystem comprises a parent gating subnetwork and a plurality of child gating subnetworks, and wherein each of the child gating subnetworks manages a disjoint subset of the plurality of expert neural networks from each other child gating subnetwork.
As illustrated in the table above, every limitation of the claims of this application has a corresponding equivalent or broader limitation in the claims of the ‘761 patent. Thus, the claims of the ‘761 patent anticipate the claims of this application.
The Examiner notes that there is a similar correspondence between claims 31-40 of this application and claims 10-17 in the ‘761 patent. (These are method and computer-readable storage medium claims corresponding to the system claims above.)
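For illustration only, the sparsifying and softmax steps recited in claims 26-28 of this application (claims 1, 5 and 8 of the ‘761 patent) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, negative infinity is used as the "value that is mapped to zero by the softmax function", ties are broken by position, and k is assumed to be at least 1.

```python
import math

def sparsify_top_k(gating_output, k):
    """Keep the k highest values and set every other value to -inf (assumes k >= 1)."""
    order = sorted(range(len(gating_output)),
                   key=lambda i: gating_output[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else -math.inf for i, v in enumerate(gating_output)]

def softmax(values):
    """Softmax over the values; math.exp(-inf) is 0.0, so masked entries get zero weight."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def select_experts(weight_vector):
    """Select only the experts that have non-zero weights in the weight vector."""
    return [i for i, w in enumerate(weight_vector) if w != 0.0]

weight_vector = softmax(sparsify_top_k([0.5, 1.0, 3.0, 1.0], k=2))
print(weight_vector)
print(select_experts(weight_vector))
```

In this example only two of the four entries of the weight vector are non-zero, so only those two experts would receive the first layer output.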



Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The following are the references relied upon in the rejections below:
Eigen, primary reference (Eigen D, Ranzato MA, Sutskever I. Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314. 2014 March 9. Cited by Applicant in IDS dated 5/20/20.).
Leonard (Leonard et al., "Distributed conditional computation”, Memoire de maitrise en informatique de l'Universite de Montreal, August 2014, 90 pages. Cited by Applicant in IDS dated 5/20/20.)
Claims 21-40 are rejected under 35 U.S.C. 103 as being unpatentable over Eigen and Leonard.
Regarding claims 21, 31 and 39, Eigen discloses a system (and a related method and computer-readable storage media) comprising:
a main neural network implemented by one or more computers, the main neural network comprising a Mixture of Experts (MoE) subnetwork between a first neural network layer and a second neural network layer in the main neural network, wherein the MoE subnetwork comprises:
P. 3, fig. 1(b) (reproduced below) illustrates a mixture of experts layer situated between hidden layers z1 and z2.
[Figure: Eigen, p. 3, fig. 1(b) (media_image1.png).]
[The Examiner notes that computer implementation of the described technique is inherent throughout the disclosure.]
a plurality of expert neural networks, wherein each expert neural network is configured to process a first layer output generated by the first neural network layer in accordance with a respective set of expert parameters of the expert neural network to generate a respective expert output, and a gating subsystem configured to:
PP. 2-3, sec. 3, fig. 1(b) (reproduced above): each of the plurality of expert neural networks is denoted fab.
PP. 2-3, sec. 3: output from mixture of experts layer is denoted z1, the gating subsystems are e.g. (g1, fi1).
generate a modified first layer output by applying a set of gating parameters to the first layer output,
P. 5, fig. 2, illustrating mean gating outputs for first and second layers.
...
select, based on the initial gating output, one or more of the expert neural networks and determine a respective weight for each selected expert neural network,
PP. 2-3, sec. 3: the disclosed parameter gii is equivalent to the recited weight parameter.
P. 2, sec. 3: see equations reproduced below.
[Image: media_image2.png.]

Excerpt from Eigen, p. 2, sec. 3.
provide the first layer output as input to each of the selected expert neural networks,
PP. 2-3, sec. 3, including fig. 1(b) and excerpts reproduced above.
combine the expert outputs generated by the selected expert neural networks in accordance with the weights for the selected expert neural networks to generate an MoE output, and
Id. The Examiner notes that the outputs are combined according to the equations reproduced above.
provide the MoE output as input to the second neural network layer.
P. 3, fig. 1(b), neural networks outputs are propagated towards the top of the figure.
Leonard discloses the following further limitations which Eigen does not disclose:

apply a set of trainable noise parameters to the first layer output to generate an initial noise output, 
P. 61: NoisyReLU implementations: “In our experiments, we found that a σ = 1 and α = 1 worked well and that some noise was often better than no noise”.
generate a final noise output from the initial noise output and a vector of noise values sampled from a distribution,
P. 61: NoisyReLU implementations: “In our experiments, we found that a σ = 1 and α = 1 worked well and that some noise was often better than no noise”. The Examiner notes that Gaussian noise can be added at every layer of a neural network, e.g. by a noisy rectified linear unit (NoisyReLU), as disclosed by Leonard at pp. 60-61, in order to avoid overfitting.
add the final noise output to the modified first layer output to generate an initial gating output,
P. 61: NoisyReLU implementations: “In our experiments, we found that a σ = 1 and α = 1 worked well and that some noise was often better than no noise”. The Examiner notes that Gaussian noise can be added at every layer of a neural network, e.g. by a noisy rectified linear unit (NoisyReLU), as disclosed by Leonard at pp. 60-61, in order to avoid overfitting.
At the time of filing, it would have been obvious to a person of ordinary skill to apply sparse initialization (as taught by Leonard) in the neural network mixture-of-experts system of Eigen because (1) all neural networks need to have their parameters initialized, (2) this technique “is simple to implement and is supported by empirical evidence” (see Leonard pp. 13-14, sec. 1.4.2), and (3) Leonard specifically applies this technique to a mixture-of-experts model (see sec. 1.5 at p. 20 et seq.). Additionally, Eigen explicitly suggests that the techniques disclosed in that reference could be applied to “large sparse models that compute only a subset of themselves for any given input” (p. 6, sec. 6).

Regarding claim 22, Eigen discloses the further limitation wherein the expert neural networks have the same or similar architectures but different parameter values.
P. 2, sec. 3: “We set each fi1 to a single linear map with rectification”.

Regarding claims 23 and 32, Eigen discloses the further limitation wherein combining the expert outputs generated by the selected expert neural network comprises:
weighting the expert output generated by each of the selected expert neural networks by the weight for the selected expert neural network to generate a weighted expert output, and summing the weighted expert outputs to generate the MoE output.
P. 2, sec. 3, see equations reproduced above in rejection of claim 21.
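For illustration only, the weighting and summing recited in claims 23 and 32 can be sketched as follows. This is a minimal sketch, not a reproduction of the equations in Eigen; the function name and the list-based representation of expert outputs are assumptions made for the example.

```python
def combine_expert_outputs(expert_outputs, weights):
    """Weight each selected expert's output vector and sum the results into one MoE output."""
    dim = len(expert_outputs[0])
    moe_output = [0.0] * dim
    for output, weight in zip(expert_outputs, weights):
        for j in range(dim):
            # weighted expert output, accumulated into the running sum
            moe_output[j] += weight * output[j]
    return moe_output

# Two selected experts with weights 0.25 and 0.75.
print(combine_expert_outputs([[1.0, 2.0], [3.0, 4.0]], [0.25, 0.75]))  # prints [2.5, 3.5]
```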

Regarding claims 24 and 33, Eigen discloses the further limitations wherein the gating subsystem comprises a gating subnetwork, and wherein the gating subnetwork is configured to:
process the first layer output to generate a weight vector that includes a respective weight for each of the plurality of expert neural networks in accordance with a set of gating parameters, and
PP. 2-3, sec. 3: weight vector g1(x).
select one or more of the expert neural networks based on the weights in the weight vector.
P. 2, sec. 3, see equations reproduced above in rejection of claim 21.

Regarding claims 25 and 34, Eigen discloses the further limitations wherein the weight vector is a sparse vector that includes non-zero weights for only a few of the expert neural networks.
P. 6, sec. 6: “The Deep Mixture of Experts model we examine is a promising step toward developing large, sparse models that compute only a subset of themselves for any input.”

Regarding claims 26 and 35, Eigen discloses the further limitation wherein selecting one or more of the expert neural networks comprises:
selecting only expert neural networks that have non-zero weights in the weight vector.
P. 6, sec. 6: “The Deep Mixture of Experts model we examine is a promising step toward developing large, sparse models that compute only a subset of themselves for any input.”

Regarding claims 27 and 36, Eigen discloses the further limitation wherein processing the first layer output to generate a weight vector that includes a respective weight for each of the plurality of expert neural networks in accordance with a set of gating parameters comprises: ...
applying a softmax function to the ... initial gating output to generate the weight vector.
P. 2: softmax function.
Leonard discloses the following further limitation which Eigen does not disclose:
applying a sparsifying function to the initial gating output to generate a sparsified initial gating output.
P. 13, sec. 1.4.2: sparse initialization comprising “initializing only 15 random weights for each output neuron by sampling values from a normal distribution”.
The obviousness analysis of claim 21 applies equally here.

Regarding claims 28 and 37, Leonard discloses the further limitation wherein the sparsifying function sets all values in the initial gating output other than the k highest values to a value that is mapped to zero by the softmax function.
P. 13, sec. 1.4.2: sparse initialization comprising “initializing only 15 random weights for each output neuron by sampling values from a normal distribution”.

Regarding claims 29, 38 and 40, Leonard discloses the further limitation wherein generating the final noise output from the initial noise output and the vector of noise values sampled from a distribution comprises:
element-wise multiplying the initial noise output by a vector of noise values sampled from a normal distribution to generate the final noise output.
P. 61: NoisyReLU implementations: “In our experiments, we found that a σ = 1 and α = 1 worked well and that some noise was often better than no noise”. The Examiner notes that Gaussian noise can be added at every layer of a neural network, e.g. by a noisy rectified linear unit (NoisyReLU), as disclosed by Leonard at pp. 60-61, in order to avoid overfitting.

Regarding claim 30, Eigen discloses the further limitation wherein the gating subsystem comprises a parent gating subnetwork and a plurality of child gating subnetworks, and wherein each of the child gating subnetworks manages a disjoint subset of the plurality of expert neural networks from each other child gating subnetwork.
P. 2, sec. 2: hierarchical mixture of experts “which learns a hierarchy of gating networks in a tree structure”.
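For illustration only, a parent gating subnetwork with child gating subnetworks that manage disjoint subsets of experts, as recited in claim 30, can be sketched as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, and multiplying the parent weight by the child weight to obtain a per-expert weight is an assumption that is recited neither in the claim nor in the cited passage of Eigen.

```python
def hierarchical_weights(parent_weights, child_weights, groups):
    """Return one weight per expert from parent and child gating weights.

    groups[g] lists the expert indices managed by child gating subnetwork g.
    The groups must be disjoint; a repeated expert index raises ValueError.
    """
    combined = {}
    for g, experts in enumerate(groups):
        for local, expert in enumerate(experts):
            if expert in combined:
                raise ValueError(f"expert {expert} is managed by more than one child")
            # assumed combination rule: parent weight times child weight
            combined[expert] = parent_weights[g] * child_weights[g][local]
    return combined

groups = [[0, 1], [2, 3]]                     # two children, two experts each
parent_weights = [0.5, 0.5]                   # weight per child gating subnetwork
child_weights = [[1.0, 0.0], [0.25, 0.75]]    # weight per expert within each child
print(hierarchical_weights(parent_weights, child_weights, groups))
```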

Additional Relevant Prior Art
The following references were identified by the Examiner as being relevant to the disclosed invention, but are not relied upon in any particular prior art rejection:
Anaya (US 10,032,256 B1) discloses a convolutional neural network with a trainable noise parameter. See fig. 4 and col. 6, line 63 et seq.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Vincent Gonzales whose telephone number is (571) 270-3837. The examiner can normally be reached on Monday-Friday 7 a.m. to 4 p.m. MT.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached at (571) 270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/Vincent Gonzales/
Primary Examiner, Art Unit 2124