DETAILED ACTION
This action is in response to claims filed 12 December 2019 for application 16674801 filed 05 November 2019. Currently claims 2-24 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim(s) 2, 3, 8-13 and 15-24 is/are rejected under 35 U.S.C. 103 as being unpatentable over Finn et al. (Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks) in view of Merity et al. (US 20180336453).

Regarding claims 2, 23 and 24, Finn discloses: A computer-implemented method comprising: 
generating, using a controller neural network having a plurality of controller parameters and in accordance with current values of the controller parameters, a batch of output sequences 
[Image: media_image1.png (greyscale)]
(P3 Algorithm 1, note: the hyperparameters are the controller parameters, and the batch of tasks is the batch of output sequences), 
each output sequence in the batch defining an architecture for a first convolutional cell configured to receive a cell input and to generate a cell output (“Our model follows the same architecture as the embedding function used by Vinyals et al. (2016), which has 4 modules with a 3 x 3 convolutions and 64 filters, followed by batch normalization (Ioffe & Szegedy, 2015), a ReLU nonlinearity, and 2 x 2 max-pooling. The Omniglot images are downsampled to 28 x 28, so the dimensionality of the last hidden layer is 64. As in the baseline classifier used by Vinyals et al. (2016), the last layer is fed into a softmax. For Omniglot, we used strided convolutions instead of max-pooling. For MiniImagenet, we used 32 filters per layer to reduce overfitting, as done by (Ravi & Larochelle, 2017).” P6 §5.2 ¶3), and 
the first convolutional cell comprising a sequence of a predetermined number of operation blocks that each receive one or more respective input hidden states and generate a respective output hidden state (P6 §5.2 ¶3, “For N-way, K-shot classification, each gradient is computed using a batch size of NK examples. For Omniglot, the 5-way convolutional and non-convolutional MAML models were each trained with 1 gradient step with step size α = 0.4 and a meta batch-size of 32 tasks. The network was evaluated using 3 gradient steps with the same step size α = 0.4. The 20-way convolutional MAML model was trained and evaluated with 5 gradient steps with step size α = 0.1. During training, the meta batch-size was set to 16 tasks. For MiniImagenet, both models were trained using 5 gradient steps of size α = 0.01, and evaluated using 10 gradient steps at test time. Following Ravi & Larochelle (2017), 15 examples per class were used for evaluating the post-update meta-gradient. We used a meta batch-size of 4 and 2 tasks for 1-shot and 5-shot training respectively. All models were trained for 60000 iterations on a single NVIDIA Pascal Titan X GPU.” P11 §A.1 ¶1); 
for each output sequence in the batch: 
	generating an instance … having the architecture defined by the output sequence (P6 §5.2 ¶3, “Note that the meta-optimization is performed over the model parameters θ, whereas the objective is computed using the updated model parameters θ’. In effect, our proposed method aims to optimize the model parameters such that one or a small number of gradient steps on a new task will produce maximally effective behavior on that task.” P3 ¶5)
	training the instance of the … convolutional neural network to perform an image processing task (“Our model follows the same architecture as the embedding function used by Vinyals et al. (2016), which has 4 modules with a 3 x 3 convolutions and 64 filters, followed by batch normalization (Ioffe & Szegedy, 2015), a ReLU nonlinearity, and 2 x 2 max-pooling. The Omniglot images are downsampled to 28 x 28, so the dimensionality of the last hidden layer is 64. As in the baseline classifier used by Vinyals et al. (2016), the last layer is fed into a softmax. For Omniglot, we used strided convolutions instead of max-pooling. For MiniImagenet, we used 32 filters per layer to reduce overfitting, as done by (Ravi & Larochelle, 2017).” P6 §5.2 ¶3, note: operations performed on Omniglot and MiniImagenet datasets are image processing tasks); and 
evaluating a performance of the trained instance of the … convolutional neural network on the image processing task to determine a performance metric for the trained instance of the … convolutional neural network (“The key idea underlying our method is to train the model’s initial parameters such that the model has maximal performance on a new task after the parameters have been updated through one or more gradient steps computed with a small amount of data from that new task.” P1 §1 ¶2, “Our model follows the same architecture as the embedding function used by Vinyals et al. (2016), which has 4 modules with a 3 x 3 convolutions and 64 filters, followed by batch normalization (Ioffe & Szegedy, 2015), a ReLU nonlinearity, and 2 x 2 max-pooling. The Omniglot images are downsampled to 28 x 28, so the dimensionality of the last hidden layer is 64. As in the baseline classifier used by Vinyals et al. (2016), the last layer is fed into a softmax. For Omniglot, we used strided convolutions instead of max-pooling. For MiniImagenet, we used 32 filters per layer to reduce overfitting, as done by (Ravi & Larochelle, 2017).” P6 §5.2 ¶3, note: operations performed on Omniglot and MiniImagenet datasets are image processing tasks); 
using the performance metrics for the trained instances of the … convolutional neural network to adjust the current values of the controller parameters of the controller neural network (“The key idea underlying our method is to train the model’s initial parameters such that the model has maximal performance on a new task after the parameters have been updated through one or more gradient steps computed with a small amount of data from that new task.” P1 §1 ¶2, “In our meta-learning scenario, we consider a distribution over tasks p(T ) that we want our model to be able to adapt to. In the K-shot learning setting, the model is trained to learn a new task Ti drawn from p(T ) from only K samples drawn from qi and feedback LTi generated by Ti. During meta-training, a task Ti is sampled from p(T ), the model is trained with K samples and feedback from the corresponding loss LTi from Ti, and then tested on new samples from Ti.” P2 §2.1 ¶3).
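For illustration only, the meta-training loop of Finn's Algorithm 1, as relied upon above, can be sketched as follows. This is a minimal sketch and not Finn's implementation: the scalar parameter theta, the quadratic per-task losses, and the names inner_update, meta_gradient, maml_step and meta_train are assumptions introduced here. The step sizes alpha and beta correspond to the hyperparameters α and β of Algorithm 1, with illustrative values.

```python
# Toy sketch of a MAML-style meta-update.
# Assumptions (not from Finn): scalar parameter theta and per-task loss
# L_i(theta) = (theta - t_i)**2, chosen so the gradients are exact by hand.

def inner_update(theta, target, alpha):
    """One gradient step on a single task: theta' = theta - alpha * dL_i/dtheta."""
    grad = 2.0 * (theta - target)
    return theta - alpha * grad


def meta_gradient(theta, target, alpha):
    """Gradient of L_i(theta') with respect to theta, through the inner step."""
    theta_prime = inner_update(theta, target, alpha)
    # Chain rule: dL_i/dtheta' times dtheta'/dtheta, where dtheta'/dtheta = 1 - 2*alpha.
    return 2.0 * (theta_prime - target) * (1.0 - 2.0 * alpha)


def maml_step(theta, targets, alpha, beta):
    """One meta-update: theta <- theta - beta * sum_i d L_i(theta_i') / d theta."""
    total = sum(meta_gradient(theta, t, alpha) for t in targets)
    return theta - beta * total


def meta_train(theta, targets, alpha=0.1, beta=0.1, steps=100):
    """Repeat the meta-update over the same batch of toy tasks."""
    for _ in range(steps):
        theta = maml_step(theta, targets, alpha, beta)
    return theta
```

In this toy setting the objective is evaluated at the updated parameter theta' while the update is applied to the initial parameter theta, which mirrors the quoted statement that the meta-optimization is performed over θ whereas the objective is computed using θ’.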

However, Finn does not explicitly disclose, but Merity teaches: an instance of a child convolutional neural network that includes multiple instances of the first convolutional cell (“A child of the parent data model may be defined by a search (typically a narrower search) that produces a subset of the events that would be produced by the parent data model's search. The child's set of fields can include a subset of the set of fields of the parent data model and/or additional fields. Data model objects that reference the subsets can be arranged in a hierarchical manner, so that child subsets of events are proper subsets of their parents. A user iteratively applies a model development tool (not shown in Fig.) to prepare a query that defines a subset of events and assigns an object name to that subset. A child subset is created by further limiting a query that generated a parent subset. A late-binding schema of field extraction rules is associated with each object or subset in the data model.” [0127]); and
generating a final architecture for the first convolutional cell using the controller neural network in accordance with the adjusted values of the controller parameters (“A method comprising: generating a plurality of candidate recurrent neural network (RNN) architectures, wherein each candidate RNN architecture is represented using a domain specific language (DSL), wherein the DSL supports a plurality of operators, wherein the representation of a particular candidate RNN architecture comprises one or more operators of the DSL; for each of the plurality of candidate RNN architectures, performing: providing an encoding of the candidate RNN architecture as input to an architecture ranking neural network configured to determine a score for the candidate RNN architecture, the score representing a performance of the candidate RNN architecture for a given particular type of task; executing the ranking neural network to generate a score indicating the performance of the candidate RNN architecture; selecting a candidate RNN architecture based on the scores of each of the plurality of candidate RNN architectures; compiling the selected candidate architecture to generate code representing a target RNN; and executing the code representing the target RNN.” Claim 1, note: the ranking neural network is a controller network, and the RNN architectures are the values of the controller parameters).

Finn and Merity are in the same field of endeavor, neural architecture search, and are analogous art. Finn discloses architecture search for a convolutional neural network using a controller network. Merity teaches a final architecture of a recurrent neural network ranked using a controller neural network to create a child network. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the architecture searches of Finn and Merity. One would have been motivated to make this combination because Merity states that its invention can be used to find the highest-ranked final architecture of a child neural network [0037].
Regarding claim 3, Finn discloses: The method of claim 2, wherein, after the training, different instances of the first convolutional cell in the trained instance of the child convolutional neural network have different parameter values (Algorithm 1, “The step size may be fixed as a hyperparameter or metalearned. For simplicity of notation, we will consider one gradient update for the rest of this section, but using multiple gradient updates is a straightforward extension.” P3 ¶2).

Regarding claim 8, Finn discloses: The method of claim 2, wherein using the performance metrics for the trained instances of the child convolutional neural network to adjust the current values of the controller parameters of the controller neural network comprises: training the controller neural network to generate output sequences that result in child convolutional neural networks having increased performance metrics using a machine learning training technique (“The key idea underlying our method is to train the model’s initial parameters such that the model has maximal performance on a new task after the parameters have been updated through one or more gradient steps computed with a small amount of data from that new task.” P1 §1 ¶2, see also Algorithm 1 and P6 §5.2 ¶3).

Regarding claim 9, Finn discloses: The method of claim 8, wherein the training technique is a policy gradient technique (“The policy was trained with MAML to maximize performance after 1 policy gradient update using 20 trajectories.” P8 §2D navigation ¶1).

Regarding claim 10, Finn discloses: The method of claim 8, wherein the training technique is a REINFORCE technique (“We discuss the individual domains below. In all of the domains, the model trained by MAML is a neural network policy with two hidden layers of size 100, with ReLU nonlinearities. The gradient updates are computed using vanilla policy gradient (REINFORCE) (Williams, 1992), and we use trust-region policy optimization (TRPO) as the meta-optimizer (Schulman et al., 2015).” P7 §5.3 ¶1).

Regarding claim 11, Finn discloses: The method of claim 8, wherein the training technique is Proximal Policy Optimization (PPO) technique (“We discuss the individual domains below. In all of the domains, the model trained by MAML is a neural network policy with two hidden layers of size 100, with ReLU nonlinearities. The gradient updates are computed using vanilla policy gradient (REINFORCE) (Williams, 1992), and we use trust-region policy optimization (TRPO) as the meta-optimizer (Schulman et al., 2015).” P7 §5.3 ¶1).

Regarding claim 12, Finn discloses: The method of claim 8, further comprising: using at least one of the … convolutional networks having increased performance metrics to perform the image processing task (“Our model follows the same architecture as the embedding function used by Vinyals et al. (2016), which has 4 modules with a 3 x 3 convolutions and 64 filters, followed by batch normalization (Ioffe & Szegedy, 2015), a ReLU nonlinearity, and 2 x 2 max-pooling. The Omniglot images are downsampled to 28 x 28, so the dimensionality of the last hidden layer is 64. As in the baseline classifier used by Vinyals et al. (2016), the last layer is fed into a softmax. For Omniglot, we used strided convolutions instead of max-pooling. For MiniImagenet, we used 32 filters per layer to reduce overfitting, as done by (Ravi & Larochelle, 2017).” P6 §5.2 ¶3, note: operations performed on Omniglot and MiniImagenet datasets are image processing tasks)

Finn does not explicitly disclose: child networks.

Merity teaches: child networks (“A child of the parent data model may be defined by a search (typically a narrower search) that produces a subset of the events that would be produced by the parent data model's search.  The child's set of fields can include a subset of the set of fields of the parent data model and/or additional fields.  Data model objects that reference the subsets can be arranged in a hierarchical manner, so that child subsets of events are proper subsets of their parents.  A user iteratively applies a model development tool (not shown in Fig.) to prepare a query that defines a subset of events and assigns an object name to that subset.  A child subset is created by further limiting a query that generated a parent subset.  A late-binding schema of field extraction rules is associated with each object or subset in the data model.” [0127]).

Regarding claim 13, Finn discloses: The method of claim 2, wherein each output sequence comprises a value for a respective hyperparameter of the first convolutional cell at each of a plurality of time steps (Algorithm 1, “For N-way, K-shot classification, each gradient is computed using a batch size of NK examples. For Omniglot, the 5-way convolutional and non-convolutional MAML models were each trained with 1 gradient step with step size α = 0.4 and a meta batch-size of 32 tasks. The network was evaluated using 3 gradient steps with the same step size α = 0.4. The 20-way convolutional MAML model was trained and evaluated with 5 gradient steps with step size α = 0.1. During training, the meta batch-size was set to 16 tasks. For MiniImagenet, both models were trained using 5 gradient steps of size α = 0.01, and evaluated using 10 gradient steps at test time. Following Ravi & Larochelle (2017), 15 examples per class were used for evaluating the post-update meta-gradient. We used a meta batch-size of 4 and 2 tasks for 1-shot and 5-shot training respectively. All models were trained for 60000 iterations on a single NVIDIA Pascal Titan X GPU.” P11 §A.1 ¶1, “The step size may be fixed as a hyperparameter or metalearned. For simplicity of notation, we will consider one gradient update for the rest of this section, but using multiple gradient updates is a straightforward extension.” P3 ¶2, note: α and β are hyperparameters which have values for each of a plurality of time steps).

Regarding claim 15, Finn discloses: The method of claim 13, wherein generating, using a controller neural network having a plurality of controller parameters and in accordance with current values of the controller parameters, a batch of output sequences, comprises, for each output sequence in the batch and for each of the plurality of time steps: providing as input to the controller neural network the value of the hyperparameter at the preceding time step in the output sequence to generate an output for the time step that defines a score distribution over possible values of the hyperparameter at the time step; and sampling from the possible values in accordance with the score distribution to determine the value of the hyperparameter at the time step in the output sequence (Algorithm 1, “The step size may be fixed as a hyperparameter or metalearned. For simplicity of notation, we will consider one gradient update for the rest of this section, but using multiple gradient updates is a straightforward extension.” P3 ¶2).

Regarding claim 16, Finn discloses: The method of claim 2, wherein a number of filters of convolutional operations within the instances of the first convolutional cell differs based on a position of the instances within the child convolutional neural network (“Our model follows the same architecture as the embedding function used by Vinyals et al. (2016), which has 4 modules with a 3 x 3 convolutions and 64 filters, followed by batch normalization (Ioffe & Szegedy, 2015), a ReLU nonlinearity, and 2 x 2 max-pooling. The Omniglot images are downsampled to 28 x 28, so the dimensionality of the last hidden layer is 64. As in the baseline classifier used by Vinyals et al. (2016), the last layer is fed into a softmax. For Omniglot, we used strided convolutions instead of max-pooling. For MiniImagenet, we used 32 filters per layer to reduce overfitting, as done by (Ravi & Larochelle, 2017).” P6 §5.2 ¶3).

Regarding claim 17, Finn discloses: The method of claim 2, wherein the cell output of the first convolutional cell has a same height and width as the cell input (“Our model follows the same architecture as the embedding function used by Vinyals et al. (2016), which has 4 modules with a 3 x 3 convolutions and 64 filters, followed by batch normalization (Ioffe & Szegedy, 2015), a ReLU nonlinearity, and 2 x 2 max-pooling. The Omniglot images are downsampled to 28 x 28, so the dimensionality of the last hidden layer is 64. As in the baseline classifier used by Vinyals et al. (2016), the last layer is fed into a softmax. For Omniglot, we used strided convolutions instead of max-pooling. For MiniImagenet, we used 32 filters per layer to reduce overfitting, as done by (Ravi & Larochelle, 2017).” P6 §5.2 ¶3).

Regarding claim 18, Finn discloses: The method of claim 16, wherein each output sequence in the batch further defines an architecture for a second convolutional cell configured to receive a second cell input and to generate a second cell output having a smaller height, a smaller width, or both from the second cell input, and wherein the instance of a child convolutional neural network for each output sequence also includes multiple instances of the second convolutional cell having the architecture defined by the output sequence (“Our model follows the same architecture as the embedding function used by Vinyals et al. (2016), which has 4 modules with a 3 x 3 convolutions and 64 filters, followed by batch normalization (Ioffe & Szegedy, 2015), a ReLU nonlinearity, and 2 x 2 max-pooling. The Omniglot images are downsampled to 28 x 28, so the dimensionality of the last hidden layer is 64. As in the baseline classifier used by Vinyals et al. (2016), the last layer is fed into a softmax. For Omniglot, we used strided convolutions instead of max-pooling. For MiniImagenet, we used 32 filters per layer to reduce overfitting, as done by (Ravi & Larochelle, 2017).” P6 §5.2 ¶3).
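The spatial sizes implied by the quoted P6 §5.2 ¶3 passage, which bear on the height and width limitations of claims 17 and 18, can be checked with a short calculation. This is a sketch under stated assumptions, not a reproduction of Finn's code: it assumes each 3 x 3 convolution uses stride 1 with one pixel of zero padding (so height and width are preserved) and each 2 x 2 max-pooling uses stride 2 with floor rounding. The function names conv_out, pool_out, module_out and trace are hypothetical.

```python
# Spatial-size arithmetic for a stack of conv + pool modules.
# Assumptions (not stated in the quoted passage): 3x3 convolution with
# stride 1 and padding 1; 2x2 max-pooling with stride 2, floor rounding.

def conv_out(size, kernel=3, stride=1, padding=1):
    """Output height/width of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output height/width of a max-pooling layer along one spatial dimension."""
    return (size - window) // stride + 1


def module_out(size):
    """One module: size-preserving convolution followed by 2x2 pooling."""
    return pool_out(conv_out(size))


def trace(size, modules=4):
    """Spatial size at the input and after each of the modules."""
    sizes = [size]
    for _ in range(modules):
        size = module_out(size)
        sizes.append(size)
    return sizes
```

Under these assumptions, trace(28) yields 28, 14, 7, 3, 1: the convolution alone leaves height and width unchanged, each pooling stage reduces them, and the final 1 x 1 map with 64 filters has 64 values, consistent with the quoted statement that the dimensionality of the last hidden layer is 64.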
Regarding claim 19, Finn discloses: The method of claim 2, wherein training each instance of the child convolutional neural network comprises training each instance until a particular amount of time has elapsed (“The reward is the negative squared distance to the goal, and episodes terminate when the agent is within 0.01 of the goal or at the horizon of H = 100.” P8 §2D navigation, note: horizon is a number of time steps being used as a termination condition, “We used a meta batch-size of 4 and 2 tasks for 1-shot and 5-shot training respectively. All models were trained for 60000 iterations on a single NVIDIA Pascal Titan X GPU.” P11 §A.1 ¶1).

Regarding claim 20, Finn discloses: The method of claim 2, further comprising: generating a computationally-efficient architecture of a convolutional neural network that includes fewer instances of the first convolutional cell than the child convolutional neural network instances, wherein the instances of the convolutional cell have the generated final architecture (“Our model follows the same architecture as the embedding function used by Vinyals et al. (2016), which has 4 modules with a 3 x 3 convolutions and 64 filters, followed by batch normalization (Ioffe & Szegedy, 2015), a ReLU nonlinearity, and 2 x 2 max-pooling. The Omniglot images are downsampled to 28 x 28, so the dimensionality of the last hidden layer is 64. As in the baseline classifier used by Vinyals et al. (2016), the last layer is fed into a softmax. For Omniglot, we used strided convolutions instead of max-pooling. For MiniImagenet, we used 32 filters per layer to reduce overfitting, as done by (Ravi & Larochelle, 2017).” P6 §5.2 ¶3, “The policy was trained with MAML to maximize performance after 1 policy gradient update using 20 trajectories. Additional hyperparameter settings for this problem and the following RL problems are in Appendix A.2. In our evaluation, we compare adaptation to a new task with up to 4 gradient updates, each with 40 samples. The results in Figure 4 show the adaptation performance of models that are initialized with MAML, conventional pretraining on the same set of tasks, random initialization, and an oracle policy that receives the goal position as input. The results show that MAML can learn a model that adapts much more quickly in a single gradient update, and furthermore continues to improve with additional updates.” P8 §2D navigation ¶1).

Regarding claim 21, Finn discloses: The method of claim 2, further comprising: generating a larger architecture of a convolutional neural network that includes more instances of the first convolutional cell than the child convolutional neural network instances for use in a more complex image processing task, wherein the instances of the first convolutional cell have the generated final architecture (“Our model follows the same architecture as the embedding function used by Vinyals et al. (2016), which has 4 modules with a 3 x 3 convolutions and 64 filters, followed by batch normalization (Ioffe & Szegedy, 2015), a ReLU nonlinearity, and 2 x 2 max-pooling. The Omniglot images are downsampled to 28 x 28, so the dimensionality of the last hidden layer is 64. As in the baseline classifier used by Vinyals et al. (2016), the last layer is fed into a softmax. For Omniglot, we used strided convolutions instead of max-pooling. For MiniImagenet, we used 32 filters per layer to reduce overfitting, as done by (Ravi & Larochelle, 2017).” P6 §5.2 ¶3, “The policy was trained with MAML to maximize performance after 1 policy gradient update using 20 trajectories. Additional hyperparameter settings for this problem and the following RL problems are in Appendix A.2. In our evaluation, we compare adaptation to a new task with up to 4 gradient updates, each with 40 samples. The results in Figure 4 show the adaptation performance of models that are initialized with MAML, conventional pretraining on the same set of tasks, random initialization, and an oracle policy that receives the goal position as input. The results show that MAML can learn a model that adapts much more quickly in a single gradient update, and furthermore continues to improve with additional updates.” P8 §2D navigation ¶1).

Regarding claim 22, Finn discloses: The method of claim 21, further comprising: performing the more complex image processing task using the convolutional neural network that includes more instances of the first convolutional cell than the child convolutional neural network instances  (“Our model follows the same architecture as the embedding function used by Vinyals et al. (2016), which has 4 modules with a 3 x 3 convolutions and 64 filters, followed by batch normalization (Ioffe & Szegedy, 2015), a ReLU nonlinearity, and 2 x 2 max-pooling. The Omniglot images are downsampled to 28 x 28, so the dimensionality of the last hidden layer is 64. As in the baseline classifier used by Vinyals et al. (2016), the last layer is fed into a softmax. For Omniglot, we used strided convolutions instead of max-pooling. For MiniImagenet, we used 32 filters per layer to reduce overfitting, as done by (Ravi & Larochelle, 2017).” P6 §5.2 ¶3, “The policy was trained with MAML to maximize performance after 1 policy gradient update using 20 trajectories. Additional hyperparameter settings for this problem and the following RL problems are in Appendix A.2. In our evaluation, we compare adaptation to a new task with up to 4 gradient updates, each with 40 samples. The results in Figure 4 show the adaptation performance of models that are initialized with MAML, conventional pretraining on the same set of tasks, random initialization, and an oracle policy that receives the goal position as input. The results show that MAML can learn a model that adapts much more quickly in a single gradient update, and furthermore continues to improve with additional updates.” P8 §2D navigation ¶1).

Claim(s) 4 and 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Finn et al. (Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks) in view of Merity et al. (US 20180336453) and further in view of Bergstra et al. (Random Search for Hyper-Parameter Optimization).

Regarding claim 4, Finn does not explicitly disclose: The method of claim 2, wherein each operation block in the first convolutional cell is configured to: apply a first operation to a first input hidden state to generate a first output; apply a second operation to a second input hidden state to generate a second output; and apply a combining operation to the first and second outputs to generate an output hidden state.

Bergstra teaches: The method of claim 2, wherein each operation block in the first convolutional cell is configured to: apply a first operation to a first input hidden state to generate a first output; apply a second operation to a second input hidden state to generate a second output; and apply a combining operation to the first and second outputs to generate an output hidden state (“The network takes as input the optimizee gradient for a single coordinate as well as the previous hidden state and outputs the update for the corresponding optimizee parameter. We will refer to this architecture, illustrated in Figure 3, as an LSTM optimizer.” P4 §2.1 ¶3).

Finn, Merity and Bergstra are in the same field of endeavor, optimizing neural network parameters, and are analogous art. Finn discloses using instances of convolutional neural networks on a plurality of tasks to update parameters by stochastic gradient descent, and Bergstra discloses stochastic gradient descent that combines two states to produce an output hidden state. It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the learning algorithms of Finn and Merity with the learning of Bergstra to yield predictable results.

Regarding claim 14, Finn does not explicitly disclose: The method of claim 12, wherein the controller neural network is a recurrent neural network that comprises: 
one or more recurrent neural network layers that are configured to, for a given output sequence and at each time step: 
receive as input the value of hyperparameter at the preceding time step in the given output sequence, and to process the input to update a current hidden state of the recurrent neural network;
a respective output layer for each time step, wherein each output layer is configured to, for the given output sequence: receive an output layer input comprising the updated hidden state at the time step and to generate an output for the time step that defines a score distribution over possible values of the hyperparameter at the time step.

Bergstra teaches: The method of claim 12, wherein the controller neural network is a recurrent neural network that comprises: 
one or more recurrent neural network layers that are configured to, for a given output sequence and at each time step: 
receive as input the value of hyperparameter at the preceding time step in the given output sequence, and to process the input to update a current hidden state of the recurrent neural network (“The network takes as input the optimizee gradient for a single coordinate as well as the previous hidden state and outputs the update for the corresponding optimizee parameter. We will refer to this architecture, illustrated in Figure 3, as an LSTM optimizer.” P4 §2.1 ¶3); and 
a respective output layer for each time step, wherein each output layer is configured to, for the given output sequence: receive an output layer input comprising the updated hidden state at the time step and to generate an output for the time step that defines a score distribution over possible values of the hyperparameter at the time step (“The network takes as input the optimizee gradient for a single coordinate as well as the previous hidden state and outputs the update for the corresponding optimizee parameter. We will refer to this architecture, illustrated in Figure 3, as an LSTM optimizer.” P4 §2.1 ¶3).
Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 2-24 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-20 of U.S. Patent No. 10,521,729. Although the claims at issue are not identical, they are not patentably distinct from each other because:
Instant Application, Claim 2:
A computer-implemented method comprising:
generating, using a controller neural network having a plurality of controller parameters and in accordance with current values of the controller parameters, a batch of output sequences,
each output sequence in the batch defining an architecture for a first convolutional cell configured to receive a cell input and to generate a cell output, and
the first convolutional cell comprising a sequence of a predetermined number of operation blocks that each receive one or more respective input hidden states and generate a respective output hidden state;
for each output sequence in the batch:
generating an instance of a child convolutional neural network that includes multiple instances of the first convolutional cell having the architecture defined by the output sequence;
training the instance of the child convolutional neural network to perform an image processing task; and
evaluating a performance of the trained instance of the child convolutional neural network on the image processing task to determine a performance metric for the trained instance of the child convolutional neural network;
using the performance metrics for the trained instances of the child convolutional neural network to adjust the current values of the controller parameters of the controller neural network; and
generating a final architecture for the first convolutional cell using the controller neural network in accordance with the adjusted values of the controller parameters.

US Patent 10,521,729, Claim 1:
A computer-implemented method comprising:
generating, using a controller neural network having a plurality of controller parameters and in accordance with current values of the controller parameters, a batch of output sequences,
each output sequence in the batch defining an architecture for a first convolutional cell configured to receive a cell input and to generate a cell output, and
the first convolutional cell comprising a sequence of a predetermined number of operation blocks that each receive one or more respective input hidden states and generate a respective output hidden state, wherein each output sequence in the batch defines, for each of the operation blocks:
a source for a first input hidden state for the operation block selected from one or more of: (i) outputs generated by one or more other components of the child convolutional neural network, (ii) an input image, or (iii) output hidden states of preceding operation blocks in the sequence of operation blocks within the first convolutional cell,
a source for a second input hidden state for the operation block selected from one or more of: (i) outputs generated by one or more preceding convolutional cells in the sequence of convolutional cells, (ii) the input image, or (iii) output hidden states of preceding operation blocks in the sequence of operation blocks within the convolutional cell,
an operation type for a first operation selected from a predetermined set of convolutional neural network operations, and
an operation type for a second operation selected from the predetermined set of convolutional neural network operations; and
for each output sequence in the batch:
generating an instance of a child convolutional neural network that includes multiple instances of the first convolutional cell having the architecture defined by the output sequence;
training the instance of the child convolutional neural network to perform an image processing task; and
evaluating a performance of the trained instance of the child convolutional neural network on the image processing task to determine a performance metric for the trained instance of the child convolutional neural network;
using the performance metrics for the trained instances of the child convolutional neural network to adjust the current values of the controller parameters of the controller neural network; and
generating a final architecture for the first convolutional cell using the controller neural network in accordance with the adjusted values of the controller parameters.
Claim 2 of the instant application is fully anticipated by claim 1 of the ’729 patent.
Claims 23 and 24 are rejected under the same reasoning as claim 2.
Claims 3-22 of the instant application are fully anticipated by claims 2-20 of the ’729 patent.
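For illustration, the method steps shared by the compared claims (generate a batch of output sequences, build and evaluate a child network per sequence, adjust the controller parameters, generate a final architecture) can be sketched as below. This is a toy sketch under stated assumptions: the controller is reduced to independent logits per decision instead of a neural network, training and evaluation of the child network are replaced by a hypothetical scoring stub, the adjustment uses a simple policy-gradient style update that neither claim requires, and input-source selection is omitted. The operation names, block count, and all function names are hypothetical.

```python
import math
import random

OPS = ["conv3x3", "conv5x5", "maxpool3x3", "identity"]  # assumed operation set
NUM_BLOCKS = 3  # assumed predetermined number of operation blocks


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def new_controller():
    """Controller parameters: one logit vector per decision (two operations per block)."""
    return [[0.0] * len(OPS) for _ in range(2 * NUM_BLOCKS)]


def sample_sequence(params, rng):
    """Generate one output sequence in accordance with the current parameter values."""
    return [rng.choices(range(len(OPS)), weights=softmax(logits))[0] for logits in params]


def train_and_evaluate(sequence):
    """Hypothetical stand-in for building, training, and evaluating a child network.

    Returns the fraction of decisions that picked operation index 0.
    """
    return sum(1 for choice in sequence if choice == 0) / len(sequence)


def adjust(params, batch, metrics, lr=0.5):
    """Adjust controller parameters using the performance metrics of the batch.

    Updates are applied sequence by sequence, using the batch mean as a baseline.
    """
    baseline = sum(metrics) / len(metrics)
    for sequence, metric in zip(batch, metrics):
        advantage = metric - baseline
        for logits, choice in zip(params, sequence):
            probs = softmax(logits)
            for k in range(len(logits)):
                grad = (1.0 if k == choice else 0.0) - probs[k]
                logits[k] += lr * advantage * grad


def search(iterations=200, batch_size=8, seed=0):
    rng = random.Random(seed)
    params = new_controller()
    for _ in range(iterations):
        batch = [sample_sequence(params, rng) for _ in range(batch_size)]
        metrics = [train_and_evaluate(sequence) for sequence in batch]
        adjust(params, batch, metrics)
    # Final architecture: the highest-scoring operation for each decision.
    return [OPS[max(range(len(OPS)), key=lambda k: logits[k])] for logits in params]
```

`search()` returns a list of `2 * NUM_BLOCKS` operation names, one per decision, chosen in accordance with the adjusted parameter values.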
Allowable Subject Matter
Claims 5-7 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERIC NILSSON whose telephone number is (571)272-5246. The examiner can normally be reached M-F: 7-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki, can be reached at (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ERIC NILSSON/           Primary Examiner, Art Unit 2122