DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The present application was filed on January 30, 2018.
This office action is in response to amendments and/or Remarks filed on September 22, 2022. In the current amendment, claims 1, 2, 5, 9, 10, 13, 17, 18, and 20 are amended. Claims 7 and 15 are cancelled. Claims 1-6, 8-14, and 16-20 are pending. 

Drawings
The drawings filed on January 30, 2018 are accepted.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-6, 8-14, and 16-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Almahairi et al. (“Dynamic Capacity Networks”) in view of Liu et al. (“Dynamic Deep Neural Networks: Optimizing Accuracy-Efficiency Trade-offs by Selective Execution”).

Regarding Claim 1, 
Almahairi teaches: 
A method performed by one or more computers, the method comprising: (Page 4: “To validate the effectiveness of our approach, we first investigate the Cluttered MNIST dataset (Mnih et al., 2014). We then apply our model in a transfer learning setting to a real-world object recognition task using the Street View House Numbers (SVHN) dataset (Netzer et al., 2011).” teaches a computer based implementation that uses the MNIST dataset and SVHN dataset) 

receiving a network input for processing by a task neural network, the task neural network comprising a plurality of neural network layers; (Fig. 1 and Page 2: “We consider a deep neural network h, which we decompose into two parts: h(x) = g(f(x)) where f and g represent respectively the bottom layers and top layers of the network h while x is some input data.” teaches receiving input data x for a deep neural network (task neural network) that comprises a plurality of layers)

[Figure: media_image1.png]



and processing the network input using the task neural network in accordance with [a] usage input… to generate a network output for the network input, comprising: (Fig. 1 and Page 2: “For example, f can output a feature map, i.e. vectors of features each with a specific spatial location, or a probability map outputting probability distributions at each different spatial location. Top layers g consider as input the bottom layers’ representations f(x) and output a distribution over labels.” teaches processing the input using the deep neural network to output a distribution over labels)

selecting, based at least on [a] usage input…, a proper subset of the plurality of neural network layers to be active while processing the network input, comprising: (Page 2: “DCN introduces the use of two alternative sub-networks for the bottom layers f: the coarse layers fc or the fine layers ff , which differ in their capacity. The fine layers correspond to a high-capacity sub-network which has a high-computational requirement, while the coarse layers constitute a low-capacity sub-network. Consider applying the top layers only on the fine representation, i.e. hf (x) = g(ff (x)). We refer to the composition hf = g ◦ ff as the fine model. We assume that the fine model can achieve very good performance, but is computationally expensive. Alternatively, consider applying the top layers only on the coarse representation, i.e. hc(x) = g(fc(x)). We refer to this composition hc = g ◦ fc as the coarse model. Conceptually, the coarse model can be much more computationally efficient, but is expected to have worse performance than the fine model. The key idea behind DCN is to have g use representations from either the coarse or fine layers in an adaptive, dynamic way. Specifically, we apply the coarse layers fc on the whole input x, and leverage the fine layers ff only at a few “important” input regions. This way, the DCN can leverage the capacity of ff , but at a lower computational cost, by applying the fine layers only on a small portion of the input” teaches selectively applying the high-capacity sub-network or low-capacity sub-network (which contains fine layers and coarse layers respectively))

processing the network input using only the selected neural network layers. (Page 2: “DCN introduces the use of two alternative sub-networks for the bottom layers f: the coarse layers fc or the fine layers ff , which differ in their capacity. The fine layers correspond to a high-capacity sub-network which has a high-computational requirement, while the coarse layers constitute a low-capacity sub-network. Consider applying the top layers only on the fine representation, i.e. hf (x) = g(ff (x)). We refer to the composition hf = g ◦ ff as the fine model. We assume that the fine model can achieve very good performance, but is computationally expensive. Alternatively, consider applying the top layers only on the coarse representation, i.e. hc(x) = g(fc(x)). We refer to this composition hc = g ◦ fc as the coarse model. Conceptually, the coarse model can be much more computationally efficient, but is expected to have worse performance than the fine model. The key idea behind DCN is to have g use representations from either the coarse or fine layers in an adaptive, dynamic way. Specifically, we apply the coarse layers fc on the whole input x, and leverage the fine layers ff only at a few “important” input regions. This way, the DCN can leverage the capacity of ff , but at a lower computational cost, by applying the fine layers only on a small portion of the input” teaches processing portions of the input to the dynamic capacity network by applying the selected fine layers of the high-capacity sub-network or the coarse layers of the low-capacity sub-network)

each subnetwork comprising one or more neural network layers; and (Page 2: “DCN introduces the use of two alternative sub-networks for the bottom layers f: the coarse layers fc or the fine layers ff , which differ in their capacity. The fine layers correspond to a high-capacity sub-network which has a high-computational requirement, while the coarse layers constitute a low-capacity sub-network.” teaches that both the high-capacity sub-network and the low-capacity sub-network have a plurality of layers)
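
For illustration only, the arrangement quoted from Almahairi above (coarse layers fc applied to the whole input, fine layers ff applied only to a few "important" regions, and top layers g applied to the result) can be sketched in Python as follows. This sketch is not code from Almahairi. The function name dcn_forward, the representation of the input as a list of patches, the caller-supplied list of important indices, and the replacement of coarse representations by fine ones are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

def dcn_forward(
    patches: Sequence[float],
    important: Sequence[int],
    f_coarse: Callable[[float], float],
    f_fine: Callable[[float], float],
    g: Callable[[List[float]], float],
) -> float:
    """Hypothetical sketch of h(x) = g(f(x)) with two alternative bottom layers.

    patches   -- the input x, split into regions (assumed representation)
    important -- indices of the regions that receive the fine layers (assumed given)
    """
    # Coarse layers run on every region of the input.
    representations = [f_coarse(p) for p in patches]
    # Fine layers run only on the few selected regions (assumption: the fine
    # representation replaces the coarse one at those positions).
    for i in important:
        representations[i] = f_fine(patches[i])
    # Top layers consume the combined representation.
    return g(representations)
```

Under these assumptions, calling dcn_forward with an empty list of important indices reduces to the coarse model g(fc(x)), and marking every region as important reduces to the fine model g(ff(x)).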

Almahairi does not appear to explicitly teach: 
receiving a usage input that is different from the network input and that specifies a respective weight for each of one or more usage factors, wherein each usage factor impacts how many computational resources are used… 
generating, using a controller neural network that is conditioned on the usage input different from the network input, a respective score for each subnetwork of a plurality of subnetworks of the task neural network, 
selecting a subnetwork from the plurality of subnetworks of the task neural network using the respective scores; and 

However, Liu teaches: 
receiving a usage input that is different from the network input and that specifies a respective weight for each of one or more usage factors, wherein each usage factor impacts how many computational resources are used… (Page 6: “During training we define the Q-learning reward as a linear combination of accuracy A and efficiency E (negative cost): r = λA + (1 − λ)E where λ ∈ [0, 1]. We train instances of high-low capacity D2NNs using different λ’s. As λ increases, the learned D2NN trades off efficiency for accuracy. Fig. 2a) plots the accuracy-cost curve on the test set; it also plots the accuracy and efficiency achieved by a conventional DNN with only the high capacity path N1+N2 (High NN) and a conventional DNN with only the low capacity path N1+N3 (Low NN). As we can see, the D2NN achieves a trade-off curve close to the upperbound: there are points on the curve that are as fast as the low-capacity node and as accurate as the high-capacity node. Fig. 4(left) plots the distribution of examples going through different execution paths. It shows that as λ increases, accuracy becomes more important and more examples go through the high-capacity node.” teaches receiving a Q-learning reward that is a combination of accuracy and efficiency and includes a weight λ (usage input specifying a weight for a usage factor, different from the network input), and that as the weight λ increases, accuracy becomes more important, which places greater emphasis on high-capacity nodes and increases computational usage; Page 6: “We measure computational cost using the number of multiplications following prior work [2, 27] and for reproductivity. Specifically, we use the number of multiplications (control nodes included), normalized by a conventional DNN consisting of N1 and N2, that is, the high-capacity execution path.” teaches that the number of multiplications is used for quantifying computational cost)

generating, using a controller neural network that is conditioned on the usage input different from the network input, a respective score for each subnetwork of a plurality of subnetworks of the task neural network, (Page 3, Section 3: “Given a D2NN, we perform inference by traversing the graph starting from the input nodes. Because a D2NN is a DAG, we can execute each node in a topological order (the parents of a node are ordered before it; we take both data edges and control edges in consideration), same as conventional DNNs except that the control nodes can cause the computation of some nodes to be skipped. After we execute a control node, it outputs a set of control scores, one for each of its outgoing control edges. The control edge with the highest score is “activated”, meaning that the node being controlled is allowed to execute. The rest of the control edges are not activated, and their controllees are not allowed to execute. For example, in Fig 1 (right), the node Q controls N2 and N3. Either N2 or N3 will execute depending on which has the higher control score.” teaches using the D2NN (controller neural network) which is conditioned on λ to generate scores for nodes of the D2NN)

selecting a subnetwork from the plurality of subnetworks of the task neural network using the respective scores; and (Page 3, Section 3: “Given a D2NN, we perform inference by traversing the graph starting from the input nodes. Because a D2NN is a DAG, we can execute each node in a topological order (the parents of a node are ordered before it; we take both data edges and control edges in consideration), same as conventional DNNs except that the control nodes can cause the computation of some nodes to be skipped. After we execute a control node, it outputs a set of control scores, one for each of its outgoing control edges. The control edge with the highest score is “activated”, meaning that the node being controlled is allowed to execute. The rest of the control edges are not activated, and their controllees are not allowed to execute. For example, in Fig 1 (right), the node Q controls N2 and N3. Either N2 or N3 will execute depending on which has the higher control score.” teaches selecting the node with the highest score to activate among the plurality of nodes)

Almahairi and Liu are analogous art because they are directed to conditional computation using neural networks.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Liu’s Q-learning reward for reinforcement learning and dynamic deep neural network to train the DCN of Almahairi and select subnetworks using scores with a motivation to maximize a combination of accuracy and efficiency. (Liu, Page 4)
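
For illustration only, the two mechanisms quoted from Liu above, namely the reward r = λA + (1 − λ)E and the activation of the control edge with the highest control score, can be sketched in Python as follows. This sketch is not code from Liu. The function names reward and select_controlled_node, and the representation of control scores as a dictionary keyed by node name, are hypothetical.

```python
from typing import Dict

def reward(accuracy: float, efficiency: float, lam: float) -> float:
    """Linear combination r = lam * A + (1 - lam) * E, with lam in [0, 1].

    efficiency is the negative cost, as in the quoted passage.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * accuracy + (1.0 - lam) * efficiency

def select_controlled_node(control_scores: Dict[str, float]) -> str:
    """Return the node whose control edge has the highest score.

    Only that node is allowed to execute; the other controlled nodes are skipped.
    """
    return max(control_scores, key=control_scores.get)
```

With lam = 1 the reward depends only on accuracy, and with lam = 0 it depends only on efficiency. For hypothetical scores {"N2": 0.3, "N3": 0.7}, the sketch selects N3, which mirrors the quoted statement that either N2 or N3 executes depending on which has the higher control score.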

Regarding Claim 2, 
The combination of Almahairi and Liu teaches 
The method of claim 1, 

Almahairi further teaches: 
wherein the neural network comprises a plurality of components, (Page 2: “DCN introduces the use of two alternative sub-networks for the bottom layers f: the coarse layers fc or the fine layers ff , which differ in their capacity.” teaches that the DCN contains a plurality of layers (components))

wherein the plurality of components comprise a plurality of partitions each including a respective plurality of subnetworks, (Page 2: “The fine layers correspond to a high-capacity sub-network which has a high-computational requirement, while the coarse layers constitute a low-capacity sub-network.” teaches that the fine layers and coarse layers are partitioned into subnetworks)

wherein the subnetworks in each partition are each configured to receive a same type of layer input and to generate a same type of layer output as each other subnetwork in the partition, (Page 2: “The fine layers correspond to a high-capacity sub-network which has a high-computational requirement, while the coarse layers constitute a low-capacity sub-network… Specifically, we apply the coarse layers fc on the whole input x, and leverage the fine layers ff only at a few “important” input regions.” teaches that the high-capacity subnetwork has the same type of input (whole input x) and the low-capacity sub-network has the same type of input (important input regions); Page 3: “We then compute the output of the model based completely on the coarse vectors, i.e. the coarse model’s output hc(x) = g(fc(x)).” teaches that the coarse layers of the low-capacity sub-network outputs the same type of output (coarse vectors); Page 3: “Next we apply the fine layers ff only on the selected patches and obtain a small set of fine representation vectors… where fi,j = ff (xi,j ).” teaches that the fine layers of the high-capacity sub-network outputs the same type of output (fine vectors))

and wherein selecting, based at least on [a] usage input, a proper subset of the plurality of neural network layers to be active while processing the network input comprises: selecting a respective subnetwork from each of the partitions. (Page 2: “DCN introduces the use of two alternative sub-networks for the bottom layers f: the coarse layers fc or the fine layers ff , which differ in their capacity. The fine layers correspond to a high-capacity sub-network which has a high-computational requirement, while the coarse layers constitute a low-capacity sub-network… The key idea behind DCN is to have g use representations from either the coarse or fine layers in an adaptive, dynamic way. Specifically, we apply the coarse layers fc on the whole input x, and leverage the fine layers ff only at a few “important” input regions. This way, the DCN can leverage the capacity of ff , but at a lower computational cost, by applying the fine layers only on a small portion of the input” teaches selecting high-capacity subnetworks or low-capacity subnetworks (subset of fine and coarse layers) to be used on the input to the DCN)

Regarding Claim 3, 
The combination of Almahairi and Liu teaches 
The method of claim 2, 

Almahairi further teaches: 
wherein at least one subnetwork in each partition consumes a different amount of computational resources than at least one other subnetwork in the partition. (Page 2: “DCN introduces the use of two alternative sub-networks for the bottom layers f: the coarse layers fc or the fine layers ff , which differ in their capacity. The fine layers correspond to a high-capacity sub-network which has a high-computational requirement, while the coarse layers constitute a low-capacity sub-network… Alternatively, consider applying the top layers only on the coarse representation, i.e. hc(x) = g(fc(x)). We refer to this composition hc = g ◦ fc as the coarse model. Conceptually, the coarse model can be much more computationally efficient, but is expected to have worse performance than the fine model.” teaches that the high-capacity sub-network has a high computational requirement and the low-capacity sub-network has a low computational requirement)

Regarding Claim 4, 
The combination of Almahairi and Liu teaches 
The method of claim 2, 

Almahairi further teaches: 
wherein the components further comprise at least one of a base neural network layer or an output layer in addition to the plurality of subnetworks. (Page 2: “We consider a deep neural network h, which we decompose into two parts: h(x) = g(f(x)) where f and g represent respectively the bottom layers and top layers of the network h while x is some input data.” and “DCN introduces the use of two alternative sub-networks for the bottom layers f: the coarse layers fc or the fine layers ff , which differ in their capacity.” teaches that the deep neural network contains bottom layers and top layers and that the sub-networks are used for the bottom layers; therefore, the deep neural network contains top layers that are not part of the sub-networks)

Regarding Claim 5, 
The combination of Almahairi and Liu teaches 
The method of claim 2, 

Almahairi further teaches: 
wherein the components are arranged in a sequence from a first component in the sequence to a last component in the sequence, (Page 2: “Top layers g consider as input the bottom layers’ representations f(x) and output a distribution over labels.” teaches that the layers of the deep neural network are arranged from bottom layers (first component) to top layers (last component) in a sequence) 

and wherein selecting a respective subnetwork from each partition comprises, for each of the partitions: (Page 2: “DCN introduces the use of two alternative sub-networks for the bottom layers f: the coarse layers fc or the fine layers ff , which differ in their capacity. The fine layers correspond to a high-capacity sub-network which has a high-computational requirement, while the coarse layers constitute a low-capacity sub-network… The key idea behind DCN is to have g use representations from either the coarse or fine layers in an adaptive, dynamic way. Specifically, we apply the coarse layers fc on the whole input x, and leverage the fine layers ff only at a few “important” input regions. This way, the DCN can leverage the capacity of ff , but at a lower computational cost, by applying the fine layers only on a small portion of the input” teaches selecting high-capacity subnetworks or low-capacity subnetworks (subset of fine and coarse layers) to be used on the input to the DCN)

Liu further teaches: 
processing a controller input for the partition using the controller neural network conditioned on the usage input, wherein the controller input for the partition comprises a preceding partition input for the partition, (Page 1: “A D2NN is a feed-forward deep neural network (directed acyclic graph of differentiable modules) augmented with one or more control modules. A control module is a sub network whose output is a decision that controls whether other modules can execute. Fig. 1 (left) illustrates a simple D2NN with one control module (Q) and two regular modules (N1, N2), where the controller Q outputs a binary decision on whether module N2 executes. For certain inputs, the controller may decide that N2 is unnecessary and instead execute a dummy node D to save on computation” teaches using a control module that is a subnetwork (controller neural network) to control execution of other subnetworks; Fig. 1 shows input from subnetwork N1 to control module Q (preceding partition input))

[Figure: media_image2.png]


and wherein the controller neural network is configured to process the controller input to generate a score distribution comprising a respective score for each layer in the partition, (Page 3: “After we execute a control node, it outputs a set of control scores, one for each of its outgoing control edges. The control edge with the highest score is “activated”, meaning that the node being controlled is allowed to execute. The rest of the control edges are not activated, and their controllees are not allowed to execute. For example, in Fig 1 (right), the node Q controls N2 and N3. Either N2 or N3 will execute depending on which has the higher control score.” teaches that the control module generates a score distribution including a score for each subnetwork)

and selecting a subnetwork from the partition using the score distribution for the partition, wherein, for each partition after a first partition in the sequence, the preceding partition input for the partition identifies a subnetwork that was selected from the preceding partition in the sequence. (Page 3: “After we execute a control node, it outputs a set of control scores, one for each of its outgoing control edges. The control edge with the highest score is “activated”, meaning that the node being controlled is allowed to execute. The rest of the control edges are not activated, and their controllees are not allowed to execute. For example, in Fig 1 (right), the node Q controls N2 and N3. Either N2 or N3 will execute depending on which has the higher control score.” teaches selecting a subnetwork to execute based on a score distribution from a controller module; Page 3: “Also, when the execution of a node is skipped, its output will be either the default value or null. If the output is the default value, subsequent execution will continue as usual. If the output is null, any downstream nodes that depend on this output will in turn skip execution and have a null output unless a default value has been set. This “null” effect will propagate to the rest of the graph. Fig. 1 (right) shows a slightly more complicated example with default values: if N2 skips execution and outputs null, so will N4 and N6. But N8 will execute regardless because its input data edge has a default value.” teaches that a subnetwork outputs a null value if the controller module does not select it for execution; therefore, a subnetwork dependent on a skipped subnetwork will receive a null value, which identifies that the preceding subnetwork was skipped and suggests that the subnetwork that was selected can be identified based on these null values)

Almahairi and Liu are analogous art because they are directed to conditional computation using neural networks.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Liu’s controller modules to determine execution of subnetworks of the DCN of Almahairi with a motivation to improve efficiency of the neural network. (Liu, Page 1)
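
For illustration only, the "null" propagation quoted from Liu above can be sketched in Python as follows. This sketch is not code from Liu. The function name executed_nodes, the simple chain topology N2, N4, N6, N8 used in the usage note, and the modeling of default values as a set of (parent, child) data edges are assumptions; the actual topology of Liu's Fig. 1 (right) is not reproduced here.

```python
from typing import Dict, List, Set, Tuple

def executed_nodes(
    order: List[str],
    parents: Dict[str, List[str]],
    deactivated: Set[str],
    default_edges: Set[Tuple[str, str]],
) -> Dict[str, bool]:
    """Return, for each node in topological order, whether it executes.

    order         -- nodes listed so that parents come before children
    parents       -- data-edge parents of each node
    deactivated   -- nodes whose control edge was not activated
    default_edges -- (parent, child) data edges that carry a default value
    """
    outputs_null: Dict[str, bool] = {}
    executed: Dict[str, bool] = {}
    for node in order:
        # A node is starved if some parent output is null and the connecting
        # data edge has no default value to fall back on.
        starved = any(
            outputs_null[p] and (p, node) not in default_edges
            for p in parents.get(node, [])
        )
        runs = node not in deactivated and not starved
        executed[node] = runs
        # Simplifying assumption: a node that does not run outputs null.
        outputs_null[node] = not runs
    return executed
```

Under the assumed chain N2, N4, N6, N8 with a default value on the edge from N6 to N8, deactivating N2 causes N4 and N6 to be skipped while N8 still executes, which is consistent with the behavior described in the quoted passage.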

Regarding Claim 6, 
The combination of Almahairi and Liu teaches 
The method of claim 5,

Liu further teaches: 
wherein the controller input further comprises a partition input for the partition, and wherein the partition input is: for the first partition in the sequence, (i) the network input or (ii) an output generated by a component preceding the first partition in the sequence, and for each other partition in the sequence other than the first partition, an output generated by a component preceding the first partition in the sequence. (Fig. 1 teaches that the subnetworks (partitions) that are first in the sequence receive the neural network input as input (blank circle in Fig. 1), subnetworks that are not first receive the output of a preceding subnetwork)

[Figure: media_image2.png]

The combination of claim 5 has already incorporated the controller modules, therefore already incorporating the details of the partition inputs required by claim 6. 


Regarding Claim 8, 
The combination of Almahairi and Liu teaches 
The method of claim 5,

Liu further teaches: 
wherein the controller neural network has been trained jointly with the task neural network to maximize a reward function using reinforcement learning. (Page 1: “A D2NN is trained end to end. That is, regular modules and control modules are jointly trained to optimize both accuracy and efficiency. We achieve such training by integrating backpropagation with reinforcement learning, necessitated by the non-differentiability of control modules.” teaches that the controller modules are trained jointly using reinforcement learning; Page 4: “That is, the goal is to learn the parameters of the control node to maximize a user-defined reward, which in our case is a combination of accuracy and efficiency. This results in a classical reinforcement learning setting: learning a control policy to take actions so as to maximize reward.” teaches performing reinforcement learning to maximize a reward) 

The combination of claim 5 has already incorporated the controller modules, therefore already incorporating the details of the joint training required by claim 8. 

Regarding Claim 9,
This claim recites A system… which has limitations that are similar to those of claim 1, thus is rejected with the same rationale applied against claim 1.

Regarding Claim 10,
This claim recites The system of claim 9… which has limitations that are similar to those of claim 2, thus is rejected with the same rationale applied against claim 2.

Regarding Claim 11,
This claim recites The system of claim 10… which has limitations that are similar to those of claim 3, thus is rejected with the same rationale applied against claim 3.

Regarding Claim 12,
This claim recites The system of claim 10… which has limitations that are similar to those of claim 4, thus is rejected with the same rationale applied against claim 4.

Regarding Claim 13,
This claim recites The system of claim 10… which has limitations that are similar to those of claim 5, thus is rejected with the same rationale applied against claim 5.

Regarding Claim 14,
This claim recites The system of claim 13… which has limitations that are similar to those of claim 6, thus is rejected with the same rationale applied against claim 6.

Regarding Claim 16,
This claim recites The system of claim 13… which has limitations that are similar to those of claim 8, thus is rejected with the same rationale applied against claim 8.

Regarding Claim 17,
This claim recites One or more non-transitory computer readable storage media… which has limitations that are similar to those of claim 1, thus is rejected with the same rationale applied against claim 1.

Regarding Claim 18,
This claim recites The computer readable storage media of claim 17… which has limitations that are similar to those of claim 2, thus is rejected with the same rationale applied against claim 2.

Regarding Claim 19,
This claim recites The computer readable storage media of claim 18… which has limitations that are similar to those of claim 3, thus is rejected with the same rationale applied against claim 3.

Regarding Claim 20,
This claim recites The computer readable storage media of claim 18… which has limitations that are similar to those of claim 5, thus is rejected with the same rationale applied against claim 5.

Response to Arguments
35 U.S.C. 103: 
Applicant’s argument: 
“Applicant respectfully submits that the cited portions of Liu do not disclose or suggest receiving a usage input that is “different from the network input,” as recited by amended claim 1. Moreover, the cited portions of Liu do not disclose or suggest “generating, using a controller neural network that is conditioned on the usage input different from the network input, a respective score for each subnetwork” and “selecting a subnetwork from the plurality of subnetworks of the task neural network using the respective scores,” as recited by amended claim 1.”

Response: 
Almahairi teaches: 
receiving a network input for processing by a task neural network, the task neural network comprising a plurality of neural network layers; (Fig. 1 and Page 2: “We consider a deep neural network h, which we decompose into two parts: h(x) = g(f(x)) where f and g represent respectively the bottom layers and top layers of the network h while x is some input data.” teaches receiving input data x for a deep neural network (task neural network) that comprises a plurality of layers)

Liu teaches:
receiving a usage input that is different from the network input and that specifies a respective weight for each of one or more usage factors, wherein each usage factor impacts how many computational resources are used… (Page 6: “During training we define the Q-learning reward as a linear combination of accuracy A and efficiency E (negative cost): r = λA + (1 − λ)E where λ ∈ [0, 1]. We train instances of high-low capacity D2NNs using different λ’s. As λ increases, the learned D2NN trades off efficiency for accuracy. Fig. 2a) plots the accuracy-cost curve on the test set; it also plots the accuracy and efficiency achieved by a conventional DNN with only the high capacity path N1+N2 (High NN) and a conventional DNN with only the low capacity path N1+N3 (Low NN). As we can see, the D2NN achieves a trade-off curve close to the upperbound: there are points on the curve that are as fast as the low-capacity node and as accurate as the high-capacity node. Fig. 4(left) plots the distribution of examples going through different execution paths. It shows that as λ increases, accuracy becomes more important and more examples go through the high-capacity node.” teaches receiving a Q-learning reward that is a combination of accuracy and efficiency and includes a weight λ (usage input specifying a weight for a usage factor, different from the network input), and that as the weight λ increases, accuracy becomes more important, which places greater emphasis on high-capacity nodes and increases computational usage)

As noted above, the input data x (network input) of Liu is different from the weight λ (usage input) of Liu. Therefore, Liu does disclose receiving a usage input that is different from the network input. Furthermore, Liu teaches the following:

generating, using a controller neural network that is conditioned on the usage input different from the network input, a respective score for each subnetwork of a plurality of subnetworks of the task neural network, (Page 3, Section 3: “Given a D2NN, we perform inference by traversing the graph starting from the input nodes. Because a D2NN is a DAG, we can execute each node in a topological order (the parents of a node are ordered before it; we take both data edges and control edges in consideration), same as conventional DNNs except that the control nodes can cause the computation of some nodes to be skipped. After we execute a control node, it outputs a set of control scores, one for each of its outgoing control edges. The control edge with the highest score is “activated”, meaning that the node being controlled is allowed to execute. The rest of the control edges are not activated, and their controllees are not allowed to execute. For example, in Fig 1 (right), the node Q controls N2 and N3. Either N2 or N3 will execute depending on which has the higher control score.” teaches using the D2NN (controller neural network) which is conditioned on λ to generate scores for nodes of the D2NN)

selecting a subnetwork from the plurality of subnetworks of the task neural network using the respective scores; and (Page 3, Section 3: “Given a D2NN, we perform inference by traversing the graph starting from the input nodes. Because a D2NN is a DAG, we can execute each node in a topological order (the parents of a node are ordered before it; we take both data edges and control edges in consideration), same as conventional DNNs except that the control nodes can cause the computation of some nodes to be skipped. After we execute a control node, it outputs a set of control scores, one for each of its outgoing control edges. The control edge with the highest score is “activated”, meaning that the node being controlled is allowed to execute. The rest of the control edges are not activated, and their controllees are not allowed to execute. For example, in Fig 1 (right), the node Q controls N2 and N3. Either N2 or N3 will execute depending on which has the higher control score.” teaches selecting the node with the highest score to activate among the plurality of nodes)

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHOUN ABRAHAM whose telephone number is (571)272-8144. The examiner can normally be reached Mon - Fri 08:00-16:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar can be reached on (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/S.J.A./Examiner, Art Unit 2125
/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125