DETAILED ACTION
1. This Action is in response to the amendments and arguments filed 18 November 2021 for application 15/692371, filed on 31 August 2017. Claims 1-4, 6, and 9-23 are currently pending. Claims 5, 7, and 8 have been canceled. The claim objections have been withdrawn in light of the amendments.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments with respect to claims 1-4, 6, and 9-23 have been considered but are moot because the new ground of rejection, in further view of Li et al., does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly 
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-3, 6, 10-12, 15-17, 21, and 22 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (“Densely Connected Convolutional Networks”, https://arxiv.org/pdf/1608.06993v4.pdf, arXiv:1608.06993v4 [cs.CV] 27 August 2017, pp. 1-9), hereinafter referred to as Huang, in view of Ioffe et al. (“Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”), hereinafter referred to as Ioffe, in view of Jegou et al. (“The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation”, https://arxiv.org/pdf/1611.09326.pdf, arXiv:1611.09326v2 [cs.CV] 5 December 2016, pp. 1-9), hereinafter referred to as Jegou, and in further view of Li et al. (“Ensemble Speaker Modeling using Speaker Adaptive Training Deep Neural Network for Speaker Adaptation”, INTERSPEECH, 2015, pp. 2892-2896), hereinafter referred to as Li.

In regards to claim 1, Huang teaches a method for establishing a plurality of densely connected neural networks in a joint combined model, comprising: creating each densely connected neural network from the plurality of densely connected neural networks by: ([Abstract, p. 3, Section 3, Figure 2], In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion… Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet., To facilitate down-sampling in our architecture we divide the network into multiple densely connected dense blocks; see Figure 2. We refer to layers between blocks as transition layers, which do convolution and pooling., wherein the densely connected neural network DenseNet is formed using a computer program and wherein a set of dense blocks is connected sequentially as shown in Figure 2, which forms a joint combined model in the sense that each dense block is a distinct model that performs a function complementary to the other models (namely, to accommodate downsizing of feature-maps).)
creating, at a computer system, a first, a second, a third, and a fourth layer of a neural network, each of the layers comprising a respective dense function component, a respective batch normalization component and a respective dropout component that are sequentially connected; ([p. 3, Section 3, p. 4, Section 3, pp. 5-6, Section 4.2, Figure 1, Table 1], The network comprises L layers, each of which implements a non-linear transformation H_ℓ(·), where ℓ indexes the layer. H_ℓ(·) can be a composite function of operations such as Batch Normalization (BN) [14], rectified linear units (ReLU) [6], Pooling [19], or Convolution (Conv). … Consequently, the ℓ-th layer receives the feature-maps of all preceding layers, x_0, ..., x_{ℓ−1}, as input: x_ℓ = H_ℓ([x_0, x_1, ..., x_{ℓ−1}]), (2) where [x_0, x_1, ..., x_{ℓ−1}] refers to the concatenation of the feature-maps produced in layers 0, ..., ℓ−1., We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ, as DenseNet-B., For the three datasets without data augmentation, i.e., C10, C100 and SVHN, we add a dropout layer [32] after each convolutional layer (except the first one) and set the dropout rate to 0.2., wherein the densely connected neural network DenseNet (Figure 1) includes an input layer as well as first, second, third, and fourth layers, and wherein each convolution layer (including layer 1) includes a dense function component in the form of H, which expresses the connectivity of that layer to subsequent layers with a bottleneck transformation applied across those inputs, includes a batch normalization component, which may also be incorporated into the composite function H, and includes a dropout component after each convolutional layer, such as in DenseNet-B, for which the sequential implementation of the dense function operation and the batch normalization operation is indicated semantically by the {BN-ReLU-Conv(1×1)}-{BN}-ReLU-Conv(3×3) architecture, with the dropout layer component following the last convolution layer in that architecture for each layer of the dense block shown in Figure 1.)
wherein each dense function component comprises a plurality of neurons, each of which performs a transformative operation on a neuron input to form a respective dense function output, ([p. 4, Section 3], It has been noted in [36, 11] that a 1×1 convolution can be introduced as bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency. We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ, as DenseNet-B., wherein the function H includes a bottleneck transformation (convolutional layer of CNN receiving neuron input from previous layers) applied to the inputs to a layer prior to a batch normalization layer “BN” for each of the 4 layers of each dense block, and wherein the output of this operation is a dense function output because it is a transformation across all inputs received from preceding layers.)
each batch normalization component adjusts … the respective dense function output ([p. 4, Section 3], It has been noted in [36, 11] that a 1×1 convolution can be introduced as bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency. We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ, as DenseNet-B., wherein the function H includes a batch normalization component applied after the application of the dense function transformation as indicated above.)
and each dropout component selects a certain portion of the adjusted respective dense function output for dropout or replacement; ([p. 4, Section 3, pp. 5-6, Section 4.2, Figure 1], We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ, as DenseNet-B., For the three datasets without data augmentation, i.e., C10, C100 and SVHN, we add a dropout layer [32] after each convolutional layer (except the first one) and set the dropout rate to 0.2., wherein a dropout layer is applied after each convolution layer (except the first), which may be interpreted to be after the Conv(3×3) layer of DenseNet-B, such that this dropout layer is applied to the adjusted dense function output because its operation occurs after the batch normalization layer indicated above.)
establishing a first set of data communication pathways, by the computer system, leading from outputs … of the first layer to respective dense function components of the second layer, the third layer, and the fourth layer; ([p. 2, Section 1, Figure 1], Hence, the ℓ-th layer has ℓ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all L−ℓ subsequent layers. This introduces L(L+1)/2 connections in an L-layer network, instead of just L, as in traditional architectures., wherein each layer is connected (data communication pathway) to all of the preceding layers (i.e., each layer passes its output features to each/respective successive layer and the respective dense function component H of each/respective successive layer) such that the output of the first layer is conveyed to the second, third, and fourth layers, and wherein a dropout layer is applied after each convolution layer (except the first), which may be interpreted to be after the Conv(3×3) layer of DenseNet-B, such that this dropout layer is applied to the adjusted dense function output because its operation occurs after the batch normalization layer indicated above, with that output thereby being the output of that respective densely connected layer to be conveyed to subsequent densely connected layers.)
establishing a second set of data communication pathways, by the computer system, leading from outputs of a second dropout component of the second layer to respective dense function components of the third layer and the fourth layer; ([p. 2, Section 1, p. 4, Section 3, pp. 5-6, Section 4.2, Figure 1], Hence, the ℓ-th layer has ℓ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all L−ℓ subsequent layers. This introduces L(L+1)/2 connections in an L-layer network, instead of just L, as in traditional architectures., We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ, as DenseNet-B., For the three datasets without data augmentation, i.e., C10, C100 and SVHN, we add a dropout layer [32] after each convolutional layer (except the first one) and set the dropout rate to 0.2., wherein each layer is connected (data communication pathway) to all of the preceding layers (i.e., each layer passes its output features to each/respective successive layer and the respective dense function component H of each/respective successive layer) such that the output of the second layer is conveyed to the third and fourth layers, and wherein each layer consists of successive sublayers of operations at the end of which, as pointed out above, is the application of the dropout operation for each layer, such that the output of this operation is thereby directed to each of the successive layers.)
and establishing a third set of data communication pathways, by the computer system, leading from outputs of a third dropout component of the third layer to the respective dense function component of the fourth layer; ([p. 2, Section 1, p. 4, Section 3, pp. 5-6, Section 4.2, Figure 1], Hence, the ℓ-th layer has ℓ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all L−ℓ subsequent layers. This introduces L(L+1)/2 connections in an L-layer network, instead of just L, as in traditional architectures., We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ, as DenseNet-B., For the three datasets without data augmentation, i.e., C10, C100 and SVHN, we add a dropout layer [32] after each convolutional layer (except the first one) and set the dropout rate to 0.2., wherein each layer is connected (data communication pathway) to all of the preceding layers (i.e., each layer passes its output features to each/respective successive layer and the respective dense function component H of each/respective successive layer) such that the output of the third layer is conveyed to the fourth layer, and wherein each layer consists of successive sublayers of operations at the end of which, as pointed out above, is the application of the dropout operation for each layer, such that the output of this operation is thereby directed to each of the successive layers.)
wherein an input layer of the neural network provides inputs to respective dense function components of the first layer, the second layer, the third layer, and the fourth layer; ([Abstract, Figure 1], For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers., wherein each layer is connected (data communication pathway) to all of the preceding layers (i.e., the input layer passes its features to each/respective successive layer and the respective dense function component H of each/respective successive layer) such that the input layer data is conveyed to the first, second, third, and fourth layers.)
and storing the created neural network together with the established sets of data communication pathways between layers in a memory of the computer system; … ([Table 1, Table 2], wherein several different DenseNet architectures (including respective communication pathways as previously indicated) are formed/established (Table 1, also DenseNet-B) from alternative design parameters such that each one that is established is stored in the memory of the computer system to perform various evaluation assessments (Table 2).)
However, Huang does not explicitly teach wherein each of the layers comprises … a respective dropout component … a scale of … outputs of a first dropout component … connecting a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, in which each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer; and connecting the set of sub-task modules to a combined prediction component that calculates a prediction output. Although Huang makes use of the commonly used batch normalization technique, he does not disclose that it adjusts a scale of output data. Also, although Huang teaches the application of a dropout function after the last convolutional layer for each layer in a dense block, he does not clearly teach that this is applied in the first layer in a dense block (i.e., the dropout layer is added after all convolutional layers except the first, which may be interpreted as the convolutional layer at the end of the first layer, with the dropout layer being applicable after that convolutional layer in each subsequent dense neural network layer). Although Huang teaches the usage of a sequence of dense blocks each connected by transition layers (Figure 2), such that each of these dense blocks generates a different set of features and thus performs, in a general sense, a distinct (sub) task, and such that the output from the sequence of dense blocks is processed through a single neural network component (pooling, linear – Figure 2) to generate a prediction output, Huang does not teach a set of parallel sub-task operations that process a set of features such as may be output from the set of dense blocks.
However, Ioffe, in the analogous environment of training deep neural networks, teaches wherein the batch normalization component is configured to adjust the scale of output data from the dense function component. ([p. 1, Section 1, p. 3, Section 3, Algorithm 1], However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer. Consider a network computing ℓ = F_2(F_1(u, Θ_1), Θ_2) where F_1 and F_2 are arbitrary transformations, and the parameters Θ_1, Θ_2 are to be learned so as to minimize the loss., For a layer with d-dimensional input x = (x^(1) ... x^(d)), we will normalize each dimension x̂^(k) = (x^(k) − E[x^(k)]) / sqrt(Var[x^(k)]) where the expectation and variance are computed over the training data set., wherein batch normalization rescales the input into any layer (output from any preceding layer) of a deep neural network as representable in a generalized learning framework of learning parameters of each layer.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang to incorporate the teachings of Ioffe to use batch normalization to adjust the scale of output data received by any layer of a densely connected convolutional network. The modification would have been obvious because one of ordinary skill would have been motivated to achieve superior learning capability of deep neural networks by normalizing the scale of network activations by reducing the internal co-variate shift on input distributions (Ioffe, [Abstract, p. 4, Section 3]).
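For illustration only, the per-dimension normalization quoted from Ioffe above can be sketched in Python as follows. This is a minimal sketch and not Ioffe's implementation: the function name batch_normalize, the sample values, and the value of the stability constant eps are assumptions made for this example, and the learned scale-and-shift step is omitted.

```python
import math

def batch_normalize(batch, eps=1e-5):
    """Normalize each dimension of a mini-batch to zero mean and unit variance.

    batch: list of equal-length lists of floats (one inner list per example).
    eps:   small stability constant; the value here is an assumption.
    """
    n = len(batch)
    d = len(batch[0])
    # Per-dimension mean and (biased) variance over the mini-batch.
    means = [sum(row[k] for row in batch) / n for k in range(d)]
    variances = [sum((row[k] - means[k]) ** 2 for row in batch) / n for k in range(d)]
    return [
        [(row[k] - means[k]) / math.sqrt(variances[k] + eps) for k in range(d)]
        for row in batch
    ]

# Hypothetical data: two dimensions on very different scales.
normalized = batch_normalize([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
```

After normalization, both dimensions of the hypothetical data lie on a comparable scale, which is the sense in which the operation adjusts the scale of a layer's output.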
Huang and Ioffe do not explicitly teach wherein each of the layers comprises … a respective dropout component … outputs of a first dropout component of the first layer … connecting a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, in which each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer; and connecting the set of sub-task modules to a combined prediction component that calculates a prediction output. Although Ioffe teaches the implementation of a dropout layer in combination with batch normalization, he does not disclose the communication of the dropout layer output across successive layers of a dense neural network. Ioffe does not address multi-tasking frameworks. As noted previously, Huang teaches the dropout-related limitations except for the first layer, which he does not disclose as having a dropout component.
However, Jegou, in the analogous environment of designing and implementing dense neural networks, teaches wherein each of the layers comprises a respective dense function component, a respective batch normalization component and a respective dropout component that are sequentially connected … establishing a first set of data communication pathways, by the computer system, leading from outputs of a first dropout component of the first layer to the respective dense function components of the second layer, the third layer, and the fourth layer. ([pp. 3-4, Section 3.1, Figure 2, p. 4, Section 3.3, Table 1], Thus, the output of the ℓ-th layer is defined as <equation 3> where [ ... ] represents the concatenation operation. In this case, H is defined as BN, followed by ReLU, a convolution and dropout…. Figure 2 shows an example of dense block construction. Starting from an input x_0 (input image or output of a transition down) with m feature maps, the first layer of the block generates an output x_1 of dimension k by applying H_1(x_0). These k feature maps are then stacked to the previous m feature maps by concatenation ([x_1, x_0]) and used as input to the second layer. The same operation is repeated n times, leading to a new dense block with n × k feature maps., A second layer is then applied to create another k features maps, which are again concatenated to the previous feature maps. The operation is repeated 4 times., First, in Table 1, we define the dense block layer, transition down and transition up of the architecture. Dense block layers are composed of BN, followed by ReLU, a 3 × 3 same convolution (no resolution loss) and dropout with probability p = 0.2., wherein, in the downsampling path of the dense neural network architecture, feature map outputs are generated from a DenseNet layer that successively processes an input consisting of a concatenation of outputs from previous layers and the input into the first layer (corresponding to the dense function transformation) by successively applying a batch normalization layer and a dropout layer (both components of which are incorporated into the function H that characterizes the overall operation of that layer), and wherein the output of that layer, including the output (feature map) of the first layer, is conveyed to all successive layers since it is a part of the concatenation of inputs for each successive layer (equation 3).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang and Ioffe to incorporate the teachings of Jegou to communicate the output of a first dense neural network layer generated from the application of a dropout operation to each successive dense neural network layer. The modification would have been obvious because one of ordinary skill would have been motivated to achieve superior performance of deep neural networks using a DenseNets architecture in which each layer, including the first, generates its output from a dropout layer and conveys that to successive layers to encourage reuse of features with each layer in the architecture receiving a direct supervision signal (Jegou, [Abstract, p. 3, Section 3.1, p. 6, Section 4.3, Table 4]).
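For illustration only, the dense connectivity relied upon above (each layer receiving the concatenation of the block input and the outputs of all earlier layers) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names dense_block_inputs and run_dense_block are hypothetical, and the stand-in layer function simply sums its inputs in place of the BN, ReLU, convolution, and dropout sequence described in the references.

```python
def dense_block_inputs(num_layers):
    """Map each layer index to the sources that feed it.

    Source 0 is the block input; source i (i >= 1) is the output of layer i.
    Layer i receives every earlier source.
    """
    return {layer: list(range(layer)) for layer in range(1, num_layers + 1)}

def run_dense_block(x0, layer_fns):
    """Apply the layers in order, feeding each one the concatenation of the
    block input and the outputs of all preceding layers."""
    features = [x0]  # features[i] is a list of floats produced by source i
    for fn in layer_fns:
        concatenated = [v for feat in features for v in feat]
        features.append(fn(concatenated))
    return features

# Four-layer block: which sources feed which layer, and how many pathways exist.
pathways = dense_block_inputs(4)
total_pathways = sum(len(sources) for sources in pathways.values())

# Hypothetical stand-in for H: one output value equal to the sum of the inputs.
stand_in_layer = lambda inputs: [sum(inputs)]
outputs = run_dense_block([1.0], [stand_in_layer] * 4)
```

For four layers the sketch yields ten pathways, matching the L(L+1)/2 count quoted from Huang with L = 4, and the output of the first layer appears among the sources of the second, third, and fourth layers.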
However, Huang, Ioffe, and Jegou do not explicitly teach connecting a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, wherein each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer; and connecting the set of sub-task modules to a combined prediction component that calculates a prediction output.
Although Jegou teaches the concatenation of a sequence of denseblocks with output from one denseblock also densely routed to other denseblocks (Figure 1), he does not disclose multitasking operations that process output from, say, a dense block into a set of sub-task modules (with or without denseblocks) each of which generates a result that is then later combined.
However, Li, in the analogous environment of training deep multi-tasking neural networks, teaches connecting a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, wherein each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer; and connecting the set of sub-task modules to a combined prediction component that calculates a prediction output. ([Abstract, p. 2893, Section 2.1, p. 6, Section 4.3, Figure 1, Figure 2], We first train a speaker-independent DNN (SIDNN) acoustic model as a universal speaker model (USM). Based on the USM, a SAT-DNN is used to obtain a set of speaker-dependent models by assuming that all other layers except one speaker-dependent (SD) layer are shared among speakers., The SAT-DNN proposed in [7] can be regarded as a multi-task learning. In NICT-SAT-DNN [7], the DNN architecture is configured as shown in Figure 1. All of the DNN layers are shared among speakers except one SD layer. The parameters in the SD layer are updated only for a specific speaker while the parameters for all of the shared layers are updated for all speakers., wherein a joint neural network model framework includes an initial generalized (speaker independent) neural network component that processes input vectors to generate features (e.g., deep neural network features) that are then processed by each neural network in a multi-tasking component of that joint neural network model framework, in which each neural network is directed to a particular sub-task and produces a particular task-specific output (task value) that is then fed into a third component (shared speaker independent component) that combines the output from each of the individual/sub-task neural networks and generates a predicted output at the output layer, such that each of the sub-task neural networks has a deep representation (Figure 2), which is also a dense layer representation because each layer in that neural network is interpreted as having, in general, full connectivity (but is also dense in a more general sense of forming a distributed neural representation of the speaker).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang, Ioffe, and Jegou to incorporate the teachings of Li to connect a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, in which each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer; and connecting the set of sub-task modules to a combined prediction component that calculates a prediction output. The modification would have been obvious because one of ordinary skill would have been motivated to achieve superior and efficient recognition performance in a multi-tasking application by modeling each sub-task as a separate deep component in a joint modeling framework with sub-task-specific adaptation (Li, [Abstract, p. 2892, Section 1, Table 2]).
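For illustration only, the arrangement addressed above (shared features fed in parallel to several sub-task modules whose task values are then combined into one prediction) can be sketched in Python as follows. This is a minimal sketch and not the architecture of Li or of the claims: the names dense_layer and joint_model, the two sub-task modules, their weights, and the averaging combiner are all hypothetical choices made for this example.

```python
def dense_layer(inputs, weights, bias):
    """Fully connected layer: one weighted sum (plus bias) per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]

def joint_model(shared_features, sub_task_modules, combine):
    """Feed the same shared features to every sub-task module in parallel,
    then pass the per-task values to a single combining component."""
    task_values = [module(shared_features) for module in sub_task_modules]
    return task_values, combine(task_values)

# Hypothetical sub-task modules, each built on one dense layer with one output unit.
task_a = lambda f: dense_layer(f, [[1.0, 0.0]], [0.0])[0]
task_b = lambda f: dense_layer(f, [[0.0, 1.0]], [1.0])[0]

# Hypothetical combined prediction component: the mean of the task values.
values, prediction = joint_model([2.0, 3.0], [task_a, task_b], lambda v: sum(v) / len(v))
```

Each module sees the same upstream features and produces its own task value; only the combining component sees all task values together.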

In regards to claim 2, the rejection of claim 1 is incorporated and Huang further teaches the method further comprising optimizing the neural network on an output task using historical data related to the output task. ([p. 5, Section 4.1, p. 5, Section 4.2, Table 2], SVHN. The Street View House Numbers (SVHN) dataset [24] contains 32×32 colored digit images. There are 73,257 images in the training set, 26,032 images in the test set, and 531,131 images for additional training…. We select the model with the lowest validation error during training and report the test error., All the networks are trained using stochastic gradient descent (SGD)., wherein the neural network is optimized using stochastic gradient descent to minimize the test error when applied, for example, to an existing (historical) dataset consisting of street view house numbers in an image recognition/classification task.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang to incorporate the teachings of Ioffe, Jegou, and Li for the same reasons as pointed out for claim 1.

In regards to claim 3, the rejection of claim 2 is incorporated and Huang further teaches wherein the optimizing comprises making adjustments to individual neurons of the first layer, the second layer, the third layer, and the fourth layer of the neural network, and making adjustments to an output layer of the neural network.  ([p. 5, Section 4.2, p. 7, Section 5, Table 2, Figure 2, Figure 4], All the networks are trained using stochastic gradient descent (SGD). On CIFAR and SVHN we train using batch size 64 for 300 and 40 epochs, respectively., One explanation for the improved accuracy of dense convolutional networks may be that individual layers receive additional supervision from the loss function through the shorter connections. One can interpret DenseNets to perform a kind of “deep supervision”., wherein the neural network is optimized across all layers (every neuron and its associated connection) using deep supervision through the propagation of gradients derived from the error computed at the output layer (Figure 4 – test error per epoch of training) such that the stochastic gradient descent optimization process adjusts the neurons (weights) across all layers based on adjustments at the output layer in the form of the minimization of error at the output layer in the classification determination (Figure 2) through a deep supervision training process.)  
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang to incorporate the teachings of Ioffe, Jegou, and Li for the same reasons as pointed out for claim 1.

In regards to claim 6, the rejection of claim 1 is incorporated and Huang further teaches wherein a plurality of neurons in the respective dense function component are each connected to a second plurality of neurons in a subsequent layer. ([p. 3, Section 3], The network comprises L layers, each of which implements a non-linear transformation H_ℓ(·), where ℓ indexes the layer. H_ℓ(·) can be a composite function of operations such as Batch Normalization (BN) [14], rectified linear units (ReLU) [6], Pooling [19], or Convolution (Conv). … Consequently, the ℓ-th layer receives the feature-maps of all preceding layers, x_0, ..., x_{ℓ−1}, as input: x_ℓ = H_ℓ([x_0, x_1, ..., x_{ℓ−1}]), (2) where [x_0, x_1, ..., x_{ℓ−1}] refers to the concatenation of the feature-maps produced in layers 0, ..., ℓ−1., wherein each convolution layer (including layer 1) includes a dense function component in the form of H, which comprises the expression of the connectivity of the neurons of that layer (feature maps) to the neurons of subsequent layers, including the connectivity from layer 1 to each subsequent layer in the dense block.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang to incorporate the teachings of Ioffe, Jegou, and Li for the same reasons as pointed out for claim 1.

In regards to claim 10, Huang teaches a computer system for establishing a plurality of densely connected neural networks in a joint combined model, comprising: a processor; and a computer-readable medium having stored thereon a data structure representing a neural network, instructions that are executable to cause the computer system to perform operations comprising: creating a plurality of densely connected neural networks by: ([Abstract, p. 3, Section 3, Figure 2], In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion… Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet., To facilitate down-sampling in our architecture we divide the network into multiple densely connected dense blocks; see Figure 2. We refer to layers between blocks as transition layers, which do convolution and pooling., wherein the densely connected neural network DenseNet is formed using a computer program and wherein a representation that characterizes the flow dependencies of the data processed by that neural network (for example as represented by the architectural parameters shown in Table 1) is a data structure corresponding to an implemented neural network and wherein a set of densenet blocks are connected sequentially as shown in Figure 2 which forms a joint combined model in the sense that each denseblock is a distinct model that performs a function complementary to the other models (namely, to accommodate downsizing of feature-maps).) creating a plurality of layers that successively include an input layer, a first layer, a second layer, a third layer, and an output layer, wherein each of the first layer, the second layer, and the third layer comprises a respective dense function component, a respective batch normalization component and a respective dropout component that are sequentially connected; ([p 3, Section 3, p. 
4, Section 3, pp. 5-6, Section 4.2, Figure 1, Figure 2], The network comprises L layers, each of which implements a non-linear transformation H_ℓ(·), where ℓ indexes the layer. H_ℓ(·) can be a composite function of operations such as Batch Normalization (BN) [14], rectified linear units (ReLU) [6], Pooling [19], or Convolution (Conv). … Consequently, the ℓth layer receives the feature-maps of all preceding layers, x_0, …, x_{ℓ−1}, as input: x_ℓ = H_ℓ([x_0, x_1, …, x_{ℓ−1}]) (2), where [x_0, x_1, …, x_{ℓ−1}] refers to the concatenation of the feature-maps produced in layers 0, …, ℓ−1…. We refer to layers between blocks as transition layers, which do convolution and pooling. The transition layers used in our experiments consist of a batch normalization layer and an 1×1 convolutional layer followed by a 2×2 average pooling layer., We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ, as DenseNet-B., For the three datasets without data augmentation, i.e., C10, C100 and SVHN, we add a dropout layer [32] after each convolutional layer (except the first one) and set the dropout rate to 0.2., wherein the densely connected neural network DenseNet includes an input layer as well as first, second, third, and fourth layers (Figure 1) and an output layer (Figures 1 and 2), wherein each convolution layer (including layer 1, 2, 3, or 4) includes a dense function component in the form of H which expresses the connectivity of that layer to subsequent layers with a bottleneck transformation applied across those inputs, includes (with sequential connectivity) a batch normalization component which may also be incorporated into the composite function H, and includes a dropout component (at least for layers 2, 3, and 4) after each convolutional layer such as in the DenseNet-B for which the sequential implementation of the dense function operation and 
the batch normalization operation is indicated semantically and sequentially by the {BN-ReLU-Conv(1×1)}-{BN}-ReLU-Conv(3×3) architecture with the dropout layer component following the last convolution layer in that architecture for each layer (except possibly for the first) of the dense block shown in Figure 1.)  wherein each dense function component comprises a plurality of neurons, each of which performs a transformative operation on a neuron input to form a respective dense function output, ([p. 4, Section 3], It has been noted in [36, 11] that a 1×1 convolution can be introduced as bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency. We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ, as DenseNet-B., wherein the function H includes a bottleneck transformation applied to the inputs to a layer prior to a batch normalization layer “BN” for each of the 4 layers of each dense block and wherein the output of this operation is a dense function output because it is a transformation across all inputs received from preceding layers.) each batch normalization component adjusts … the respective dense function output ([p. 4, Section 3], It has been noted in [36, 11] that a 1×1 convolution can be introduced as bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency. We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ, as DenseNet-B., wherein the function H includes a batch normalization component applied after the application of the dense function transformation as indicated above.) 
and each dropout component selects a certain portion of the adjusted respective dense function output for dropout or replacement;  ([p. 4, Section 3, pp. 5-6, Section 4.2, Figure 1], We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ, as DenseNet-B., For the three datasets without data augmentation, i.e., C10, C100 and SVHN, we add a dropout layer [32] after each convolutional layer (except the first one) and set the dropout rate to 0.2., wherein a dropout layer is applied after each convolution layer (except possibly for the first layer), which may be interpreted to be after the Conv(3×3) layer of DenseNet-B (or after the 1×1 convolutional layer in the output layer), such that this dropout layer is applied to the adjusted dense function output because its operation occurs after the batch normalization layer indicated above.)  establishing data communication pathways between the plurality of layers such that for each layer except the output layer, an output from each prior layer of the respective layer is ([p. 2, Section 1, Figure 1], Hence, the ℓth layer has ℓ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all L−ℓ subsequent layers. This introduces L(L+1)/2 connections in an L-layer network, instead of just L, as in traditional architectures., wherein each layer is connected (by a data communication pathway) to all of the preceding layers such that the output of the first layer is conveyed to each of the subsequent layers (e.g., the second, third, and fourth layers) and wherein the output of this block passes through an output layer consisting of pooling and rectified linear units to produce the predicted output (e.g., “horse”) such that the only input received by this output layer comes from the final dense network layer in the dense block.)  
wherein an input layer of the neural network provides inputs to respective dense function components of the first layer, the second layer, the third layer, and the fourth layer; ([Abstract, Figure 1], For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers., wherein each layer is connected (by a data communication pathway) to all of the preceding layers (i.e., the input layer passes its features to each/respective successive layer and the respective dense function component H of each/respective successive layer) such that the input layer data is conveyed to the first, second, third, and fourth layers.) and training the neural network of the plurality of densely connected neural networks to optimize on a particular output task… ([p. 5, Section 4.1, p. 5, Section 4.2, Table 2], SVHN. The Street View House Numbers (SVHN) dataset [24] contains 32×32 colored digit images. There are 73,257 images in the training set, 26,032 images in the test set, and 531,131 images for additional training…. We select the model with the lowest validation error during training and report the test error., All the networks are trained using stochastic gradient descent (SGD)., wherein the neural network (including each of the dense blocks shown in Figure 2) is optimized using stochastic gradient descent to minimize the test error when applied, for example, to an existing (historical) dataset consisting of street view house numbers in an image recognition/classification task.)
However, Huang does not explicitly teach wherein each of the first layer, … a respective dropout component …; … a scale of …. connecting a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, wherein each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer; and connecting the set of sub-task modules to a combined prediction component that calculates a prediction output. Although Huang makes use of the commonly used batch normalization technique, he does not disclose that it adjusts a scale of output data. Also, although Huang teaches the application of a dropout function after the last convolutional layer for each layer in a dense block, he does not clearly teach that this is applied in the first layer in a dense block (i.e., the dropout layer is added after all convolutional layers except the first, which may be interpreted as the convolutional layer at the end of the first layer, with the dropout layer being applicable after that convolutional layer in each subsequent dense neural network layer). Although Huang teaches the usage of a sequence of dense blocks each connected by transition layers (Figure 2), such that each of these dense blocks generates a different set of features and thus performs, in a general sense, a distinct (sub-)task, and such that the output from the sequence of dense blocks is processed through a single neural network component (pooling, linear – Figure 2) to generate a prediction output, Huang does not teach a set of parallel sub-task operations that process a set of features such as may be output from the set of dense blocks.
However, Ioffe, in the analogous environment of training deep neural networks, teaches wherein the batch normalization component is configured to adjust the scale of output data from the dense function component. ([p. 1, Section 1, p. 3, Section 3, Algorithm 1], However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer. Consider a network computing ℓ = F_2(F_1(u, Θ_1), Θ_2) where F_1 and F_2 are arbitrary transformations, and the parameters Θ_1, Θ_2 are to be learned so as to minimize the loss., For a layer with d-dimensional input x = (x^(1) … x^(d)), we will normalize each dimension x̂^(k) = (x^(k) − E[x^(k)]) / √Var[x^(k)] where the expectation and variance are computed over the training data set., wherein batch normalization rescales the input into any layer (output from any preceding layer) of a deep neural network as representable in a generalized learning framework of learning parameters of each layer.)  
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang to incorporate the teachings of Ioffe to use batch normalization to adjust the scale of output data received by any layer of a densely connected convolutional network. The modification would have been obvious because one of ordinary skill would have been motivated to achieve superior learning capability of deep neural networks by normalizing the scale of network activations and thereby reducing the internal covariate shift in the input distributions (Ioffe, [Abstract, p. 4, Section 3]).
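For illustration only, the rescaling described in the passage quoted from Ioffe can be sketched as follows. This is a minimal Python sketch, not code from any cited reference: the function name `batch_normalize`, the sample activations, and the default values of `gamma`, `beta`, and `eps` are assumptions chosen for the example. It normalizes one activation dimension over a mini-batch and then applies a scale and shift.

```python
import math

def batch_normalize(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one activation dimension over a mini-batch, then scale and shift.

    gamma, beta, and eps are illustrative defaults (assumptions for this sketch).
    """
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    # Subtract the batch mean and divide by the batch standard deviation.
    normalized = [(x - mean) / math.sqrt(variance + eps) for x in batch]
    # Scale and shift the normalized values.
    return [gamma * x_hat + beta for x_hat in normalized]

# Hypothetical outputs of a preceding layer for four examples in a mini-batch.
activations = [10.0, 20.0, 30.0, 40.0]
rescaled = batch_normalize(activations)
print(rescaled)
```

With the default `gamma` and `beta`, the returned values have approximately zero mean and unit variance regardless of the scale of the incoming activations, which is the sense in which the scale of the preceding layer's output is adjusted.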
However, Huang and Ioffe do not explicitly teach wherein each of the first layer, … a respective dropout component … connecting a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, in which each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer; and connecting the set of sub-task modules to a combined prediction component that calculates a prediction output. Although Ioffe teaches the implementation of a dropout layer in combination with batch normalization, he does not disclose the communication of the dropout layer across successive layers of a dense neural network. Ioffe does not address multi-tasking frameworks. As previously indicated, Huang does not clearly teach the use of the dropout component in the first layer although he does teach this limitation for all other layers including the output layer as also pointed out above.
However, Jegou, in the analogous environment of designing and implementing dense neural networks, teaches wherein each of the first layer, the second layer, and the third layer comprises a respective dense function component, a respective batch normalization component and a respective dropout component that are sequentially connected … ([pp. 3-4, Section 3.1, Figure 2, p. 4, Section 3.3, Table 1], Thus, the output of the ℓth layer is defined as <equation 3> where [ ... ] represents the concatenation operation. In this case, H is defined as BN, followed by ReLU, a convolution and dropout…. Figure 2 shows an example of dense block construction. Starting from an input x0 (input image or output of a transition down) with m feature maps, the first layer of the block generates an output x1 of dimension k by applying H1(x0). These k feature maps are then stacked to the previous m feature maps by concatenation ([x1, x0]) and used as input to the second layer. The same operation is repeated n times, leading to a new dense block with n × k feature maps., A second layer is then applied to create another k features maps, which are again concatenated to the previous feature maps. The operation is repeated 4 times. First, in Table 1, we define the dense block layer, transition down and transition up of the architecture. 
Dense block layers are composed of BN, followed by ReLU, a 3 × 3 same convolution (no resolution loss) and dropout with probability p = 0.2., wherein in the downsampling path of the dense neural network architecture, feature map outputs are generated from a DenseNet layer that successively processes an input consisting of a concatenation of outputs from previous layers and the input into the first layer (corresponding to the dense function transformation) by successively applying a batch normalization layer and a dropout layer (both of which are incorporated into the function H that characterizes the overall operation of that layer), and wherein the output of that layer, including the output (feature map) of the first layer, is conveyed to all successive layers since it is a part of the concatenation of inputs for each successive layer (equation 3).) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang and Ioffe to incorporate the teachings of Jegou to communicate the output of a first dense neural network layer generated from the application of a dropout operation to each successive dense neural network layer. The modification would have been obvious because one of ordinary skill would have been motivated to achieve superior performance of deep neural networks using a DenseNet architecture in which each layer, including the first, generates its output from a dropout layer and conveys that output to successive layers to encourage reuse of features, with each layer in the architecture receiving a direct supervision signal (Jegou, [Abstract, p. 3, Section 3.1, p. 6, Section 4.3, Table 4]).
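For illustration only, the dense block construction described in the passage quoted from Jegou can be sketched by tracking feature-map counts. This is a minimal Python sketch, not code from any cited reference: the helper `dense_block_channels`, the tuple `LAYER_OPS`, and the values chosen for m, k, and n are assumptions made for the example.

```python
# Operation order within one dense-block layer, as listed in the quoted passage.
LAYER_OPS = ("BN", "ReLU", "Conv3x3", "Dropout(p=0.2)")

def dense_block_channels(m, k, n):
    """Track feature-map counts through a dense block.

    m: feature maps entering the block
    k: feature maps produced by each layer
    n: number of layers in the block
    Returns the input width seen by each layer and the width of the block output.
    """
    inputs_per_layer = []
    stacked = m
    for _ in range(n):
        # Each layer receives the block input plus every earlier layer's output.
        inputs_per_layer.append(stacked)
        # Its k new feature maps are concatenated for all later layers.
        stacked += k
    # The block output stacks the k feature maps of each of the n layers.
    return inputs_per_layer, n * k

# Arbitrary example values (assumptions): m = 16, k = 12, n = 4.
inputs_per_layer, block_output = dense_block_channels(m=16, k=12, n=4)
print(inputs_per_layer, block_output)
```

In this sketch the second, third, and fourth layers each see the k feature maps produced by the first layer, which that layer generated by applying the sequence in `LAYER_OPS` ending in dropout.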
However, Huang, Ioffe, and Jegou do not explicitly teach connecting a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, wherein each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer; and connecting the set of sub-task modules to a combined prediction component that calculates a prediction output.
Although Jegou teaches the concatenation of a sequence of dense blocks with output from one dense block also densely routed to other dense blocks (Figure 1), he does not disclose multitasking operations that process output from, say, a dense block into a set of sub-task modules (with or without dense blocks) each of which generates a result that is then later combined.
However, Li, in the analogous environment of training deep multi-tasking neural networks, teaches connecting a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, wherein each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer; and connecting the set of sub-task modules to a combined prediction component that calculates a prediction output. ([Abstract, p. 2893, Section 2.1, p. 6, Section 4.3, Figure 1, Figure 2], We first train a speaker-independent DNN (SIDNN) acoustic model as a universal speaker model (USM). Based on the USM, a SAT-DNN is used to obtain a set of speaker-dependent models by assuming that all other layers except one speaker-dependent (SD) layer are shared among speakers., The SAT-DNN proposed in [7] can be regarded as a multi-task learning. In NICT-SAT-DNN [7], the DNN architecture is configured as shown in Figure 1. All of the DNN layers are shared among speakers except one SD layer. 
The parameters in the SD layer are updated only for a specific speaker while the parameters for all of the shared layers are updated for all speakers., wherein a joint neural network model framework includes an initial generalized (speaker independent) neural network component that processes input vectors to generate features (e.g., deep neural network features) that are then processed by each neural network in a multi-tasking component of that joint neural network model framework, in which each neural network is directed to a particular sub-task and produces a particular task-specific output (task value) that is then fed into a third component (shared speaker independent component) that combines the output from each of the individual/sub-task neural networks and generates a predicted output at the output layer, such that each of the sub-task neural networks has a deep representation (Figure 2) which is also a dense layer representation because each layer in that neural network is interpreted as having, in general, full connectivity (but also dense in a more general sense for forming a distributed neural representation of the speaker).)  
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang, Ioffe, and Jegou to incorporate the teachings of Li to connect a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, in which each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer, and to connect the set of sub-task modules to a combined prediction component that calculates a prediction output. The modification would have been obvious because one of ordinary skill would have been motivated to achieve superior and efficient recognition performance in a multi-tasking application by modeling each sub-task as a separate deep component in a joint modeling framework with sub-task-specific adaptation (Li, [Abstract, p. 2892, Section 1, Table 2]).
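For illustration only, the arrangement recited in the limitation (sub-task modules connected in parallel to a shared output, each producing a task value from a dense layer, followed by a combined prediction component) can be sketched generically as follows. This is a minimal Python sketch under stated assumptions: it is neither the architecture of Li nor the applicant's implementation, and the names `dense_layer` and `joint_model`, the weights, and the weighted-sum combiner are hypothetical choices made for the example.

```python
def dense_layer(inputs, weights, bias):
    """One fully connected unit: weighted sum of all inputs plus a bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def joint_model(shared_features, subtask_params, combiner_weights):
    """Feed shared features to parallel sub-task modules, then combine.

    shared_features: output of the shared (upstream) network
    subtask_params: one (weights, bias) pair per sub-task module
    combiner_weights: one weight per task value (a hypothetical combiner)
    """
    # Every sub-task module reads the same shared features (parallel connection).
    task_values = [dense_layer(shared_features, w, b) for w, b in subtask_params]
    # The combined prediction component here is a simple weighted sum.
    prediction = sum(c * v for c, v in zip(combiner_weights, task_values))
    return task_values, prediction

# Hypothetical shared features and parameters for two sub-task modules.
shared_features = [1.0, 2.0]
subtask_params = [([0.5, 0.5], 0.0), ([1.0, -1.0], 1.0)]
task_values, prediction = joint_model(shared_features, subtask_params, [2.0, 3.0])
print(task_values, prediction)
```

Each sub-task module receives the same shared features and returns its own task value; only the combiner sees all task values together.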

In regards to claim 11, the rejection of claim 10 is incorporated and Huang further teaches wherein the operations further comprise performing an input selection operation on inputs received at the third layer. ([p. 4, Section 3], Although each layer only produces k output feature-maps, it typically has many more inputs. It has been noted in [36, 11] that a 1×1 convolution can be introduced as bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency. … In our experiments, we let each 1×1 convolution produce 4k feature-maps., wherein each layer (including the third layer) produces k feature maps even though it may have received more than that as aggregated (concatenated) inputs from prior layers, such that an input selection operation is performed to reduce the dimensionality of the feature maps to k feature maps.)   
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang to incorporate the teachings of Ioffe, Jegou, and Li for the same reasons as pointed out for claim 10.
 
In regards to claim 12, the rejection of claim 11 is incorporated and Huang further teaches wherein the input selection operation includes weighting output from one previous layer differently than an output from another previous layer. ([p. 4, Section 3], Although each layer only produces k output feature-maps, it typically has many more inputs. It has been noted in [36, 11] that a 1×1 convolution can be introduced as bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency. … In our experiments, we let each 1×1 convolution produce 4k feature-maps., wherein an input selection of inputs into a layer (for example the third layer) is achieved from the 1×1 convolutional filter dimensionality (bottleneck) reduction operation, which is a weighting operation applied to the concatenation of each of the sets of k feature maps formed from the preceding layers and which reduces the dimensionality of the concatenated feature maps to a fixed number that is the same for each layer, such that the convolution operation (at least comprising the 1×1 convolution function but also the 3×3 convolution function) applies a weighting to the feature maps (outputs) received from each preceding layer, wherein the weighting is different across different received outputs because the effect of applying the convolution filter (kernel weights) differs between two layer outputs that appear successively in the concatenation and two layer outputs that do not appear successively in that concatenation, and wherein the weighting may further be different because the output from a previous layer as recited in claim 12 need not correspond to received “inputs” as recited in claim 11.) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang to incorporate the teachings of Ioffe, Jegou, and Li for the same reasons as pointed out for claim 10.
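For illustration only, the weighting performed by a 1×1 convolution over concatenated feature maps can be sketched as follows. This is a minimal Python sketch, not code from any cited reference: the function name `conv1x1`, the feature-map values, and the weights are hypothetical. At every spatial position, a 1×1 convolution computes a weighted sum across input channels, so a feature map contributed by one preceding layer can carry a different weight than a feature map contributed by another.

```python
def conv1x1(feature_maps, weights):
    """Apply a single-output 1x1 convolution.

    feature_maps: list of channels, each a 2D list of equal shape
    weights: one kernel weight per input channel
    Returns one 2D output map: a per-position weighted sum over channels.
    """
    rows = len(feature_maps[0])
    cols = len(feature_maps[0][0])
    return [
        [sum(w * fm[r][c] for w, fm in zip(weights, feature_maps))
         for c in range(cols)]
        for r in range(rows)
    ]

# Hypothetical 1x2 feature maps from two different preceding layers.
layer1_output = [[1.0, 2.0]]
layer2_output = [[10.0, 20.0]]
concatenated = [layer1_output, layer2_output]

# Hypothetical kernel: layer 1's map is weighted 0.5, layer 2's map 0.25.
reduced = conv1x1(concatenated, weights=[0.5, 0.25])
print(reduced)
```

Two input channels are reduced to one output channel, which is the dimensionality reduction attributed to the bottleneck layer; the per-channel weights in an actual network would be learned rather than fixed as in this sketch.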

In regards to claim 15, Huang teaches  a non-transitory computer-readable medium having stored thereon instructions for establishing a plurality of densely connected neural networks in a joint combined model, that are executable by a computer system to cause the computer system to perform operations comprising: creating each densely connected neural network from the plurality of densely connected neural networks by:
([Abstract, p. 3, Section 3, Figure 2], In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion… Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet., To facilitate down-sampling in our architecture we divide the network into multiple densely connected dense blocks; see Figure 2. We refer to layers between blocks as transition layers, which do convolution and pooling., wherein the densely connected neural network DenseNet is formed using a computer program and wherein a set of DenseNet blocks are connected sequentially as shown in Figure 2, which forms a joint combined model in the sense that each dense block is a distinct model that performs a function complementary to the other models (namely, to accommodate downsizing of feature-maps).) creating a first layer, a second layer, a third layer, and a fourth layer of a neural network, each of the layers comprising one or more neurons a respective dense function component, a respective batch normalization component and a respective dropout component that are sequentially connected; ([p. 3, Section 3, p. 4, Section 3, pp. 5-6, Section 4.2, Figure 1], The network comprises L layers, each of which implements a non-linear transformation H_ℓ(·), where ℓ indexes the layer. H_ℓ(·) can be a composite function of operations such as Batch Normalization (BN) [14], rectified linear units (ReLU) [6], Pooling [19], or Convolution (Conv). … Consequently, the ℓth layer receives the feature-maps of all preceding layers, x_0, …, x_{ℓ−1}, as input: x_ℓ = H_ℓ([x_0, x_1, …, x_{ℓ−1}]) (2), where [x_0, x_1, …, x_{ℓ−1}] refers to the concatenation of the feature-maps produced in layers 0, … 
, ℓ−1., We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ, as DenseNet-B., For the three datasets without data augmentation, i.e., C10, C100 and SVHN, we add a dropout layer [32] after each convolutional layer (except the first one) and set the dropout rate to 0.2., wherein the densely connected neural network DenseNet (Figure 1) includes an input layer as well as first, second, third, and fourth layers and wherein each convolution layer (including layer 1) includes a dense function component in the form of H which expresses the connectivity of that layer to subsequent layers with a bottleneck transformation applied across those inputs, includes a batch normalization component which may also be incorporated into the composite function H, and includes a dropout component after each convolutional layer such as in the DenseNet-B for which the sequential implementation of the dense function operation and the batch normalization operation is indicated semantically by the {BN-ReLU-Conv(1×1)}-{BN}-ReLU-Conv(3×3) architecture with the dropout layer component following the last convolution layer in that architecture for each layer of the dense block shown in Figure 1.)  wherein each dense function component comprises a plurality of neurons, each of which performs a transformative operation on a neuron input to form a respective dense function output, ([p. 
4, Section 3], It has been noted in [36, 11] that a 1×1 convolution can be introduced as bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency. We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ, as DenseNet-B., wherein the function H includes a bottleneck transformation applied to the inputs to a layer prior to a batch normalization layer “BN” for each of the 4 layers of each dense block and wherein the output of this operation is a dense function output because it is a transformation across all inputs received from preceding layers.) each batch normalization component adjusts … the respective dense function output ([p. 4, Section 3], It has been noted in [36, 11] that a 1×1 convolution can be introduced as bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency. We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ, as DenseNet-B., wherein the function H includes a batch normalization component applied after the application of the dense function transformation as indicated above.) and each dropout component selects a certain portion of the adjusted respective dense function output for dropout or replacement;  ([p. 4, Section 3, pp. 
5-6, Section 4.2, Figure 1], We find this design especially effective for DenseNet and we refer to our network with such a bottleneck layer, i.e., to the BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3) version of H_ℓ, as DenseNet-B., For the three datasets without data augmentation, i.e., C10, C100 and SVHN, we add a dropout layer [32] after each convolutional layer (except the first one) and set the dropout rate to 0.2., wherein a dropout layer is applied after each convolution layer which may be interpreted to be after the Conv(3×3) layer of DenseNet-B such that this dropout layer is applied to the adjusted dense function output because its operation occurs after the batch normalization layer indicated above.) establishing a first set of data communication pathways leading from outputs of the neurons of the first layer to respective dense function components of the second layer, the third layer, and the fourth layer; ([p. 2, Section 1, Figure 1], Hence, the ℓth layer has ℓ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all L−ℓ subsequent layers. This introduces L(L+1)/2 connections in an L-layer network, instead of just L, as in traditional architectures., wherein each layer is connected (by a data communication pathway) to all of the preceding layers (i.e., each layer passes its output features to each/respective successive layer and the respective dense function component H of each/respective successive layer) such that the output of the first layer is conveyed to the second, third, and fourth layers.) establishing a second set of data communication pathways leading ([p. 2, Section 1, Figure 1], Hence, the ℓth layer has ℓ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all L−ℓ subsequent layers. 
This introduces L(L+1)/2 connections in an L-layer network, instead of just L, as in traditional architectures., wherein each layer is connected (by a data communication pathway) to all of the preceding layers (i.e., each layer passes its output features to each/respective successive layer and the respective dense function component H of each/respective successive layer) such that the output of the second layer is conveyed to the third and fourth layers.) and establishing a third set of data communication pathways leading ([p. 2, Section 1, Figure 1], Hence, the ℓth layer has ℓ inputs, consisting of the feature-maps of all preceding convolutional blocks. Its own feature-maps are passed on to all L−ℓ subsequent layers. This introduces L(L+1)/2 connections in an L-layer network, instead of just L, as in traditional architectures., wherein each layer is connected (by a data communication pathway) to all of the preceding layers (i.e., each layer passes its output features to each/respective successive layer and the respective dense function component H of each/respective successive layer) such that the output of the third layer is conveyed to the fourth layer.) wherein an input layer of the neural network provides inputs ([Abstract, Figure 1], For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers., wherein each layer is connected (by a data communication pathway) to all of the preceding layers such that the input layer data is conveyed to the first, second, third, and fourth layers.)
However, Huang does not explicitly teach each of the layers comprising … a respective dropout component … a scale of …. connecting a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, wherein each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer; and connecting the set of sub-task modules to a combined prediction component that calculates a prediction output. Although Huang makes use of the commonly used batch normalization technique, he does not disclose that it adjusts a scale of output data. Also, although Huang teaches the application of a dropout function after the last convolutional layer for each layer in a dense block, he does not clearly teach that this is applied in the first layer in a dense block (i.e., the dropout layer is added after all convolutional layers except the first, which may be interpreted as the convolutional layer at the end of the first layer, with the dropout layer being applicable after that convolutional layer in each subsequent dense neural network layer). Although Huang teaches the usage of a sequence of dense blocks each connected by transition layers (Figure 2), such that each of these dense blocks generates a different set of features and thus performs, in a general sense, a distinct (sub-)task, and such that the output from the sequence of dense blocks is processed through a single neural network component (pooling, linear – Figure 2) to generate a prediction output, Huang does not teach a set of parallel sub-task operations that process a set of features such as may be output from the set of dense blocks.
However, Ioffe, in the analogous environment of training deep neural networks teaches wherein the batch normalization component is configured to adjust the scale of output data from the dense function component. ([p. 1, Section 1, p. 3, Section 3, Algorithm 1], However, the notion of covariate shift can be extended beyond the learning system as a whole, to apply to its parts, such as a sub-network or a layer. Consider a network computing $\ell = F_2(F_1(u, \Theta_1), \Theta_2)$ where $F_1$ and $F_2$ are arbitrary transformations, and the parameters $\Theta_1$, $\Theta_2$ are to be learned so as to minimize the loss., For a layer with d-dimensional input $x = (x^{(1)} \ldots x^{(d)})$, we will normalize each dimension $\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$ where the expectation and variance are computed over the training data set., wherein batch normalization rescales the input into any layer (output from any preceding layer) of a deep neural network as representable in a generalized learning framework of learning parameters of each layer.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang to incorporate the teachings of Ioffe to use batch normalization to adjust the scale of output data received by any layer of a densely connected convolutional network. The modification would have been obvious because one of ordinary skill would have been motivated to achieve superior learning capability of deep neural networks by normalizing the scale of network activations, thereby reducing the internal covariate shift on input distributions (Ioffe, [Abstract, p. 4, Section 3]).
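For illustration only, the per-dimension normalization quoted from Ioffe can be sketched as follows. This is a minimal sketch, not Ioffe's implementation: the function name `batch_normalize` is hypothetical, the statistics are taken over a single mini-batch rather than the whole training set, a small constant `eps` is added for numerical stability, and the learned scale and shift parameters are omitted.

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Rescale each dimension of a mini-batch to zero mean and unit variance.

    batch: list of examples, each a list of d floats.
    eps:   stability constant (an assumption of this sketch).
    """
    n = len(batch)
    dims = len(batch[0])
    # Per-dimension mean and (biased) variance over the mini-batch.
    means = [sum(row[k] for row in batch) / n for k in range(dims)]
    variances = [sum((row[k] - means[k]) ** 2 for row in batch) / n
                 for k in range(dims)]
    return [[(row[k] - means[k]) / math.sqrt(variances[k] + eps)
             for k in range(dims)]
            for row in batch]


# Two dimensions on very different scales end up on a common scale.
normalized = batch_normalize([[1.0, 100.0], [3.0, 300.0]])
```

The example shows the scale adjustment relied upon in the rejection: the two input dimensions differ by a factor of 100 before normalization and occupy the same range afterwards.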
However, Huang and Ioffe do not explicitly teach each of the layers comprising … a respective dropout component … connecting a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, wherein each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer; and connecting the set of sub-task modules to a combined prediction component that calculates a prediction output. Although Ioffe teaches the implementation of a dropout layer in combination with batch normalization, he does not disclose the communication of the dropout layer across successive layers of a dense neural network. Ioffe does not address multi-tasking frameworks. As previously indicated, Huang does not clearly teach the use of the dropout component in the first layer although he does teach this limitation for all others including the output layer as also pointed out above.
However, Jegou, in the analogous environment of designing and implementing dense neural networks teaches each of the layers comprising one or more neurons a respective dense function component, a respective batch normalization component and a respective dropout component that are sequentially connected …. ([pp. 3-4, Section 3.1, Figure 2, p. 4, Section 3.3, Table 1], Thus, the output of the ℓ-th layer is defined as <equation 3>  where [ ... ] represents the concatenation operation. In this case, H is defined as BN, followed by ReLU, a convolution and dropout….Figure 2 shows an example of dense block construction. Starting from an input x0 (input image or output of a transition down) with m feature maps, the first layer of the block generates an output x1 of dimension k by applying H1(x0). These k feature maps are then stacked to the previous m feature maps by concatenation ([x1, x0]) and used as input to the second layer. The same operation is repeated n times, leading to a new dense block with n × k feature maps., A second layer is then applied to create another k features maps, which are again concatenated to the previous feature maps. The operation is repeated 4 times. First, in Table 1, we define the dense block layer, transition down and transition up of the architecture. 
Dense block layers are composed of BN, followed by ReLU, a 3 × 3 same convolution (no resolution loss) and dropout with probability p = 0.2., wherein in the downsampling path of the dense neural network architecture, feature map outputs are generated from a DenseNet layer that successively processes an input consisting of a concatenation of outputs from previous layers and the input into the first layer (corresponding to the dense function transformation) by successively applying a batch normalization layer and a dropout layer (both components of which are incorporated into the function H that characterizes the overall operation of that layer), and wherein the output of that layer, including the output (feature map) of the first layer, is conveyed to all successive layers since it is a part of the concatenation of inputs for each successive layer (equation 3).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang and Ioffe to incorporate the teachings of Jegou to communicate the output of a first dense neural network layer generated from the application of a dropout operation to each successive dense neural network layer. The modification would have been obvious because one of ordinary skill would have been motivated to achieve superior performance of deep neural networks using a DenseNets architecture in which each layer, including the first, generates its output from a dropout layer and conveys that to successive layers to encourage reuse of features with each layer in the architecture receiving a direct supervision signal (Jegou, [Abstract, p. 3, Section 3.1, p. 6, Section 4.3, Table 4]).
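For illustration only, the feature-map bookkeeping in the quoted Jegou passage (an input with m feature maps, n layers each producing k feature maps, and concatenation after every layer) can be sketched as follows. The function name `dense_block_channel_counts` and the example values m = 48, k = 16, n = 4 are hypothetical; the layer function H itself (BN, ReLU, convolution, dropout) is treated as opaque.

```python
def dense_block_channel_counts(m, k, n):
    """Track how many feature maps each layer of a dense block receives.

    m: feature maps entering the block
    k: feature maps produced by each layer
    n: number of layers in the block
    Returns (inputs_seen, new_maps): the input width of each layer and the
    number of feature maps newly created by the block.
    """
    inputs_seen = []
    stacked = m
    for _ in range(n):
        inputs_seen.append(stacked)  # layer reads the full concatenation
        stacked += k                 # its k outputs are stacked on top
    return inputs_seen, n * k


inputs_seen, new_maps = dense_block_channel_counts(m=48, k=16, n=4)
```

Because every layer reads the running concatenation, the first layer's output (produced after dropout in Jegou's definition of H) is part of the input to every later layer in the block.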
However, Huang, Ioffe, and Jegou do not explicitly teach connecting a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, wherein each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer; and connecting the set of sub-task modules to a combined prediction component that calculates a prediction output.
Although Jegou teaches the concatenation of a sequence of dense blocks with output from one dense block also densely routed to other dense blocks (Figure 1), he does not disclose multitasking operations that process output from, say, a dense block into a set of sub-task modules (with or without dense blocks) each of which generates a result that is then later combined.
However, Li, in the analogous environment of training deep multi-tasking neural networks, teaches connecting a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, wherein each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer; and connecting the set of sub-task modules to a combined prediction component that calculates a prediction output. ([Abstract, p. 2893, Section 2.1, p. 6, Section 4.3, Figure 1, Figure 2], We first train a speaker-independent DNN (SIDNN) acoustic model as a universal speaker model (USM). Based on the USM, a SAT-DNN is used to obtain a set of speaker-dependent models by assuming that all other layers except one speaker-dependent (SD) layer are shared among speakers., The SAT-DNN proposed in [7] can be regarded as a multi-task learning. In NICT-SAT-DNN [7], the DNN architecture is configured as shown in Figure 1. All of the DNN layers are shared among speakers except one SD layer. 
The parameters in the SD layer are updated only for a specific speaker while the parameters for all of the shared layers are updated for all speakers., wherein a joint neural network model framework includes an initial generalized (speaker independent) neural network component that processes input vectors to generate features (e.g., deep neural network features) that are then processed by each neural network in a multi-tasking component of that joint neural network model framework, in which each neural network is directed to a particular sub-task and produces a particular task-specific output (task value) that is then fed into a third component (shared speaker independent component) that combines the output from each of the individual/sub-task neural networks and generates a predicted output at the output layer, such that each of the sub-task neural networks has a deep representation (Figure 2) which is also a dense layer representation because each layer in that neural network is interpreted as having, in general, full connectivity (but also dense in a more general sense for forming a distributed neural representation of the speaker).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang, Ioffe, and Jegou to incorporate the teachings of Li to connect a set of sub-task modules in parallel to an output of the plurality of densely connected neural networks, in which each sub-task module contains a separate task component that generates a separate task value based on at least one dense layer; and connecting the set of sub-task modules to a combined prediction component that calculates a prediction output. The modification would have been obvious because one of ordinary skill would have been motivated to achieve superior and efficient recognition performance in a multi-tasking application by modeling each sub-task as a separate deep component in a joint modeling framework with sub-task-specific adaptation (Li, [Abstract, p. 2892, Section 1, Table 2]).
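For illustration only, the arrangement discussed above (shared features fanned out in parallel to sub-task modules, each built on a dense layer, whose task values feed a combined prediction component) can be sketched as follows. All names (`dense_layer`, `make_sub_task`, `predict`), the weights, and the use of a plain sum as the combining step are hypothetical choices for this sketch and are not taken from Li or from the claims.

```python
def dense_layer(inputs, weights, bias):
    """Fully connected layer: every output unit sees every input."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def make_sub_task(weights, bias):
    """Build a sub-task module that maps shared features to one task value."""
    def module(shared_features):
        return dense_layer(shared_features, weights, bias)[0]
    return module


def predict(shared_features, sub_task_modules, combine):
    """Feed the same features to every module, then combine the task values."""
    task_values = [module(shared_features) for module in sub_task_modules]
    return combine(task_values), task_values


modules = [
    make_sub_task([[0.5, 0.5]], [0.0]),   # sub-task A
    make_sub_task([[1.0, -1.0]], [1.0]),  # sub-task B
]
prediction, task_values = predict([1.0, 2.0], modules, combine=sum)
```

Each module receives the identical feature vector, which is what makes the connection parallel rather than sequential; the combining function could be replaced by another dense layer without changing the structure.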

In regards to claim 16, the rejection of claim 15 is incorporated and Huang further teaches wherein the operations further comprise optimizing the neural network on an output task using historical data related to the output task. ([p 5, Section 4.1, p. 5, Section 4.2, Table 2], SVHN. The Street View House Numbers (SVHN) dataset [24] contains 32×32 colored digit images. There are 73,257 images in the training set, 26,032 images in the test set, and 531,131 images for additional training…. We select the model with the lowest validation error during training and report the test error., All the networks are trained using stochastic gradient descent (SGD)., wherein the neural network is optimized using stochastic gradient descent to minimize the test error when applied, for example, to an existing (historical) dataset consisting of street view house numbers in an image recognition/classification task.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang to incorporate the teachings of Ioffe, Jegou, and Li for the same reasons as pointed out for claim 15.

In regards to claim 17, the rejection of claim 16 is incorporated and Huang further teaches wherein the optimizing comprises making adjustments to inputs received at the third layer and the fourth layer of the neural network, wherein the adjustments include at least one or more mathematical operations on outputs from the first layer and the second layer of the neural network. ([p 2, Section 1, p. 5, Section 4.2, Table 2, Figure 4], Besides better parameter efficiency, one big advantage of DenseNets is their improved flow of information and gradients throughout the network, which makes them easy to train. Each layer has direct access to the gradients from the loss function and the original input signal, leading to an implicit deep supervision [20]., All the networks are trained using stochastic gradient descent (SGD). On CIFAR and SVHN we train using batch size 64 for 300 and 40 epochs, respectively., wherein the neural network is optimized across all layers using deep supervision through the propagation of gradients computed from the error computed at the output layer (Figure 4 – test error per epoch of training) such that the stochastic gradient descent optimization process adjusts the neurons (weights) across all layers based on adjustments at the output layer in the form of minimization of error at the output layer.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang to incorporate the teachings of Ioffe, Jegou, and Li for the same reasons as pointed out for claim 15.

In regards to claim 21, the rejection of claim 1 is incorporated and Huang further teaches wherein the operations further comprise performing an input selection operation on inputs received at the third layer. ([p. 4, Section 3], Although each layer only produces k output feature-maps, it typically has many more inputs. It has been noted in [36, 11] that a 1×1 convolution can be introduced as bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency. … In our experiments, we let each 1×1 convolution produce 4k feature-maps., wherein each layer (including the third layer) produces k feature maps even though it may have received more than that as aggregated (concatenated) inputs from prior layers such that an input selection operation is performed to reduce the dimensionality of the feature maps (input selection operation) to k feature maps.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang to incorporate the teachings of Ioffe, Jegou, and Li for the same reasons as pointed out for claim 1.
 
In regards to claim 22, the rejection of claim 21 is incorporated and Huang further teaches wherein the input selection operation includes weighting output from one previous layer differently than an output from another previous layer. ([p. 4, Section 3], Although each layer only produces k output feature-maps, it typically has many more inputs. It has been noted in [36, 11] that a 1×1 convolution can be introduced as bottleneck layer before each 3×3 convolution to reduce the number of input feature-maps, and thus to improve computational efficiency. … In our experiments, we let each 1×1 convolution produce 4k feature-maps., wherein an input selection of inputs into a layer (for example the third layer) is achieved from the 1x1 convolutional filter dimensionality (bottleneck) reduction operation which is a weighting operation applied to the concatenation of each of the sets of k feature maps formed from the preceding layers which reduces the dimensionality of the concatenated feature maps to a fixed number that is the same for each layer such that the convolution operation (at least comprising the 1x1 convolution function but also the 3x3 convolution function) applies a weighting to the feature maps (outputs) received from each preceding layer, wherein the weighting is different across different received outputs because the effect of applying the convolution filter (kernel weights) is different between two layer outputs that appear successively in the concatenation than between two layer outputs that do not appear successively in that concatenation, and wherein the weighting may further be different because the output from a previous layer as recited in claim 12 need not correspond to received “inputs” as recited in claim 11.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang to incorporate the teachings of Ioffe, Jegou, and Li for the same reasons as pointed out for claim 1.
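For illustration only, the 1×1 bottleneck convolution discussed for claims 21 and 22 can be sketched as a per-position weighted sum across the concatenated feature maps. This is a minimal sketch with hypothetical names and values (`conv1x1`, three 1×2 input maps, a single output map); it is not Huang's implementation.

```python
def conv1x1(feature_maps, kernel):
    """Apply a 1x1 convolution across channels.

    feature_maps: list of C maps, each a list of rows of floats.
    kernel:       list of O weight rows, each holding C channel weights.
    Returns O output maps; each output pixel is a weighted sum of the
    C input pixels at the same spatial position.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    return [[[sum(w * fm[i][j] for w, fm in zip(row, feature_maps))
              for j in range(width)]
             for i in range(height)]
            for row in kernel]


# Three concatenated input maps (e.g., one from each preceding layer).
inputs = [[[1.0, 2.0]], [[10.0, 20.0]], [[100.0, 200.0]]]
# One output map: weights 1.0, 0.5 and 0.0 for the three inputs.
reduced = conv1x1(inputs, [[1.0, 0.5, 0.0]])
```

In this example three input maps are reduced to one, and each input map contributes with its own weight, so the output of one preceding layer is weighted differently than the output of another.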

Claims 4, 9, 13, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Huang in view of Ioffe, in view of Jegou, in view of Li, and in further view of Wangperawong et al. (“Churn analysis using deep convolutional neural networks and autoencoders”, https://arxiv.org/ftp/arxiv/papers/1604/1604.05377.pdf, arXiv preprint arXiv:1604.05377, 18 April 2016, pp. 1-6), hereinafter referred to as Wangperawong.

In regards to claim 4, the rejection of claim 2 is incorporated, and Huang, Ioffe, Jegou, and Li do not further teach wherein the output task is a predicted customer value. Huang and Jegou teach the application of densenets to CNN-based image classification or interpretation problems not to the prediction of a customer value.
However, Wangperawong, in the analogous environment of using deep convolutional neural networks, teaches  wherein the output task is a predicted customer value  ([Abstract, p. 2, Figure 2, Figure 3], Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer., If the customer registers any activity within these 30 days, we label them with 0 for active/not-churned. In Fig. 2, a green circle demarks this label for the first, top-most customer LTL. If the customer has no activity in this time frame, then we label them as 1 for churned. These are the second and third LTLs in Fig. 2., wherein a deep CNN was designed and applied to the task of predicting customer churn such that the predicted churn in itself is a numeric value associated with a customer but also is a quantification of the valuation of the customer with respect to the product.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang, Ioffe, Jegou, and Li to incorporate the teachings of Wangperawong to apply densely connected convolutional networks to the task of predicting a customer value. The modification would have been obvious because one of ordinary skill would have been motivated to leverage advances in image classification to achieve superior customer value/churn prediction performance (Wangperawong, [p. 1, Table 1]).

In regards to claim 9, the rejection of claim 1 is incorporated and Huang, Ioffe, Jegou, and Li do not further teach further comprising predicting a particular customer value for a particular user over a particular time period using the densely connected neural network to; and storing the particular predicted customer value in a database. Huang and Jegou apply DenseNets to image recognition/classification/interpretation problems. Li applies his multi-tasking framework to speech recognition.
However, Wangperawong, in the analogous environment of using deep convolutional neural networks, teaches further comprising predicting a particular customer value for a particular user over a particular time period using the densely connected neural network to; ([Abstract, p. 2, Figure 2, Figure 3], Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification., We used a 30-day predictor window for our analyses here, but it is conceivable to vary this time frame to yield improved results., wherein a deep CNN was designed and applied to the task of predicting customer churn over a 30-day period such that the predicted churn in itself is a numeric value associated with a customer but also is a quantification of the valuation of the customer with respect to the product.) and storing the particular predicted customer value in a database. ([p. 3, Table 1], Training and testing this architecture end-to-end yields results superior to that of a CHAID decision tree model when judging by the area-under-the-curve (AUC) benchmark (Table 1). The AUC of a receiver operating curve is a commonly accepted benchmark for comparing models; it accounts for both true and false positives [5,6]., wherein the predicted results are retained (stored in a dataset) for performing a ROC analysis on the results.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang, Ioffe, Jegou, and Li to incorporate the teachings of Wangperawong to apply densely connected convolutional networks to the task of predicting a customer value over a time frame and storing the results of that prediction. The modification would have been obvious because one of ordinary skill would have been motivated to leverage advances in image classification to achieve superior customer value/churn prediction performance over a window of time as measured according to a ROC curve analysis (Wangperawong, [p. 1, Table 1]).

In regards to claim 13, the rejection of claim 10 is incorporated, and Huang, Ioffe, Jegou, and Li do not further teach wherein the output task is a predicted customer value. Huang and Jegou teach the application of densenets to CNN-based image classification/interpretation problems not to the prediction of a customer value.
However, Wangperawong, in the analogous environment of using deep convolutional neural networks, teaches  wherein the output task is a predicted customer value  ([Abstract, p. 2, Figure 2, Figure 3], Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer., If the customer registers any activity within these 30 days, we label them with 0 for active/not-churned. In Fig. 2, a green circle demarks this label for the first, top-most customer LTL. If the customer has no activity in this time frame, then we label them as 1 for churned. These are the second and third LTLs in Fig. 2., wherein a deep CNN was designed and applied to the task of predicting customer churn such that the predicted churn in itself is a numeric value associated with a customer but also is a quantification of the valuation of the customer with respect to the product.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang, Ioffe, Jegou, and Li to incorporate the teachings of Wangperawong to apply densely connected convolutional networks to the task of predicting a customer value. The modification would have been obvious because one of ordinary skill would have been motivated to leverage advances in image classification to achieve superior customer value/churn prediction performance (Wangperawong, [p. 1, Table 1]).

In regards to claim 18, the rejection of claim 16 is incorporated, and Huang, Ioffe, Jegou, and Li do not further teach wherein the output task is predicted customer value. Huang and Jegou teach the application of densenets to CNN-based image classification/interpretation problems not to the prediction of a customer value.
However, Wangperawong, in the analogous environment of using deep convolutional neural networks, teaches  wherein the output task is predicted customer value  ([Abstract, p. 2, Figure 2, Figure 3], Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification. Supervised learning was performed on labeled data of over 6 million customers using deep convolutional neural networks, which achieved an AUC of 0.743 on the test dataset using no more than 12 temporal features for each customer., If the customer registers any activity within these 30 days, we label them with 0 for active/not-churned. In Fig. 2, a green circle demarks this label for the first, top-most customer LTL. If the customer has no activity in this time frame, then we label them as 1 for churned. These are the second and third LTLs in Fig. 2., wherein a deep CNN was designed and applied to the task of predicting customer churn such that the predicted churn in itself is a numeric value associated with a customer but also is a quantification of the valuation of the customer with respect to the product.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang, Ioffe, Jegou, and Li to incorporate the teachings of Wangperawong to apply densely connected convolutional networks to the task of predicting a customer value. The modification would have been obvious because one of ordinary skill would have been motivated to leverage advances in image classification to achieve superior customer value/churn prediction performance (Wangperawong, [p. 1, Table 1]).

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Huang in view of Ioffe, in view of Jegou, in view of Li, in view of Wangperawong, and in further view of Fu et al. (“Credit card fraud detection using convolutional neural networks.” International Conference on Neural Information Processing, Springer, Cham, 2016, pp. 483-490), hereinafter referred to as Fu.

In regards to claim 14, the rejection of claim 13 is incorporated and Huang, Ioffe, Jegou, and Li do not further teach wherein the operations further comprise calculating predicted future customer value for a particular period of time using the neural network for a plurality of users of an electronic payment transaction service. Huang and Jegou apply DenseNets to an image recognition/classification problem. Li applies a multi-tasking framework to speech recognition.
However, Wangperawong, in the analogous environment of using deep convolutional neural networks, teaches wherein the operations further comprise calculating predicted future customer value for a particular period of time using the neural network for a plurality of users of an electronic … service; ([Abstract, p. 1, p. 2, Figure 2, Figure 3], In order to leverage such advances to predict churn and take pro-active measures to prevent it, we represent customers as images. Specifically, we construct a 2-dimensional array of normalized pixels where each row is for each day and each column is for each type of behavior tracked (Fig. 1). The type of behavior can include data usage, top up amount, top up frequency, voice calls, voice minutes, SMS messages, etc., Customer temporal behavioral data was represented as images in order to perform churn prediction by leveraging deep learning architectures prominent in image classification., We used a 30-day predictor window for our analyses here, but it is conceivable to vary this time frame to yield improved results., wherein a deep CNN was designed and applied to the task of predicting customer churn over a 30-day period such that the predicted churn in itself is a numeric value associated with a customer but also is a quantification of the valuation of the customer with respect to the product and wherein the product is related to an electronic service involving, for example, data usage behavior.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang, Ioffe, Jegou, and Li to incorporate the teachings of Wangperawong to apply densely connected convolutional networks to the task of predicting a customer value. The modification would have been obvious because one of ordinary skill would have been motivated to leverage advances in image classification to achieve superior customer value/churn prediction performance (Wangperawong, [p. 1, Table 1]).
However, Huang, Ioffe, Jegou, Li, and Wangperawong do not explicitly teach electronic payment transaction service. In other words, none of Huang, Ioffe, Jegou, Li, or Wangperawong discloses an application involving payment transaction customer behavior.
However, Fu, in the analogous environment of using convolutional neural networks, teaches   wherein the operations further comprise calculating predicted future customer value for a particular period of time using the neural network for a plurality of users of an electronic payment transaction service; ([Abstract, p. 486, Section 2.4, Figure 2, Figure 4], Experiments on real-world massive transactions of a major commercial bank demonstrate its superior performance compared with some state-of-the-art methods., The method of feature transformations is proposed to adapt the CNN model. The features of credit card transactions can be partitioned into several groups. And each group has different features by different time windows., wherein a CNN is applied to predict fraud given each of a plurality of user credit card transaction histories such that the electronic payment transaction service is the credit card system of commercial banking and wherein it is also noted that Fu, like Wangperawong, teaches a prediction based on a time window.) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang, Ioffe, Jegou, Li, and Wangperawong to incorporate the teachings of Fu to apply densely connected convolutional networks to a predictive task for an electronic payment transaction service. The modification would have been obvious because one of ordinary skill would have been motivated to achieve superior fraud prediction performance through the use of a convolutional neural network (Fu, [p. 489, Section 3.3, Figure 6]).

Claims 19, 20, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Huang in view of Ioffe, in view of Jegou, in view of Li, and in further view of Fu.

In regards to claim 19, the rejection of claim 15 is incorporated and Huang, Ioffe, Jegou, and Li do not further teach wherein the operations further comprise approving or denying an electronic payment transaction based on a value predicted by the neural network for a user of an electronic payment transaction service provider. Huang and Jegou apply DenseNets to image recognition/classification/interpretation problems. Li applies a multi-tasking framework to speech recognition.
However, Fu, in the analogous environment of using convolutional neural networks, teaches   wherein the operations further comprise approving or denying an electronic payment transaction based on a value predicted by the neural network for a user of an electronic payment transaction service provider; ([Abstract, p. 486, Section 2.4, Figure 1, Figure 2, Figure 4], Experiments on real-world massive transactions of a major commercial bank demonstrate its superior performance compared with some state-of-the-art methods., The method of feature transformations is proposed to adapt the CNN model. The features of credit card transactions can be partitioned into several groups. And each group has different features by different time windows., wherein a CNN is applied to predict fraud given each of a plurality of user credit card transaction histories such that the electronic payment transaction service is the credit card system of commercial banking and wherein the output of the classification process (Figure 1) is the declaration (value predicted) that a transaction is fraudulent (= denying that an electronic payment transaction is OK) or legitimate (=approving that an electronic payment transaction is OK).) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang, Ioffe, Jegou, and Li to incorporate the teachings of Fu to apply densely connected convolutional networks to the prediction of fraudulent vs. legitimate user transactions for an electronic payment transaction service. The modification would have been obvious because one of ordinary skill would have been motivated to achieve superior fraud prediction performance through the use of a convolutional neural network (Fu, [p. 489, Section 3.3, Figure 6]).

In regards to claim 20, the rejection of claim 15 is incorporated, and Huang, Ioffe, Jegou, and Li do not further teach wherein the operations further comprise predicting, using the neural network, a quantity for a plurality of users of an electronic payment transaction service based on profile information for the plurality of users.
However, Fu, in the analogous environment of using convolutional neural networks, teaches wherein the operations further comprise predicting, using the neural network, a quantity for a plurality of users of an electronic payment transaction service based on profile information for the plurality of users ([Abstract, p. 485, Section 2.2, p. 486, Section 2.4, Table 1, Figure 1, Figure 3], "For traditional features, we can define the average amount of the transactions with the same customer during the past period of time as AvgAmountT. T means the time window length."; "Experiments on real-world massive transactions of a major commercial bank demonstrate its superior performance compared with some state-of-the-art methods."; "The method of feature transformations is proposed to adapt the CNN model. The features of credit card transactions can be partitioned into several groups. And each group has different features by different time windows."; wherein a CNN is applied to predict fraud given a set of features associated with each of a plurality of user credit card transaction histories, which include, for example, the average amount of transactions with that user/customer during a time period (profile information), wherein the electronic payment transaction service is the credit card system of commercial banking, and wherein the predicted output of the classification process (Figure 1) is the declaration (value/quantity predicted) that a transaction is fraudulent or legitimate based upon the transaction history (profile information; see also Table 1) for the customers).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang, Ioffe, Jegou, and Li to incorporate the teachings of Fu to apply densely connected convolutional networks to the prediction of fraudulent user transactions for an electronic payment transaction service based on profile information for the users. The modification would have been obvious because one of ordinary skill would have been motivated to achieve superior fraud prediction performance through the use of a convolutional neural network applied to pertinent transaction profile features (Fu, [p. 489, Section 3.3, Figure 6]).
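For illustration only, the AvgAmountT feature quoted from Fu above (the average amount of a customer's transactions over a past time window of length T) might be computed as in the following minimal sketch. The function name, the (customer, timestamp, amount) tuple layout, the half-open window boundary, and the 0.0 default for an empty window are assumptions made for this example and are not taken from Fu.

```python
from datetime import datetime, timedelta

def avg_amount_t(transactions, customer_id, now, window):
    """Average amount of one customer's transactions in [now - window, now).

    transactions: iterable of (customer_id, timestamp, amount) tuples
                  (hypothetical layout, assumed for this sketch).
    window:       timedelta giving the time window length T.
    Returns 0.0 when the customer has no transactions in the window
    (an assumed convention; Fu does not specify this case).
    """
    start = now - window
    amounts = [
        amount
        for cid, ts, amount in transactions
        if cid == customer_id and start <= ts < now
    ]
    return sum(amounts) / len(amounts) if amounts else 0.0

# Hypothetical transaction history for two customers.
txns = [
    ("c1", datetime(2021, 1, 1), 100.0),
    ("c1", datetime(2021, 1, 5), 50.0),
    ("c2", datetime(2021, 1, 5), 900.0),
    ("c1", datetime(2020, 12, 1), 10.0),
]
now = datetime(2021, 1, 8)
week_avg = avg_amount_t(txns, "c1", now, timedelta(days=7))
day_avg = avg_amount_t(txns, "c1", now, timedelta(days=1))
```

Computing the same feature for several window lengths yields one row of values per window, which is consistent with the quoted statement that each feature group has different features by different time windows.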

In regards to claim 23, the rejection of claim 1 is incorporated, and Huang, Ioffe, Jegou, and Li do not further teach wherein the operations further comprise predicting, using the neural network, a quantity for a plurality of users of an electronic payment transaction service based on profile information for the plurality of users.
However, Fu, in the analogous environment of using convolutional neural networks, teaches wherein the operations further comprise predicting, using the neural network, a quantity for a plurality of users of an electronic payment transaction service based on profile information for the plurality of users ([Abstract, p. 485, Section 2.2, p. 486, Section 2.4, Table 1, Figure 1, Figure 3], "For traditional features, we can define the average amount of the transactions with the same customer during the past period of time as AvgAmountT. T means the time window length."; "Experiments on real-world massive transactions of a major commercial bank demonstrate its superior performance compared with some state-of-the-art methods."; "The method of feature transformations is proposed to adapt the CNN model. The features of credit card transactions can be partitioned into several groups. And each group has different features by different time windows."; wherein a CNN is applied to predict fraud given a set of features associated with each of a plurality of user credit card transaction histories, which include, for example, the average amount of transactions with that user/customer during a time period (profile information), wherein the electronic payment transaction service is the credit card system of commercial banking, and wherein the predicted output of the classification process (Figure 1) is the declaration (value/quantity predicted) that a transaction is fraudulent or legitimate based upon the transaction history (profile information; see also Table 1) for the customers).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Huang, Ioffe, Jegou, and Li to incorporate the teachings of Fu to apply densely connected convolutional networks to the prediction of fraudulent user transactions for an electronic payment transaction service based on profile information for the users. The modification would have been obvious because one of ordinary skill would have been motivated to achieve superior fraud prediction performance through the use of a convolutional neural network applied to pertinent transaction profile features (Fu, [p. 489, Section 3.3, Figure 6]).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Sebastian Ruder (“An Overview of Multi-Task Learning in Deep Neural Networks”, arXiv:1706.05098v1 [cs.LG], 15 June 2017, pp. 1-14) teaches a review of various deep neural network multi-tasking architecture topologies including those with constrained sub-task components with parameter sharing (Figure 2), those with a deep neural network for generating features that are then processed through distinct sub-task modules that generate a sub-task output (Figure 3), and various stitched or sluiced networks.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT LEWIS KULP whose telephone number is (571)272-7983. The examiner can normally be reached M, Th, F 8-5:30; Tu 8-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ROBERT LEWIS KULP/Examiner, Art Unit 2124
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124