DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 03/23/2022 has been entered.
Response to Arguments
Applicant’s arguments with respect to claim(s) 1, 20, 24, and 28 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-8, 13-23 and 28-32 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (US 20170147905 A1, hereinafter Huang) in view of Kokkinos (“Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory”).
Regarding claim 1,
Huang discloses a system for training a multitask network comprising (Huang fig. 5A and 8 & [0109] recites “FIG. 8 depicts a simplified block diagram of a computing system comprising an FCN to perform end-to-end multi-task object detection, according to various embodiments of the present invention.” Computing system comprising a Fully Convolutional Neural Network (FCN) with methods depicted in Fig. 5A (i.e. system for training a multitask network)): 
non-transitory memory configured to store: 
executable instructions (Huang [0112] recites “Embodiments of the present invention may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed. It shall be noted that the one or more non-transitory computer-readable media shall include volatile and non-volatile memory.” Non-transitory computer-readable media including memory with instructions (i.e. non-transitory memory storing executable instructions)), and 
a multitask network for determining outputs associated with a plurality of tasks (Huang [0109] recites “FIG. 8 depicts a simplified block diagram of a computing system comprising an FCN to perform end-to-end multi-task object detection, according to various embodiments of the present invention.” FCN to perform end-to-end multi-task object detection (i.e. multitask network for determining outputs from tasks)); and 
a hardware processor in communication with the non-transitory memory, the hardware processor programmed by the executable instructions to (Huang [0112] recites, in part, “…one or more non-transitory computer-readable media shall include volatile and non-volatile memory… alternative implementations are possible, including a hardware implementation or a software/hardware implementation. Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like... With these implementation alternatives in mind, it is to be understood that the figures and accompanying description provide the functional information one skilled in the art would require to write program code (i.e., software) and/or to fabricate circuits (i.e., hardware) to perform the processing required.” Hardware-implemented functions realized using ASIC(s), programmable arrays, DSP circuitry, or the like (i.e. hardware processor programmed by instructions) with non-transitory CRM including memory (i.e. in communication with non-transitory memory)): 
receive a training image associated with a plurality of reference task outputs for the plurality of tasks (Huang [0063] recites “In training, an input patch may be considered a "positive patch" if it comprises an object centered in the center at a specific scale. These patches comprise mainly negative samples around the positive samples. In embodiments, in order to fully explore the negative samples in the whole dataset, patches are cropped at random scale from training images, and resized to the same size and fed to the network.” Patches from training images (i.e. receive training image) with negative and positive samples (i.e. reference task outputs)); 
for each task of the plurality of tasks and during training time, 
determine a gradient norm of a single-task loss of (1) a task output for the task determined using the multitask network with the training image as input, and (2) a corresponding reference task output for the task associated with the training image, adjusted by a task weight for the task, with respect to a plurality of network weights of the multitask network (Huang [0058] and [0063] recites, in part, “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction. [0063] In training, an input patch may be considered a “positive patch” if it comprises an object centered in the center at a specific scale. These patches comprise mainly negative samples around the positive samples. In embodiments, in order to fully explore the negative samples in the whole dataset, patches are cropped at random scale from training images, and resized to the same size and fed to the network...The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning.” Gradient descent learning on samples using loss and output gradients (i.e. gradient norm of tasks loss of a task output), positive and negative samples (i.e. reference tasks), and scaled by number of contributing pixels (i.e. reference task output adjusted by task weights)); and 
determine a relative training rate for the task based on the single-task loss for the task (Huang [0063] recites “The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations.” Learning rate (i.e. training rate) and scaling loss and gradients comparable in multi-task (i.e. based on the single-task loss)); 
determine a gradient loss function comprising a difference between (1) the determined gradient norm for each task and (2) a corresponding target gradient norm determined based on (a) an average gradient norm of the plurality of tasks, (b) the relative training rate for the task, and (c) a hyperparameter (Huang [0054], [0058], [0062] and [0063] recite, in part, “a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. [0058] After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction. [0062] …that combining may be performed in any other manner. In embodiments, the balance between classification and regression tasks is controlled by the parameter λloc. For example, the regression target d* may be normalized by dividing by the standard object height… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations.” Loss function (i.e. gradient loss function), gradient descent learning on samples (i.e. a difference between gradient norm for each task and target gradient norm), normalized by dividing by the standard object height (i.e. average gradient norm), learning rate (i.e. training rate), and parameter λloc (i.e. hyperparameter)); 
determine a gradient of the gradient loss function with respect to a task weight for each task of the plurality of tasks (Huang [0054], [0060] and [0063] recite, in part, “a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch outputs the bounding box regression loss... [0060] …and combining the classification loss (Eq. 1) and bounding box regression loss (Eq. 2) with the masks… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Combined loss functions (i.e. gradient of the gradient loss function)); and 
determine an updated task weight for each of the plurality of tasks using the gradient of the gradient loss function with respect to the task weight (Huang [0063] recites “The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used” Momentum term weight (i.e. updated task weight)).
Huang does not explicitly disclose
determine a relative training rate for the task based on the single-task loss for the task so each of the plurality of tasks are trained at a similar training rate, wherein the relative training rate for the task is associated with the task weight that is configured at each step such that each step in the plurality of tasks completes its training over a similar length of training time; 
determine an updated task weight for each of the plurality of tasks using the gradient of the gradient loss function with respect to the task weight, wherein the updated task weights are an improvement over the task weights such that the updated task weights result in each step in the plurality of tasks completing over a more similar length of training time as compared to using the task weights.
	However, Kokkinos teaches
determine a relative training rate for the task based on the single-task loss for the task so each of the plurality of tasks are trained at a similar training rate (pg. 5456; “Our training objective is the sum of per-task losses and regularization terms applied to task-specific, as well as shared layers: $L(w_{0,1,\ldots,T}) = R(w_0) + \sum_{t=1}^{T} \gamma_t \left( R(w_t) + L_t(w_0, w_t) \right)$ (1), where t indexes tasks, $w_0$ denotes shared CNN weights, $w_t$ are task-specific weights, $\gamma_t$ determines the relative importance of task t, $R(w_*) = \frac{\lambda}{2} \lVert w_* \rVert^2$ is an $\ell_2$ regularization, and $L_t(w_0, w_t)$ is the task-specific loss:” A loss is calculated for each task based on the task-specific weights ($w_t$) and the relative importance of the task ($\gamma_t$), which determines the training rate.), wherein the relative training rate for the task is associated with the task weight that is configured at each step (pg. 5456; “In Eq. 2 we use i to index training samples, $L_t$ for the resulting task-specific loss between the network prediction $f_t^i$ and ground truth $y_t^i$ for the i-th example, $w_t$ to indicate the task-specific network parameters, and $\delta_{t,i} \in \{0, 1\}$ to indicate whether example i has ground truth for task t.”) such that each step in the plurality of tasks completes its training over a similar length of training time (pg. 5454; “We obtain competitive performance while jointly addressing all tasks in 0.7 seconds on a GPU. Our system will be made publicly available.” And pg. 5461, section 6; “We have shown that one can effectively scale up to many and diverse tasks, since the memory complexity is independent of the number of tasks, and incoherently annotated datasets can be combined during training. This has allowed us to train a single network that can solve multiple tasks in a fraction of a second with competitive performance.” Tasks all finish at around 0.7 seconds (i.e. similar training rate).); 
determine an updated task weight for each of the plurality of tasks using the gradient of the gradient loss function with respect to the task weight (pg. 5456, section 3; “where the weight decay term results from $\ell_2$ regularization and $\nabla_{w_*} L_t(\hat{y}, y)$ denotes the gradient of the loss for task t with respect to the parameter vector $w_*$. The difference between the two update terms is that the common, trunk parameters, $w_0$ affect all tasks, and as such accumulate the gradients over all tasks.”), wherein the updated task weights are an improvement over the task weights such that the updated task weights result in each step in the plurality of tasks completing over a more similar length of training time as compared to using the task weights (pg. 5459; “The performance of our network on the set of tasks it adresses depends on the weights assigned to the different task losses in Eq. 1. A large weight for one task can skew the network’s internal representation in favor of the particular task and neglect the rest.” Tasks are balanced based on updated weights. The weight can be increased for important tasks, but doing so causes the performance of the other tasks to suffer. Pg. 5461; “We have shown that one can effectively scale up to many and diverse tasks, since the memory complexity is independent of the number of tasks, and incoherently annotated datasets can be combined during training. This has allowed us to train a single network that can solve multiple tasks in a fraction of a second with competitive performance.”).
	Huang and Kokkinos are analogous arts because they are both directed to the field of computing losses for multitask convolutional neural networks.
	It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the convolutional neural network of Huang with the method of training diverse tasks of Kokkinos.
	Doing so would allow for increasing the number of tasks while keeping memory complexity low. This allows for scaling tasks while addressing the memory demands of back propagation of the tasks (Kokkinos pg. 5455, col. 1).
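For illustration only, the task-weight update recited in claim 1 (together with the L1 gradient loss of claim 7, the average gradient norm of claim 8, the constant target of claim 13, and the normalization of claims 14-15) can be sketched as below. This sketch is not taken from Huang or Kokkinos. The function name, its parameters, and the definition of the relative training rate as a loss ratio divided by the mean loss ratio are assumptions made for the example, and the per-task gradient norms are supplied as plain numbers rather than computed from a network.

```python
from math import fsum


def gradnorm_task_weight_step(task_weights, grad_norms, losses, initial_losses,
                              alpha=1.5, step_size=0.1):
    """One illustrative task-weight update. All names here are hypothetical."""
    n = len(task_weights)

    # Gradient norm of each weighted single-task loss: G_i = w_i * ||grad L_i||.
    weighted_norms = [w * g for w, g in zip(task_weights, grad_norms)]
    average_norm = fsum(weighted_norms) / n

    # Assumed relative training rate: loss ratio L_i / L_i(0) over its mean.
    ratios = [cur / init for cur, init in zip(losses, initial_losses)]
    mean_ratio = fsum(ratios) / n
    rates = [r / mean_ratio for r in ratios]

    # Target gradient norm: average norm scaled by rate ** alpha (hyperparameter).
    targets = [average_norm * r ** alpha for r in rates]

    # L1 gradient loss between each actual norm and its target norm.
    gradient_loss = fsum(abs(g - t) for g, t in zip(weighted_norms, targets))

    # Gradient w.r.t. each task weight with the target held constant:
    # d|w_i * g_i - target_i| / dw_i = sign(w_i * g_i - target_i) * g_i.
    def sign(x):
        return (x > 0) - (x < 0)

    weight_grads = [sign(wn - t) * g
                    for wn, t, g in zip(weighted_norms, targets, grad_norms)]
    updated = [w - step_size * d for w, d in zip(task_weights, weight_grads)]

    # Renormalize so the weights sum to the number of tasks
    # (assumes the updated weights remain positive).
    total = fsum(updated)
    normalized = [n * u / total for u in updated]
    return normalized, gradient_loss


new_weights, grad_loss = gradnorm_task_weight_step(
    [1.0, 1.0], [2.0, 1.0], [0.5, 0.5], [1.0, 1.0])
```

In this example the first task has the larger gradient norm, so its weight is reduced and the second task's weight is raised, while the weights continue to sum to the number of tasks.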

Regarding claim 2,
Huang and Kokkinos disclose the system of claim 1, wherein the hardware processor is further programmed by the executable instructions to: 
determine the single-task loss of (1) the task output for each task determined using the multitask network with the training image as input, and (2) the corresponding task output for the task associated with the training image (Huang [0032]-[0033] and [0053] recites, in part, “FIG. 1 illustrates an exemplary object detection pipeline for a convolutional network…100 receives input image 112 or image pyramid that is fed to network 104. After several layers of convolution and pooling, feature map 106 is upsampled and convolution layers are applied to obtain final output 108. [0053] In embodiments, independent branch 360 outputs the confidence score ŷ (per pixel in the output map) of being a target object. Given the ground truth label y*∈ {0,1}, the classification loss can be defined…” Input image (i.e. training image as input)).  

Regarding claim 3,
Huang and Kokkinos disclose the system of claim 2, wherein the non-transitory memory is configured to further store: a plurality of loss functions associated with the plurality of tasks (Huang [0053]-[0054] recite, in part, “In embodiments, like Fast R-CNN, the network has two sibling output branches. In embodiments, independent branch 360 outputs the confidence score ŷ (per pixel in the output map) of being a target object. Given the ground truth label y*∈ {0,1}, the classification loss can be defined as follows: $L_{cls}(\hat{y}, y^*) = \lVert \hat{y} - y^* \rVert^2$ (Eq. 1) [0054] In embodiments, a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss, denoted as $L_{loc}$, for example, to minimize the L2 loss between predicted location offsets $\hat{d} = (\hat{d}_{tx}, \hat{d}_{ty}, \hat{d}_{bx}, \hat{d}_{by})$ and targets $d^* = (d^*_{tx}, d^*_{ty}, d^*_{bx}, d^*_{by})$, as formulized by: $L_{loc}(\hat{d}, d^*) = \sum_{i \in \{tx, ty, bx, by\}} \lVert \hat{d}_i - d^*_i \rVert^2$ (Eq. 2)”. Classification loss, regression loss, L2 loss (i.e. plurality of loss functions)).
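As a hedged illustration of the two loss functions quoted from Huang [0053]-[0054], the sketch below evaluates Eq. 1 and Eq. 2 for a single output pixel. The function names are hypothetical, and combining the two losses as a weighted sum controlled by λloc (set to 3, the value quoted from Huang [0062]) is an assumption made for the example, since the cited passages do not reproduce Huang's combined loss.

```python
def classification_loss(y_hat, y_star):
    """Eq. 1 for one pixel: squared difference between score and label."""
    return (y_hat - y_star) ** 2


def localization_loss(d_hat, d_star):
    """Eq. 2: sum of squared offset errors over (tx, ty, bx, by)."""
    return sum((p - t) ** 2 for p, t in zip(d_hat, d_star))


def combined_loss(y_hat, y_star, d_hat, d_star, lambda_loc=3.0):
    """Assumed combination: classification loss plus lambda_loc times Eq. 2."""
    return (classification_loss(y_hat, y_star)
            + lambda_loc * localization_loss(d_hat, d_star))


# Example: confidence 0.8 for a positive pixel, one offset off by 2.
example = combined_loss(0.8, 1.0, (1.0, 2.0, 3.0, 4.0), (1.0, 2.0, 3.0, 6.0))
```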

Regarding claim 4,
Huang and Kokkinos disclose the system of claim 3, wherein to determine the single-task loss, the hardware processor is further programmed by the executable instructions to: 
determine the single-task loss of (1) the task output for each task determined using the multitask network with the training image as input, and (2) the corresponding task output for the task associated with the training image, using a loss function of the plurality of loss functions associated with the task (Huang [0032]-[0033] and [0053]-[0054] recites, in part, “FIG. 1 illustrates an exemplary object detection pipeline for a convolutional network…100 receives input image 112 or image pyramid that is fed to network 104. After several layers of convolution and pooling, feature map 106 is upsampled and convolution layers are applied to obtain final output 108. [0053] In embodiments, independent branch 360 outputs the confidence score ŷ (per pixel in the output map) of being a target object. Given the ground truth label y*∈ {0,1}, the classification loss can be defined… [0054] In embodiments, a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss, denoted as $L_{loc}$, for example, to minimize the L2 loss between predicted location offsets $\hat{d} = (\hat{d}_{tx}, \hat{d}_{ty}, \hat{d}_{bx}, \hat{d}_{by})$ and targets $d^* = (d^*_{tx}, d^*_{ty}, d^*_{bx}, d^*_{by})$, as formulized by: $L_{loc}(\hat{d}, d^*) = \sum_{i \in \{tx, ty, bx, by\}} \lVert \hat{d}_i - d^*_i \rVert^2$ (Eq. 2)”).

Regarding claim 5,
Huang and Kokkinos disclose the system of claim 1, wherein the hardware processor is further programmed by the executable instructions to: 
determine a multitask loss function comprising the single-task loss adjusted by the task weight for each task (Huang [0033] and [0053] recites, in part, “a single convolutional network simultaneously outputs multiple predicted bounding boxes 120 and class confidences. In embodiments, except for a non-maximum suppression (NMS) step, components of object detection are modeled as an FCN, such that it becomes unnecessary to engage in region proposal generation. [0053] …independent branch 360 outputs the confidence score ŷ (per pixel in the output map) of being a target object. Given the ground truth label y*∈ {0,1}, the classification loss can be defined as follows: $L_{cls}(\hat{y}, y^*) = \lVert \hat{y} - y^* \rVert^2$ (Eq. 1).” Multiple predicted bounding boxes and class confidences each with classification loss (i.e. multitask loss function)); 
determine a gradient of the multitask loss function with respect to all network weights of the multitask network (Huang [0058] and [0063] recite, in part, “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction… In embodiments, in the forward propagation phase, the classification loss (Eq. 1) of output pixels is sorted in descending order, and the top 1% are assigned as hard-negative. [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Loss and output gradients (i.e. a gradient of the multitask loss function)); and 
determine updated network weights of the multitask network based on the gradient of the multitask loss function (Huang [0057] and [0063] recite, in part, “… the loss weight is set to 0, and for each pixel labeled non-positive in the output coordinate space, an ignore flag fign is set to 1 if a pixel with positive label within rnear=2 pixel length exists. [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Momentum term weight (i.e. updated network weights)).
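As a rough sketch of the limitations of claim 5, the example below forms a multitask loss as a weighted sum of single-task losses and applies one network-weight update using the training settings quoted from Huang [0063] (learning rate starting at 0.001 and reduced by a factor of 10 every 100,000 iterations, momentum 0.9, weight decay 0.0005). The function names and the particular momentum update form are assumptions made for the example, and the gradients are supplied as plain numbers rather than obtained by backpropagation.

```python
def multitask_loss(task_weights, task_losses):
    """Sum of single-task losses, each adjusted by its task weight."""
    return sum(w * loss for w, loss in zip(task_weights, task_losses))


def learning_rate(iteration, base=0.001, drop_every=100_000):
    """Step schedule: divide the base rate by 10 every drop_every iterations."""
    return base * 0.1 ** (iteration // drop_every)


def sgd_momentum_step(params, grads, velocity, lr,
                      momentum=0.9, weight_decay=0.0005):
    """Assumed update form: v = m*v - lr*(g + wd*p), then p = p + v."""
    new_velocity = [momentum * v - lr * (g + weight_decay * p)
                    for p, g, v in zip(params, grads, velocity)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Example: two tasks, one network weight, first training iteration.
total_loss = multitask_loss([1.0, 3.0], [0.5, 0.25])
params, velocity = sgd_momentum_step([1.0], [2.0], [0.0], learning_rate(0))
```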


Regarding claim 6,
Huang and Kokkinos disclose the system of claim 1, wherein the gradient norm of the single-task loss adjusted by the task weight is a L2 norm of the single-task loss adjusted by the task weight (Huang [0053]-[0054] recite, in part, “In embodiments, like Fast R-CNN, the network has two sibling output branches. In embodiments, independent branch 360 outputs the confidence score ŷ (per pixel in the output map) of being a target object. Given the ground truth label y*∈ {0,1}, the classification loss can be defined as follows: $L_{cls}(\hat{y}, y^*) = \lVert \hat{y} - y^* \rVert^2$ (Eq. 1) [0054] In embodiments, a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss, denoted as $L_{loc}$, for example, to minimize the L2 loss between predicted location offsets $\hat{d} = (\hat{d}_{tx}, \hat{d}_{ty}, \hat{d}_{bx}, \hat{d}_{by})$ and targets $d^* = (d^*_{tx}, d^*_{ty}, d^*_{bx}, d^*_{by})$, as formulized by: $L_{loc}(\hat{d}, d^*) = \sum_{i \in \{tx, ty, bx, by\}} \lVert \hat{d}_i - d^*_i \rVert^2$ (Eq. 2)”. L2 Loss such as $L_{cls}$ with L2 norm notation $\lVert \hat{y} - y^* \rVert^2$ (i.e. L2 norm)).

Regarding claim 7,
Huang and Kokkinos disclose the system of claim 1, wherein the gradient loss function is a L1 loss function (Huang [0053]-[0054] recites “In embodiments, like Fast R-CNN, the network has two sibling output branches. In embodiments, independent branch 360 outputs the confidence score ŷ (per pixel in the output map) of being a target object. Given the ground truth label y*∈ {0,1}, the classification loss can be defined as follows: $L_{cls}(\hat{y}, y^*) = \lVert \hat{y} - y^* \rVert^2$ (Eq. 1) [0054] In embodiments, a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss, denoted as $L_{loc}$, for example, to minimize the L2 loss between predicted location offsets $\hat{d} = (\hat{d}_{tx}, \hat{d}_{ty}, \hat{d}_{bx}, \hat{d}_{by})$ and targets $d^* = (d^*_{tx}, d^*_{ty}, d^*_{bx}, d^*_{by})$, as formulized by: $L_{loc}(\hat{d}, d^*) = \sum_{i \in \{tx, ty, bx, by\}} \lVert \hat{d}_i - d^*_i \rVert^2$ (Eq. 2)” Loss functions such as hinge loss and cross-entropy loss (i.e. L1 loss function)).

Regarding claim 8,
Huang and Kokkinos disclose the system of claim 1, wherein the hardware processor is further programmed by the executable instructions to: determine an average of the gradient norms of the plurality of tasks as the average gradient norm (Huang [0058] and [0062]-[0063] recites “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction. [0062] …that combining may be performed in any other manner. In embodiments, the balance between classification and regression tasks is controlled by the parameter λloc. For example, the regression target d* may be normalized by dividing by the standard object height… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used” Normalized by dividing by the standard object height (i.e. average of the gradient norms)).
Regarding claim 13,
Huang and Kokkinos disclose the system of claim 1, wherein to determine the gradient of the gradient loss function, the hardware processor is further programmed by the executable instructions to: 
determine the gradient of the gradient loss function with respect to the task weight for each task of the plurality of tasks while keeping the target gradient norm for the task constant (Huang [0058], [0060] and [0063] recites “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction. [0060] …and combining the classification loss (Eq. 1) and bounding box regression loss (Eq. 2) with the masks… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Combined loss functions (i.e. gradient of the gradient loss function)).

Regarding claim 14,
Huang and Kokkinos disclose the system of claim 1, wherein the hardware processor is further programmed by the executable instructions to: normalize the updated weights for the plurality of tasks (Huang [0062]-[0063] recites “…that combining may be performed in any other manner. In embodiments, the balance between classification and regression tasks is controlled by the parameter λloc. For example, the regression target d* may be normalized by dividing by the standard object height… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.”).
Regarding claim 15,
Huang and Kokkinos disclose the system of claim 14, wherein to normalize the updated weights for the plurality of tasks, the hardware processor is further programmed by the executable instructions to: normalize the updated weights for the plurality of tasks to a number of the plurality of tasks (Huang [0062]-[0063] recites “One of skill in the art will also appreciate that combining may be performed in any other manner. In embodiments, the balance between classification and regression tasks is controlled by the parameter λloc. For example, the regression target d* may be normalized by dividing by the standard object height, which is 50/4 in ground truth map, and setting λloc=3. [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.”).  

Regarding claim 16,
Huang and Kokkinos disclose the system of claim 1, wherein the plurality of tasks comprises a regression task, a classification task, or a combination thereof (Huang [0044] and [0062] recite, in part, “FIG. 3 illustrates a network architecture according to various embodiments of the present disclosure. Network architecture 300 in example in FIG. 3 is derived from the VGG 19 model used for image classification. [0062] In embodiments, the balance between classification and regression tasks is controlled by the parameter λloc.” Classification and regression tasks.).

Regarding claim 17,
Huang and Kokkinos disclose the system of claim 16, wherein the classification task comprises perception, face recognition, visual search, gesture recognition, semantic segmentation, object detection, room layout estimation, cuboid detection, lighting detection, simultaneous localization and mapping, relocalization, speech processing, speech recognition, natural language processing, or a combination thereof (Huang [0030] and [0104] recites, in part, “Object detection often involves multi-task learning, such as landmark localization, pose estimation, and semantic segmentation. [0104] Training and Testing. As with face detection, two models—one with and one without landmark localization—are trained on the KITTI object detection training set. Since KITTI does not provide landmarks for cars, 8 landmarks shown in FIG. 5 are annotated for 7790 cars…” Object detection, landmark localization (i.e. simultaneous localization and mapping), and semantic segmentation).

Regarding claim 18,
Huang and Kokkinos disclose the system of claim 1, wherein the multitask network comprises a plurality of shared layers and an output layer comprising a plurality of task specific filters (Huang fig. 3 and [0044] & [0046] recites “Network architecture 300 in example in FIG. 3 is derived from the VGG 19 model used for image classification… network architecture 300 comprises 16 convolution layers, 12 convolution layers labeled Conv1_1 304 through Conv4_4 330; and 3 pooling layers 340-344. [0046] Upsampling is performed by bi-linear filtering, for example, to generate a 4×4 matrix from a 2×2 matrix patch using linear interpolation. In embodiments, to compute a final score, the upsampled feature map is input to two independent branches 360-362. In FIG. 3, the first branch begins with Conv5_1_det 352, a convolution layer for detection, and the second branch begins with Conv5_1_loc 356, a convolution layer for localization. One of ordinary skill in the art will appreciate that computations in the independent branches may be performed simultaneously.” Convolutional layers with pooling layers (i.e. shared layers) and bi-linear filtering to two independent branches for detection and localization (i.e. output layer with task specific filters)).  

Regarding claim 19,
Huang and Kokkinos disclose the system of claim 18, wherein the output layer of the multitask network comprises an affine transformation layer (Huang [0046] and [0048] recites “Upsampling is performed by bi-linear filtering, for example, to generate a 4×4 matrix from a 2×2 matrix patch using linear interpolation. In embodiments, to compute a final score, the upsampled feature map is input to two independent branches 360-362. In FIG. 3, the first branch begins with Conv5_1_det 352, a convolution layer for detection, and the second branch begins with Conv5_1_loc 356, a convolution layer for localization. One of ordinary skill in the art will appreciate that computations in the independent branches may be performed simultaneously. [0048] Multi-Level Feature Fusion. In embodiments, features from different convolution layers are combined to enhance the performance of certain tasks, such as edge detection and segmentation. Part-level features focus on local details of objects to find discriminative appearance parts, whereas object-level or high-level features usually have a larger receptive field in order to recognize objects. A larger receptive field also comprises context information that may aid in predicting more accurate results.” Upsampling to generate 4x4 matrix from 2x2 matrix using linear interpolation through bi-linear filtering (i.e. affine transformation layer) for computing a final score).  

Regarding claim 20,
Huang discloses a method for training a multitask network comprising: 
under control of a hardware processor 
receiving a training datum of a plurality of training data each associated with a plurality of reference task outputs for the plurality of tasks (Huang [0032] and [0063] recite “FIG. 1 illustrates an exemplary object detection pipeline for a convolutional network according to various embodiments of the present disclosure. In embodiments, pipeline 100 receives input image 112 or image pyramid that is fed to network 104. After several layers of convolution and pooling, feature map 106 is upsampled and convolution layers are applied to obtain final output 108. [0063] In training, an input patch may be considered a “positive patch” if it comprises an object centered in the center at a specific scale. These patches comprise mainly negative samples around the positive samples. In embodiments, in order to fully explore the negative samples in the whole dataset, patches are cropped at random scale from training images, and resized to the same size and fed to the network.” Input patch (i.e. training datum) from received images (i.e. training data) with positive and negative samples (i.e. reference task outputs)); 
for each task of the plurality of tasks during a training, 
determining a gradient norm of a single-task loss adjusted by a task weight for the task, with respect to a plurality of network weights of the multitask network, the single-task loss being of (1) a task output for the task determined using a multitask network with the training datum as input, and (2) a corresponding reference task output for the task associated with the training datum (Huang [0058] and [0063] recite “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction. [0063] In training, an input patch may be considered a “positive patch” if it comprises an object centered in the center at a specific scale. These patches comprise mainly negative samples around the positive samples. In embodiments, in order to fully explore the negative samples in the whole dataset, patches are cropped at random scale from training images, and resized to the same size and fed to the network... The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning.” Gradient descent learning on samples using loss and output gradients (i.e. gradient norm of task loss of a task output), positive and negative samples (i.e. reference tasks) and scaled by number of contributing pixels (i.e. reference task output adjusted by task weights); see also [0032]-[0039]); and 
determining a gradient loss function comprising a difference between (1) the determined gradient norm for each task and (2) a corresponding target gradient norm determined based on (a) an average gradient norm of the plurality of tasks, and (b) the relative training rate for the task (Huang [0054], [0060] and [0063] recites “…a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss… [0060] …combining the classification loss (Eq. 1) and bounding box regression loss (Eq. 2) with the masks… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Loss function (i.e. gradient loss function), gradient descent learning on samples (i.e. a difference between gradient norm for each task and target gradient norm), normalized by dividing by the standard object height (i.e. average gradient norm), learning rate (i.e. training rate), and parameter λloc (i.e. hyperparameter)); and 
determining an updated task weight for each of the plurality of tasks using a gradient of a gradient loss function with respect to the task weight (Huang [0063] recites “The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Momentum term weight (i.e. updated task weight)).
Huang does not explicitly disclose
determining a relative training rate for the task based on the single- task loss for the task; 
wherein the relative training rates for the plurality of task are determined so that each step in the plurality of tasks completes over a similar length of training time 
However, Kokkinos teaches
determining a relative training rate for the task based on the single-task loss for the task (Kokkinos pg. 5456: “Our training objective is the sum of per-task losses and regularization terms applied to task-specific, as well as shared layers: L(w_{0,1,…,T}) = R(w_0) + Σ_{t=1}^{T} γ_t (R(w_t) + L_t(w_0, w_t)), (1) where t indexes tasks, w_0 denotes shared CNN weights, w_t are task-specific weights, γ_t determines the relative importance of task t, R(w_*) = (λ/2)‖w_*‖² is an ℓ2 regularization, and L_t(w_0, w_t) is the task-specific loss.” A loss is calculated for each task based on the weight (w_t) and the relative importance of the task R(w_*) which determines the training rate.); 
wherein the relative training rates for the plurality of task are determined so that each step in the plurality of tasks completes over a similar length of training time (Kokkinos pg. 5454: “We obtain competitive performance while jointly addressing all tasks in 0.7 seconds on a GPU. Our system will be made publicly available.” And pg. 5461, section 6: “We have shown that one can effectively scale up to many and diverse tasks, since the memory complexity is independent of the number of tasks, and incoherently annotated datasets can be combined during training. This has allowed us to train a single network that can solve multiple tasks in a fraction of a second with competitive performance.” Tasks all finish at around 0.7 seconds (i.e. similar training rate).); 
Huang and Kokkinos are analogous art because both are directed to the field of computing losses for multitask convolutional neural networks.
	It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the convolutional neural network of Huang with the method of training diverse tasks of Kokkinos.
	Doing so would allow the number of tasks to be increased while keeping memory complexity low, which permits scaling to more tasks while addressing the memory demands of backpropagation for those tasks (Kokkinos pg. 5455, col. 1).
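
For illustration only, the training steps recited in claim 20 can be sketched as below. This is a minimal sketch under stated assumptions and is not taken from the claims, Huang, or Kokkinos: the function names are hypothetical, the target gradient norm is assumed to be the average gradient norm multiplied by the relative training rate, the difference is assumed to be an absolute (L1) difference, and the target is assumed to be held constant when differentiating with respect to the task weights.

```python
# Illustrative sketch only. All names and formulas are assumptions made for
# readability; they are not taken from the claims or the cited references.

def weighted_gradient_norms(task_weights, raw_grad_norms):
    """Gradient norm of each weighted single-task loss: w_i * ||grad L_i||."""
    return [w * g for w, g in zip(task_weights, raw_grad_norms)]


def gradient_loss(task_weights, raw_grad_norms, relative_rates):
    """Return (loss, norms, targets) where loss = sum_i |G_i - target_i|.

    Assumption: target_i = mean(G) * relative_rate_i.
    """
    norms = weighted_gradient_norms(task_weights, raw_grad_norms)
    avg = sum(norms) / len(norms)
    targets = [avg * r for r in relative_rates]
    loss = sum(abs(n - t) for n, t in zip(norms, targets))
    return loss, norms, targets


def update_task_weights(task_weights, raw_grad_norms, relative_rates, lr=0.01):
    """One gradient-descent step on the task weights, targets held constant."""
    _, norms, targets = gradient_loss(task_weights, raw_grad_norms, relative_rates)
    updated = []
    for w, g, n, t in zip(task_weights, raw_grad_norms, norms, targets):
        sign = (n > t) - (n < t)           # derivative of |x|, taken as 0 at x == 0
        updated.append(w - lr * sign * g)  # d(w * g - t)/dw = g
    return updated
```

In this sketch a task whose weighted gradient norm exceeds its target has its task weight decreased, and a task below its target has its task weight increased.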

Regarding claim 21,
Huang and Kokkinos disclose the method of claim 20, wherein the corresponding target gradient norm is determined based on (a) an average gradient norm of the plurality of tasks, (b) the relative training rate for the task, and (c) a hyperparameter (Huang [0058] and [0062]-[0063] recites “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction. [0062] …that combining may be performed in any other manner. In embodiments, the balance between classification and regression tasks is controlled by the parameter λloc. For example, the regression target d* may be normalized by dividing by the standard object height… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning.” Loss function (i.e. gradient loss function), gradient descent learning on samples (i.e. a difference between gradient norm for each task and target gradient norm), normalized by dividing by the standard object height (i.e. average gradient norm), learning rate (i.e. training rate), and parameter λloc (i.e. hyperparameter)).

Regarding claim 22,
Huang and Kokkinos disclose the method of claim 20, further comprising: determining the gradient of the gradient loss function with respect to a task weight for each task of the plurality of tasks (Huang [0058], [0060] and [0063] recite “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction. [0060] …and combining the classification loss (Eq. 1) and bounding box regression loss (Eq. 2) with the masks… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning.” Combined loss functions (i.e. gradient of the gradient loss function)). 
 
Regarding claim 23,
Huang and Kokkinos disclose the method of claim 20, wherein the plurality of training data comprises a plurality of training images, and wherein the plurality of tasks comprises computer vision tasks (Huang [0029], [0030] and [0032] recites, in part, “More recent designs use deep CNNs to locate objects…these methods use shared computation of convolutions, which has been attracting increased attention due to its relatively efficient and accurate visual recognition. [0030] Object detection often involves multi-task learning, such as landmark localization, pose estimation, and semantic segmentation. [0032] FIG. 1 illustrates an exemplary object detection pipeline for a convolutional network according to various embodiments of the present disclosure. In embodiments, pipeline 100 receives input image 112 or image pyramid that is fed to network 104. After several layers of convolution and pooling, feature map 106 is upsampled and convolution layers are applied to obtain final output 108. In embodiments, output feature 108 map is converted to bounding boxes 120, and non-maximum suppression is applied to bounding boxes 120 that exceed a threshold.” Pipeline for receiving input images (i.e. training data comprising training images) and object detection with CNNs for visual recognition (i.e. computer vision)).   

Regarding Claim 28,
Huang teaches a method for training a multitask neural network for determining outputs associated with a plurality of tasks, the method comprising: under control of a hardware processor, and during a training: 
receiving a training sample set associated with a plurality of reference task outputs for the plurality of tasks (Huang [0063] recites “In training, an input patch may be considered a “positive patch” if it comprises an object centered in the center at a specific scale. These patches comprise mainly negative samples around the positive samples. In embodiments, in order to fully explore the negative samples in the whole dataset, patches are cropped at random scale from training images, and resized to the same size and fed to the network.” Input patch from training images with positive and negative samples (i.e. training sample set associated with reference tasks outputs)); 
calculating a multitask loss function based at least partly on a weighted combination of a plurality of single task loss functions associated with the plurality of tasks, wherein a plurality of weights associated with the plurality of single task loss functions in the multitask loss function can vary at each training timestep (Huang [0054] and [0063] recites, in part, “…a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Loss function used in both face and car detection tasks (i.e. multitask loss function), loss functions such as L2 loss, hinge loss, and cross-entropy loss (i.e. combination of single task loss functions), weight decay and loss and output gradients scaled (i.e. weights with varying loss), iterations (i.e. steps)); 
determining the plurality of weights for associated single task loss functions such that each task of the plurality of tasks is trained at a similar training rate (Huang [0063] recites, in part, “In training, an input patch may be considered a “positive patch” if it comprises an object centered in the center at a specific scale. These patches comprise mainly negative samples around the positive samples. In embodiments, in order to fully explore the negative samples in the whole dataset, patches are cropped at random scale from training images, and resized to the same size and fed to the network… The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Loss and output gradients are scaled so that both are comparable in multi-task learning, using a learning rate and weight decay (i.e. weights and loss at similar scale and rate)); and 
outputting a trained multitask neural network based at least in part on the training (Huang [0067] and [0100] recites, in part, “the final output refine branch uses the classification score map and landmark localization maps as input to refine the detection results. In embodiments, to further increase detection performance, a high level spatial model may be used to learn the constraints of landmark confidence and bounding box scores. [0100] …the models trained with different batch iterations still have high diversity since another significant boost has been seen by the model ensemble.” Final output branch and trained model (i.e. trained multitask neural network)).
Huang does not explicitly disclose
determining a plurality of weights associated with the plurality of single task loss functions in the multitask loss function such that each weight can vary at each training timestep,  
wherein the plurality of weights for associated single task loss functions are determined such that each step in the plurality of tasks is trained at a similar training rate, wherein the training rate for each task is determined so that each of the plurality of tasks completes over a similar length of training time [Application No.: 16/169,840; Filing Date: October 24, 2018];
However, Kokkinos teaches
determining a plurality of weights associated with the plurality of single task loss functions in the multitask loss function such that each weight can vary at each training timestep (Kokkinos pg. 5456: “Our training objective is the sum of per-task losses and regularization terms applied to task-specific, as well as shared layers: L(w_{0,1,…,T}) = R(w_0) + Σ_{t=1}^{T} γ_t (R(w_t) + L_t(w_0, w_t)), (1) where t indexes tasks, w_0 denotes shared CNN weights, w_t are task-specific weights, γ_t determines the relative importance of task t, R(w_*) = (λ/2)‖w_*‖² is an ℓ2 regularization, and L_t(w_0, w_t) is the task-specific loss.” A loss is calculated for each task based on the weight (w_t) and the relative importance of the task R(w_*) which determines the training rate.);
wherein the plurality of weights for associated single task loss functions are determined such that each step in the plurality of tasks is trained at a similar training rate (Kokkinos pg. 5456: “Our training objective is the sum of per-task losses and regularization terms applied to task-specific, as well as shared layers: L(w_{0,1,…,T}) = R(w_0) + Σ_{t=1}^{T} γ_t (R(w_t) + L_t(w_0, w_t)), (1) where t indexes tasks, w_0 denotes shared CNN weights, w_t are task-specific weights, γ_t determines the relative importance of task t, R(w_*) = (λ/2)‖w_*‖² is an ℓ2 regularization, and L_t(w_0, w_t) is the task-specific loss.” A loss is calculated for each task based on the weight (w_t) and the relative importance of the task R(w_*) which determines the training rate.), wherein the training rate for each task is determined so that each of the plurality of tasks completes over a similar length of training time (Kokkinos pg. 5454: “We obtain competitive performance while jointly addressing all tasks in 0.7 seconds on a GPU. Our system will be made publicly available.” And pg. 5461, section 6: “We have shown that one can effectively scale up to many and diverse tasks, since the memory complexity is independent of the number of tasks, and incoherently annotated datasets can be combined during training. This has allowed us to train a single network that can solve multiple tasks in a fraction of a second with competitive performance.” Tasks all finish at around 0.7 seconds (i.e. similar training rate).); and 
Huang and Kokkinos are analogous art because both are directed to the field of computing losses for multitask convolutional neural networks.
	It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the convolutional neural network of Huang with the method of training diverse tasks of Kokkinos.
	Doing so would allow the number of tasks to be increased while keeping memory complexity low, which permits scaling to more tasks while addressing the memory demands of backpropagation for those tasks (Kokkinos pg. 5455, col. 1).

Regarding claim 29,
Huang and Kokkinos disclose the method of claim 28, wherein the tasks comprise computer vision tasks, speech recognition tasks, natural language processing tasks, or medical diagnostic tasks (Huang [0029]-[0030] recites “More recent designs use deep CNNs to locate objects… these methods use shared computation of convolutions, which has been attracting increased attention due to its relatively efficient and accurate visual recognition. [0030] Object detection often involves multi-task learning, such as landmark localization, pose estimation, and semantic segmentation… Some deep net based object detection designs integrate multi-task learning, for example, to simultaneously learn facial landmarks and expressions, or simultaneously use a pose joint regressor and sliding window body part detector in a deep network architecture.” Visual recognition and object detection (i.e. computer vision tasks)).  

Regarding claim 30,
Huang and Kokkinos disclose the method of claim 28, wherein the multitask loss function is a linear combination of the weights and the single task loss functions (Huang [0048], [0054] and [0057] recite, in part, “…features from different convolution layers are combined to enhance the performance of certain tasks, such as edge detection and segmentation. Part-level features focus on local details of objects to find discriminative appearance parts, whereas object-level or high-level features usually have a larger receptive field in order to recognize objects. A larger receptive field also comprises context information that may aid in predicting more accurate results. [0054] … a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss… [0057] …the loss weight is set to 0, and for each pixel labeled non-positive in the output coordinate space, an ignore flag fign is set to 1 if a pixel with positive label within rnear=2 pixel length exists.” Convolution layers combined to enhance performance of certain tasks (i.e. combination), using loss functions such as L2 loss, hinge loss and cross-entropy loss (i.e. combination of single task loss), and loss weights for each pixel (i.e. weights)).  

Regarding claim 31,
Huang and Kokkinos disclose the method of claim 28, wherein determining the weights for each of the single task loss functions comprises penalizing the multitask neural network when backpropagated gradients from a first task of the plurality of tasks are substantially different from backpropagated gradients from a second task of the plurality of tasks (Huang [0058] and [0067] recite “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction. In embodiments, negative mining may be performed efficiently by using information about previous decision boundaries (also referred to as online bootstraping). [0067] … the final output refine branch uses the classification score map and landmark localization maps as input to refine the detection results. In embodiments, to further increase detection performance, a high level spatial model may be used to learn the constraints of landmark confidence and bounding box scores.” Gradient descent learning on samples (i.e. backpropagated gradients), negative mining with badly predicted samples (i.e. penalizing network), using classification and localization maps to refine (i.e. a first task and a second task)).

Regarding claim 32,
Huang and Kokkinos disclose the method of claim 28, wherein determining the weights for each of the single task loss functions comprises decreasing a first weight for a first task of the plurality of tasks relative to a second weight for a second task of the plurality of tasks when a first training rate for the first task exceeds a second training rate for the second task (Kokkinos pg. 5459 “The performance of our network on the set of tasks it adresses depends on the weights assigned to the different task losses in Eq. 1. A large weight for one task can skew the network’s internal representation in favor of the particular task and neglect the rest.”).

Claims 9-12 and 33-36 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (US 20170147905 A1, hereinafter Huang) in view of Kokkinos (“Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory”) and Natarajan (US 20130013275 A1).
Regarding claim 9,
Huang and Kokkinos disclose the system of claim 1 (Huang [0109] recites, in part, “FIG. 8 depicts a simplified block diagram of a computing system comprising an FCN to perform end-to-end multi-task object detection…”). 
However, Huang does not disclose wherein the corresponding target gradient norm is determined based on (a) an average gradient norm of the plurality of tasks, (b) an inverse of the relative training rate for the task, and (c) a hyperparameter.
Natarajan teaches wherein the corresponding target gradient norm is determined based on (a) an average gradient norm of the plurality of tasks, (b) an inverse of the relative training rate for the task, and (c) a hyperparameter (Natarajan [0035], [0100], [0105], and [0107] recite, in part, “Thus, if μ denotes the mean parameter, then from the formula (3) μ =b'(θ), where b'(*) is an invertible mean-value mapping (i.e., a mapping between the canonical parameter and the mean parameter is an invertible change of the parameters), and its inverse b'-1(*) is termed a canonical link function. [0100] Alternatively, the special form (e.g., the formula (12) of the likelihood function in these cases (i.e., the Normal, Gamma and Inverse Gaussian distributions) suggests that existing GLM macros (i.e., a rule or pattern that specifies how a certain input sequence should be mapped to an output sequence according to GLM) can be used to handle joint modeling of the mean and dispersion. The mean and dispersion sub-models in the formulas (52)-(53) are equivalent to loss functions used in the GLM. Therefore, starting from an initial estimate of Φ, the computing system 1600 alternates between solving a GLM model for μ with Φ held fixed, followed by the GLM for Φ with μ held fixed (the latter GLM will use standard macros that are available for performing Gamma regression in the Normal and Inverse Gaussian cases). [0105] Every function in the function space can be represented as a linear combination of basis functions, just as every vector in a vector space can be represented as a linear combination of basis vectors. [0107] …the computing system 1600 chooses the basis functions that maximally correlate with a corresponding steepest-descent gradient direction (i.e., a gradient method in which a choice of a direction is where a function f decreases most quickly, which is the direction opposite to a gradient of the function f) of the deviance loss function.” Mean parameter μ (i.e. average gradient norm), invertible mean-value mapping (i.e. inverse of the relative training rate), and Φ (i.e. hyperparameter)).
Natarajan and Huang are both directed to problems involving joint modeling and optimization for a variety of data sets. In view of the teachings of Natarajan, it would have been obvious to one of ordinary skill in the art to apply the teachings of Natarajan to Huang before the effective filing date of the claimed invention in order to perform more advanced model training by incorporating joint modeling of the mean and the dispersion for a variety of data sets thereby improving Huang (cf. Natarajan [0006]-[0007] recites, in part, the following: 
“Traditionally, GLM (General Linear Model), which is widely used for mean regression modeling, has been used for conditional response distributions from the exponential dispersion family. The GLM is a statistical linear model for a suitable transformation of the mean, term the link transformation. The GLM may be represented as g(Y)=XB+U, where Y is a vector with series of response measurements, g(.) is the link function that is chosen appropriately for the assumed response distribution, X is a design matrix, B is a vector including parameters to be estimated, and U is a vector including errors and noises. The design matrix refers to a matrix of explanatory variables (one or zeroes, or reals), that represents a specific statistical or experimental model). However, a traditional methodology such as the GLM cannot perform joint modeling of the mean and the dispersion without inventing additional art, particularly in the case when the covariates in the data are complex, and must be simplified and grouped in a preprocessing step that considerably detracts from the quality of resulting model.

Therefore, it is highly desirable to provide a system and method to perform joint modeling of a mean and dispersion suitable for a wide variety of data sets without requiring any preprocessing and grouping of the sample data.”
).


Regarding claim 10,
The Huang/Kokkinos/Natarajan Combination teaches the system of claim 9, wherein the hardware processor is further programmed by the executable instructions to: 
determine the average gradient norm of the plurality of tasks multiplied by the inverse relative training rate for the task to a power corresponding of the hyperparameter as the corresponding target gradient norm (Natarajan [0035], [0062], [0100], [0105] and [0107] recite, in part, “Thus, if μ denotes the mean parameter, then from the formula (3) μ =b'(θ), where b'(*) is an invertible mean-value mapping (i.e., a mapping between the canonical parameter and the mean parameter is an invertible change of the parameters), and its inverse b'-1(*) is termed a canonical link function. A link function serves to link a random or stochastic component of a model, a probability distribution of a response variable, to a systematic component of a model (e.g., a linear predictor). [0062] The form c(y, Φ) in the formula (11), which is exact only for the Normal, Gamma and Inverse Gaussian distributions, also has the same form with a saddlepoint density approximation in a leading-order term for Φ→0 for other conditional response distributions from the exponential dispersion family. [0100] Alternatively, the special form (e.g., the formula (12) of the likelihood function in these cases (i.e., the Normal, Gamma and Inverse Gaussian distributions) suggests that existing GLM macros (i.e., a rule or pattern that specifies how a certain input sequence should be mapped to an output sequence according to GLM) can be used to handle joint modeling of the mean and dispersion. The mean and dispersion sub-models in the formulas (52)-(53) are equivalent to loss functions used in the GLM. Therefore, starting from an initial estimate of Φ, the computing system 1600 alternates between solving a GLM model for μ with Φ held fixed, followed by the GLM for Φ with μ held fixed (the latter GLM will use standard macros that are available for performing Gamma regression in the Normal and Inverse Gaussian cases). 
[0105] The basis function refers to an element of a particular basis (i.e., a set of vectors) for a function space (i.e., a set of functions). Every function in the function space can be represented as a linear combination of basis functions, just as every vector in a vector space can be represented as a linear combination of basis vectors. [0107] …the computing system 1600 chooses the basis functions that maximally correlate with a corresponding steepest-descent gradient direction (i.e., a gradient method in which a choice of a direction is where a function f decreases most quickly, which is the direction opposite to a gradient of the function f) of the deviance loss function.” Examiner interprets μ (i.e. average gradient), inverse b'-1(*) as inverse and corresponding steepest-descent gradient direction as average gradient norm multiplied by inverse relative training rate, and Φ as hyperparameter.).  
Please see motivation for claim 9 above.
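
For illustration only, the relationship recited in claim 10 may be written compactly as below. The symbols are assumed here for readability and do not appear in the claims or in the cited references: \bar{G} denotes the average gradient norm of the plurality of tasks, r_i^{-1} the inverse relative training rate for task i, and \alpha the hyperparameter.

```latex
G_i^{\mathrm{target}} = \bar{G} \cdot \left( r_i^{-1} \right)^{\alpha}
```

Under this reading, setting \alpha = 0 would give every task the same target gradient norm, namely the average \bar{G}.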

Regarding claim 11,
The Huang/Kokkinos/Natarajan Combination teaches the system of claim 9, wherein to determine the relative training rate for the task based on the single-task loss for the task, the hardware processor is further programmed by the executable instructions to: 
determine the inverse of the relative training rate for the task based on a loss ratio of the single-task loss for the task and another single-task loss for the task (Natarajan [0118] and [0135] recite, in part, “Where μk{I}, Φk{I} denote the mean and dispersion estimates respectively at the k'th stage from training data for the I'th fold, and L{-I}( μk{I}, Φk{I}) denotes a loss function for test data in the I'th fold at the k'th stage. [0135] The table 3 describes for various response distributions… when a "correct" loss function is used for a model fit. For all three response distributions (i.e., Normal, Gamma and Inverse Gaussian), a choice k=1 yields the best model fit, and this is also the simplest basis function that is consistent with an assumed piecewise-constant variation in the synthetic data.” Loss ratios depicted in Table 3 for different distributions and tasks).
Please see motivation for claim 9.
  
Regarding claim 12,
The Huang/Kokkinos/Natarajan Combination teaches the system of claim 11, wherein to determine the inverse of the relative rate for the task, the hardware processor is further programmed by the executable instructions to: 
determine a ratio of the loss ratio of the task and an average of loss ratios of the plurality of tasks as the inverse of the relative training rate (Natarajan [0103], [0118], [0122] and [0135] recite, in part, “…the computing system 1600 obtains a deviance loss function (e.g., a formula (30)) [0118] Where μk{I}, Φk{I} denote the mean and dispersion estimates respectively at the k'th stage from training data for the I'th fold, and L{-I}(μk{I}, Φk{I}) denotes a loss function for test data in the I'th fold at the k'th stage. The number of cross-validation folds NCV is typically 5 or 10. Other criteria such as the 1-SE rule, in which the K is the smallest number of stages for which the cross-validation loss is within 1 standard error of a minimum cross-validation loss, can also be used as an alternative to the formula (66). [0122] Another important benefit of the use of the least-squares fitting criterion arises in a treatment of categorical covariate splits, where a convexity of the least-squares splitting criterion ensures that if Ω is a cardinality of a categorical covariate, then the best split in this covariate can be found in just O(Ω) steps, without having to search through the space of all possible splits which would require O(2^Ω) steps. An ability to evaluate categorical splits in a linear rather than exponential number of steps allows categorical features of high cardinality to be used in the regression modeling without requiring any preprocessing and/or grouping of category levels in these features for computational tractability in a fitting procedure. In contrast, regression trees that directly use the overall loss function as the splitting criterion may not have this useful property (i.e., not requiring preprocessing and grouping) for categorical covariates, from the formula (9) relevant loss functions may not always be convex in a mean regression, and are certainly non-convex in the joint regression of the mean and dispersion. 
[0135] The table 3 describes for various response distributions, an effect of varying tree depths in respective fitted models in a variable dispersion case, when a "correct" loss function is used for a model fit. For all three response distributions (i.e., Normal, Gamma and Inverse Gaussian)...” Examiner interprets cross-validation as ratio of loss ratios and cross-validating across stages as average of loss ratios).  
Please see motivation for claim 9.
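For illustration only, the ratio recited in this claim can be sketched as follows. The sketch assumes that a task's "loss ratio" is its current loss divided by its initial loss; that definition is not restated in this section. The function and variable names are hypothetical and appear in neither the claims nor the cited references.

```python
import numpy as np

def inverse_relative_training_rate(current_losses, initial_losses):
    """Each task's loss ratio divided by the average loss ratio over all tasks."""
    current = np.asarray(current_losses, dtype=float)
    initial = np.asarray(initial_losses, dtype=float)
    # Assumption: "loss ratio" means current loss over initial loss for a task.
    loss_ratios = current / initial
    # Ratio of the task's loss ratio to the average of the loss ratios.
    return loss_ratios / loss_ratios.mean()

# Two hypothetical tasks: the first has halved its loss, the second has not moved.
rates = inverse_relative_training_rate([0.5, 2.0], [1.0, 2.0])
```

Under this reading, a value below 1 indicates a task whose loss has fallen faster than average, and the values always average to 1.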

Regarding claim 33,
Huang and Kokkinos disclose the method of claim 28, wherein determining the weights for each of the single task loss functions comprises: 
evaluating a gradient norm of a weighted single-task loss function for each task of the plurality of tasks with respect to the weights at a training time (Huang [0057]-[0058] and [0063] recite, in part, “… the loss weight is set to 0, and for each pixel labeled non-positive in the output coordinate space, an ignore flag fign is set to 1 if a pixel with positive label within rnear=2 pixel length exists. [0058] After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Gradient descent learning on samples using loss and output gradients (i.e. gradient norm of the weighted task loss for each task output)); 
evaluating an average gradient norm across all tasks at the training time (Huang [0057]-[0058] and [0063] recite, in part, “… the loss weight is set to 0, and for each pixel labeled non-positive in the output coordinate space, an ignore flag fign is set to 1 if a pixel with positive label within rnear=2 pixel length exists. [0058] After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Examiner interprets normalized by dividing by the standard object height (i.e. average gradient norm));
However, Huang does not disclose calculating a relative inverse training rate for each task of the plurality of tasks; and calculating a gradient loss function based at least partly on differences between the gradient norms of each of the weighted single-task loss functions and the average gradient norm multiplied by a function of the relative inverse training rate. 
Natarajan teaches calculating a relative inverse training rate for each task of the plurality of tasks (Natarajan [0035], [0100], [0105], and [0107] recite, in part, “Thus, if μ denotes the mean parameter, then from the formula (3) μ =b'(θ), where b'(*) is an invertible mean-value mapping (i.e., a mapping between the canonical parameter and the mean parameter is an invertible change of the parameters), and its inverse b'-1(*) is termed a canonical link function. [0100] Alternatively, the special form (e.g., the formula (12) of the likelihood function in these cases (i.e., the Normal, Gamma and Inverse Gaussian distributions) suggests that existing GLM macros (i.e., a rule or pattern that specifies how a certain input sequence should be mapped to an output sequence according to GLM) can be used to handle joint modeling of the mean and dispersion. The mean and dispersion sub-models in the formulas (52)-(53) are equivalent to loss functions used in the GLM. Therefore, starting from an initial estimate of Φ, the computing system 1600 alternates between solving a GLM model for μ with Φ held fixed, followed by the GLM for Φ with μ held fixed (the latter GLM will use standard macros that are available for performing Gamma regression in the Normal and Inverse Gaussian cases). [0105] Every function in the function space can be represented as a linear combination of basis functions, just as every vector in a vector space can be represented as a linear combination of basis vectors. [0107] …the computing system 1600 chooses the basis functions that maximally correlate with a corresponding steepest-descent gradient direction (i.e., a gradient method in which a choice of a direction is where a function f decreases most quickly, which is the direction opposite to a gradient of the function f) of the deviance loss function.” Examiner interprets mean parameter μ (i.e. average gradient norm) and invertible mean-value mapping (i.e. inverse of the relative training rate)); and  
calculating a gradient loss function based at least partly on differences between the gradient norms of each of the weighted single-task loss functions and the average gradient norm multiplied by a function of the relative inverse training rate (Natarajan [0035], [0100], [0105], and [0107] recite, in part, “Thus, if μ denotes the mean parameter, then from the formula (3) μ =b'(θ), where b'(*) is an invertible mean-value mapping (i.e., a mapping between the canonical parameter and the mean parameter is an invertible change of the parameters), and its inverse b'-1(*) is termed a canonical link function. [0100] Alternatively, the special form (e.g., the formula (12) of the likelihood function in these cases (i.e., the Normal, Gamma and Inverse Gaussian distributions) suggests that existing GLM macros (i.e., a rule or pattern that specifies how a certain input sequence should be mapped to an output sequence according to GLM) can be used to handle joint modeling of the mean and dispersion. The mean and dispersion sub-models in the formulas (52)-(53) are equivalent to loss functions used in the GLM. Therefore, starting from an initial estimate of Φ, the computing system 1600 alternates between solving a GLM model for μ with Φ held fixed, followed by the GLM for Φ with μ held fixed (the latter GLM will use standard macros that are available for performing Gamma regression in the Normal and Inverse Gaussian cases). [0105] Every function in the function space can be represented as a linear combination of basis functions, just as every vector in a vector space can be represented as a linear combination of basis vectors. 
[0107] …the computing system 1600 chooses the basis functions that maximally correlate with a corresponding steepest-descent gradient direction (i.e., a gradient method in which a choice of a direction is where a function f decreases most quickly, which is the direction opposite to a gradient of the function f) of the deviance loss function.” Examiner interprets deviance loss function (i.e. gradient loss function) and function f (i.e. a function of the relative inverse training rate))).  
Natarajan and Huang are both directed to problems involving joint modeling and optimization for a variety of data sets. In view of the teachings of Natarajan, it would have been obvious to one of ordinary skill in the art to apply the teachings of Natarajan to Huang before the effective filing date of the claimed invention in order to perform more advanced model training by incorporating joint modeling of the mean and the dispersion for a variety of data sets thereby improving Huang (cf. Natarajan [0006]-[0007] recites, in part, the following: 
“Traditionally, GLM (General Linear Model), which is widely used for mean regression modeling, has been used for conditional response distributions from the exponential dispersion family. The GLM is a statistical linear model for a suitable transformation of the mean, term the link transformation. The GLM may be represented as g(Y)=XB+U, where Y is a vector with series of response measurements, g(.) is the link function that is chosen appropriately for the assumed response distribution, X is a design matrix, B is a vector including parameters to be estimated, and U is a vector including errors and noises. The design matrix refers to a matrix of explanatory variables (one or zeroes, or reals), that represents a specific statistical or experimental model). However, a traditional methodology such as the GLM cannot perform joint modeling of the mean and the dispersion without inventing additional art, particularly in the case when the covariates in the data are complex, and must be simplified and grouped in a preprocessing step that considerably detracts from the quality of resulting model.

Therefore, it is highly desirable to provide a system and method to perform joint modeling of a mean and dispersion suitable for a wide variety of data sets without requiring any preprocessing and grouping of the sample data.”).
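For illustration only, the gradient loss function recited in claim 33 can be sketched as follows. The sketch assumes that the per-task differences are combined as a sum of absolute values, and that the "function of the relative inverse training rate" is supplied by the caller (identity by default). All names are hypothetical and are not taken from Huang, Kokkinos, or Natarajan.

```python
import numpy as np

def gradient_loss(grad_norms, inverse_rates, rate_fn=lambda r: r):
    """Sum over tasks of |G_i - mean(G) * f(r_i)|."""
    g = np.asarray(grad_norms, dtype=float)      # gradient norm of each weighted single-task loss
    r = np.asarray(inverse_rates, dtype=float)   # relative inverse training rate of each task
    # Target for each task: average gradient norm times a function of its rate.
    targets = g.mean() * rate_fn(r)
    # Assumption: differences are aggregated as a sum of absolute values.
    return float(np.abs(g - targets).sum())

# Hypothetical values: the norms already match their targets, so the loss is zero.
loss_value = gradient_loss([1.0, 3.0], [0.5, 1.5])
```

With equal rates of 1.0 for both tasks, the same norms would instead give a loss of 2.0, since each norm is one unit away from the average of 2.0.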

Regarding claim 34,
The Huang/Kokkinos/Natarajan Combination teaches the method of claim 33, wherein the gradient loss function comprises an L1 loss function (Huang [0053]-[0054] recites “In embodiments, like Fast R-CNN, the network has two sibling output branches. In embodiments, independent branch 360 outputs the confidence score ŷ (per pixel in the output map) of being a target object. Given the ground truth label y*∈ {0,1}, the classification loss can be defined as follows: Lcls (ŷ, y*)=∥ŷ−y∥2  (Eq. 1) [0054] In embodiments, a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss, denoted as Lloc, for example, to minimize the L2 loss between predicted location offsets d^=d^tx, d^ty, d^tx, d^ty) and targets d*=(d*tx, d*ty, d*tx, d*ty), as formulized by: Lloc(d^,d*)=Σi∈{tx,ty,bx,by} ∥d^i- d*i∥2 (Eq. 2)” Loss functions such as hinge loss and cross-entropy loss (i.e. L1 loss function)).  
Please see motivation for claim 33 above.

Regarding claim 35,
The Huang/Kokkinos/Natarajan Combination teaches the method of claim 33, wherein the function of the relative inverse training rate comprises a power law function (Natarajan [0031], [0037], and [0062] recites, in part, “[0031] The present invention thus provides: (1) An easy incorporation of relevant nonlinear and low-order covariate interaction effects in regression functions by representing the regression functions as piecewise-constant, additive and non-linear function (nonlinear effects implies that a covariate enters into a regression function not only as a linear term Xi, but also as nonlinear terms such as Xi2 etc. [0037] A convolution property of the exponential dispersion family yields a relationship for statistical parameters of a distribution of homogeneous sample aggregates from an underlying distribution ED(μ, Φ). [0062] The form c(y, Φ) in the formula (11), which is exact only for the Normal, Gamma and Inverse Gaussian distributions, also has the same form with a saddlepoint density approximation in a leading-order term for Φ →0 for other conditional response distributions from the exponential dispersion family.” Examiner interprets the form c(y, Φ) for the inverse Gaussian distribution as the function of the relative inverse training rate and nonlinear term such as Xi2 as a power law function.).  
Please see motivation for claim 33 above.

Regarding claim 36,
The Huang/Kokkinos/Natarajan Combination teaches the method of claim 35, wherein the power law function has a power law exponent in a range from -1 to 3 (Natarajan [0037] and [0003] recites, in part, “A convolution property of the exponential dispersion family yields a relationship for statistical parameters of a distribution of homogeneous sample aggregates from an underlying distribution ED(μ, Φ). [0003] The (sample) dispersion is a measure of the spread of the data about its central value, and is generally measured by at least one or more of: (1) Range, (2) Mean absolute deviation, (3) Standard deviation, (4) Variance and (5) Covariance, etc. The range refers to the difference between a largest value and a smallest value among the sample data.” Examiner interprets exponential dispersion as a power law exponent and the dispersion measured by range between a largest value and a smallest value as having a range of -1 to 3.).  
Please see motivation for claim 33 above.
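For illustration only, the power law function of claims 35 and 36 can be sketched as raising the relative inverse training rate to an exponent. The sketch assumes that the recited range from -1 to 3 includes its endpoints. The names `power_law` and `alpha` are hypothetical.

```python
import numpy as np

def power_law(inverse_rates, alpha):
    """Return r ** alpha for each relative inverse training rate r."""
    # Assumption: the range recited in claim 36 is treated as inclusive.
    if not -1.0 <= alpha <= 3.0:
        raise ValueError("alpha is outside the range from -1 to 3")
    return np.asarray(inverse_rates, dtype=float) ** alpha

# Hypothetical rates for two tasks, with an exponent of 2.
scaled = power_law([0.5, 2.0], 2.0)
```

An exponent of 0 maps every rate to 1, so all tasks would share the same target, while larger exponents spread the targets further apart.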

Claims 24-27 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (US 20170147905 A1, hereinafter Huang) in view of Kokkinos (“Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory”) and Hoellwarth (US 20100079356 A1).
Regarding claim 24,
Huang discloses non-transitory memory configured to store: 
executable instructions (Huang [0112] recites, in part, “Embodiments of the present invention may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed. It shall be noted that the one or more non-transitory computer-readable media shall include volatile and non-volatile memory.”), and 
a multitask network for determining outputs associated with a plurality of tasks, wherein the multitask network is trained using, for each task of the plurality of tasks (Huang fig. 8 and [0109] recites, in part, “FIG. 8 depicts a simplified block diagram of a computing system comprising an FCN to perform end-to-end multi-task object detection…” Fully Convolutional Network (FCN) to perform end-to-end multi-task object detections (i.e. multitask network for determining outputs associated with tasks)):  
a gradient norm of a single-task loss, of (1) a task output for a task of the plurality of tasks determined using the multitask network with a training image as input, and (2) a corresponding reference task output for the task associated with the training image, adjusted by a task weight for the task, with respect to a plurality of network weights of the multitask network (Huang [0058] and [0063] recites, in part, “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction… [0063] In training, an input patch may be considered a “positive patch” if it comprises an object centered in the center at a specific scale. These patches comprise mainly negative samples around the positive samples. In embodiments, in order to fully explore the negative samples in the whole dataset, patches are cropped at random scale from training images, and resized to the same size and fed to the network...The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning.” Gradient descent learning on samples using loss and output gradients (i.e. gradient norm of tasks loss of a task output), positive and negative samples (i.e. reference tasks), and scaled by number of contributing pixels (i.e. reference task output adjusted by task weights)), 
a relative training rate for the task determined based on the single-task loss for the task (Huang [0063] recites “The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations.” Learning rate (i.e. training rate)), 
a gradient loss function comprising a difference between (1) the determined gradient norm for the task and (2) a corresponding target gradient norm determined based on (a) an average gradient norm of the plurality of tasks, (b) the relative training rate for the task, and (c) a hyperparameter (Huang [0054], [0058] and [0062]-[0063] recites, in part, “…a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. [0058] After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction… [0062] …that combining may be performed in any other manner. In embodiments, the balance between classification and regression tasks is controlled by the parameter λloc. For example, the regression target d* may be normalized by dividing by the standard object height… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning.” Loss function (i.e. gradient loss function), gradient descent learning on samples (i.e. a difference between gradient norm for each task and target gradient norm), normalized by dividing by the standard object height (i.e. average gradient norm), learning rate (i.e. training rate), and parameter λloc (i.e. hyperparameter)), and 
an updated task weight for the task using a gradient of the gradient loss function with respect to the task weight for the task (Huang [0063] recites “The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Momentum term weight (i.e. updated task weight)); 
However, Huang does not disclose 
 an updated task weight for the task using a gradient of the gradient loss function with respect to the task weight for the task such that the updated task weight for the task is based at least in part on the relative training rate for the task; 
wherein the relative training rates for the plurality of tasks are determined so that each step in the plurality of tasks completes over a similar length of training time,
a head mounted display system comprising: a display; a sensor; and a hardware processor in communication with the non-transitory memory and the display, the hardware processor programmed by the executable instructions to: receive a sensor datum captured by the sensor; determine a task output for each task of the plurality of tasks using the multitask network with the sensor datum as input; and cause the display to show information related to the determined task outputs to a user of the augmented reality device.  
Hoellwarth teaches a head mounted display system comprising (Hoellwarth fig. 1 element 100 and [0047] recites “Head-mounted display system 100 can include a variety of features, which can be provided by one or both devices of the system when they are connected and in communications with one another.”): 
a display (Hoellwarth [0049] recites “The image based content may for example be viewed on the display of the head mounted display system.”); 
a sensor (Hoellwarth [0049] recites “As yet another example, the head-mounted system 100 can utilize a proximity sensor on one or both of the head mounted device and portable electronic device to detect and identify the relationship between the two devices or to detect and identify things in the outside environment.”); and 
a hardware processor in communication with the non-transitory memory and the display, the hardware processor programmed by the executable instructions to (Hoellwarth [0047] recites, in part, “Head-mounted display system 100 can include a variety of features, which can be provided by one or both devices of the system when they are connected and in communications with one another. For example, each device may include one or more of the following components: processors, display screen, controls (e.g., buttons, switches, touch pads, and/or screens), camera, receiver, antenna, microphone, speaker, batteries, optical subassembly, sensors, memory, communication systems, input/output ("I/O") systems, connectivity systems, cooling systems, connectors, and/or the like. If activated, these components may be configured to work together or separately depending on the needs of the system.” Processor connected and in communication with memory as part of head-mounted display system (i.e. hardware processor in communication with memory)): 
receive a sensor datum captured by the sensor (Hoellwarth fig. 17A element 1708 and [0254] recites “At step 1708, the head-mounted display system can determine whether a user input has been detected from sensors.” Detected input from sensors (i.e. sensor datum captured by sensor)); 
determine a task output for each task of the plurality of tasks using the multitask network with the sensor datum as input (Hoellwarth fig. 17A and [0254] recites “For example, accelerometers on the head-mounted display system can detect if the user has made any head movements. Based on the detection of a particular head movement, the head-mounted display system can determine if the head movement is an indication that the user would like to view image based content from the outside world.” Image viewing based on content from outside world (i.e. task output)); and 
cause the display to show information related to the determined task outputs to a user of the head mounted display system (Hoellwarth fig. 17A element 1706 and [0254] & [0258] recites “If, at step 1708, the head-mounted display system determines that a user input has been received from the sensors, process 1700 moves to step 1706. [0258] At step 1706, a PIP image frame overlaid on at least one of displayed left and right image frames (e.g., PIP mode) can be displayed.” PIP Image frame overlaid on left and right image frames (i.e. display to show information related to task outputs) from user input received by sensors of head-mounted display (i.e. user of the augmented reality device)).  
Hoellwarth and Huang are both directed to problems related to image detection and imaging. In view of the teachings of Hoellwarth, it would have been obvious to one of ordinary skill in the art to apply the teachings of Hoellwarth to Huang before the effective filing date of the claimed invention in order to allow for enhanced and improved viewing (cf. Hoellwarth [0099] recites, in part, “In other cases, however, the optical sub assembly may also be a more complicated system of optical components that enhance and improved the viewing experience (i.e., help focus the user's eyes on the image frames being displayed on the display screen of the portable electronic device).”).
Kokkinos teaches 	
an updated task weight for the task using a gradient of the gradient loss function with respect to the task weight for the task (pg. 5456; “Our training objective is the sum of per-task losses and regularization terms applied to task-specific, as well as shared layers: L(w0, w1, ..., wT) = R(w0) + Σ_{t=1}^{T} γt (R(wt) + Lt(w0, wt)), (1) where t indexes tasks, w0 denotes shared CNN weights, wt are task-specific weights, γt determines the relative importance of task t, R(w∗) = (λ/2)‖w∗‖² is an ℓ2 regularization, and Lt(w0, wt) is the task-specific loss:” A loss is calculated for each task based on the weight (wt) and the relative importance of the task R(w∗) which determines the training rate.) such that the updated task weight for the task is based at least in part on the relative training rate for the task (pg. 5459; “The performance of our network on the set of tasks it addresses depends on the weights assigned to the different task losses in Eq. 1. A large weight for one task can skew the network’s internal representation in favor of the particular task and neglect the rest.” Tasks are balanced based on updated weights. A weight can be increased for an important task, but doing so will cause the performance of the other tasks to suffer.); 
wherein the relative training rates for the plurality of tasks are determined so that each step in the plurality of tasks completes over a similar length of training time (pg. 5454; We obtain competitive performance while jointly addressing all tasks in 0.7 seconds on a GPU. Our system will be made publicly available. And pg. 5461, section 6; We have shown that one can effectively scale up to many and diverse tasks, since the memory complexity is independent of the number of tasks, and incoherently annotated datasets can be combined during training. This has allowed us to train a single network that can solve multiple tasks in a fraction of a second with competitive performance. Tasks all finish at around 0.7 seconds (i.e. similar training rate).);
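For illustration only, the training objective quoted from Kokkinos pg. 5456 can be sketched as follows. The sketch reads Eq. (1) as a regularization term on the shared weights plus a sum over tasks, each task contributing its importance γt times its regularization term plus its task loss, with R(w) taken as (λ/2) times the squared L2 norm of w. The function names and argument names are hypothetical, and the per-task losses are passed in as precomputed numbers rather than evaluated from a network.

```python
import numpy as np

def l2_reg(w, lam):
    """R(w) = (lam / 2) * ||w||^2, the assumed reading of the quoted regularizer."""
    w = np.asarray(w, dtype=float)
    return 0.5 * lam * float(np.sum(w ** 2))

def multitask_objective(w_shared, w_tasks, task_losses, gammas, lam):
    """R(w0) + sum over t of gamma_t * (R(w_t) + L_t)."""
    total = l2_reg(w_shared, lam)
    for w_t, loss_t, gamma_t in zip(w_tasks, task_losses, gammas):
        # gamma_t sets the relative importance of task t in the sum.
        total += gamma_t * (l2_reg(w_t, lam) + loss_t)
    return total

# Hypothetical shared weights, two tasks, and precomputed task losses.
objective = multitask_objective([1.0, 1.0], [[2.0], [0.0]], [1.0, 3.0], [0.5, 2.0], lam=1.0)
```

In this sketch the γt values are fixed inputs; nothing in the sketch updates them during training.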
Huang and Kokkinos are analogous arts because they are both directed to the field of computing losses for multitask convolutional neural networks.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the convolutional neural network of Huang with the method of training diverse tasks of Kokkinos.
Doing so would allow the number of tasks to be increased while keeping memory complexity low. This allows for scaling tasks while addressing the memory demands of back propagation of the tasks (pg. 5455, col. 1).
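For illustration only, the task weight update recited in claim 24 can be sketched as follows. The sketch rests on several assumptions that the claim does not fix: the gradient norm of a weighted task loss is taken as the task weight times the unweighted gradient norm (valid for non-negative weights), the target gradient norm is the average gradient norm times the relative training rate raised to the hyperparameter `alpha`, the targets are held constant when differentiating, and the update is a plain gradient step. All names are hypothetical and are not drawn from Huang, Kokkinos, or Hoellwarth.

```python
import numpy as np

def update_task_weights(task_weights, unweighted_grad_norms, rates, alpha=1.0, lr=0.1):
    """One gradient step on sum_i |w_i * g_i - target_i| with respect to w_i."""
    w = np.asarray(task_weights, dtype=float)
    g = np.asarray(unweighted_grad_norms, dtype=float)
    r = np.asarray(rates, dtype=float)
    # Assumption: norm of the weighted loss gradient equals w_i * g_i for w_i >= 0.
    weighted_norms = w * g
    # Assumption: target = average gradient norm * rate ** alpha, treated as constant.
    targets = weighted_norms.mean() * r ** alpha
    # Derivative of |w_i * g_i - target_i| with respect to w_i.
    grad_wrt_w = np.sign(weighted_norms - targets) * g
    return w - lr * grad_wrt_w

# Two hypothetical tasks with equal weights and equal rates.
new_w = update_task_weights([1.0, 1.0], [1.0, 3.0], [1.0, 1.0])
```

In this example the task with the smaller gradient norm has its weight raised and the task with the larger gradient norm has its weight lowered, which moves both weighted norms toward the shared target.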

Regarding claim 25,
The Huang/Kokkinos/Hoellwarth Combination teaches the system of claim 24, wherein the plurality of tasks comprises a plurality of perceptual tasks (Huang [0030], [0072] and [0074] recites, in part, “Object detection often involves multi-task learning, such as landmark localization, pose estimation, and semantic segmentation. [0072] Neural network-based face detector refers to those face detection system using neural network before the recent break-through results of CNNs for image classification… While the systems and methods presented herein have a similar detection pipeline, embodiments use modern CNNs as detectors. [0074] Deep Dense Face Detector (DDFD)… is a face detection system based on convolutional neural networks… Although DDFD is a complete detection pipeline, it is not an end-to-end framework … In contrast, embodiments of the present disclosure can be optimized directly for detection and can be easily improved by incorporating landmark information.” Face detection with landmark localization (i.e. perceptual tasks)).
Please see motivation for claim 24 above.

Regarding claim 26,
The Huang/Kokkinos/Hoellwarth Combination teaches the system of claim 25, wherein the plurality of perceptual tasks comprises the face recognition, visual search, gesture identification, semantic segmentation, object detection, lighting detection, simultaneous localization and mapping, relocalization, or a combination thereof (Huang [0030], [0072] and [0074] recite, in part, “Object detection often involves multi-task learning, such as landmark localization, pose estimation, and semantic segmentation. [0072] Neural network-based face detector refers to those face detection system using neural network before the recent break-through results of CNNs for image classification… While the systems and methods presented herein have a similar detection pipeline, embodiments use modern CNNs as detectors. [0074] Deep Dense Face Detector (DDFD)… is a face detection system based on convolutional neural networks… Although DDFD is a complete detection pipeline, it is not an end-to-end framework … In contrast, embodiments of the present disclosure can be optimized directly for detection and can be easily improved by incorporating landmark information.” Face detection (i.e. face recognition), object detection, landmark localization, and semantic segmentation).
Please see motivation for claim 24 above.

Regarding claim 27,
The Huang/Kokkinos/Hoellwarth Combination teaches the system of claim 24, wherein the sensor comprises an inertial measurement unit, an outward-facing camera, a depth sensing camera, a microphone, an eye imaging camera, or a combination thereof (Hoellwarth [0225]-[0226], [0228] and [0230] recites, in part, “Head-mounted device 1304 can include one or more sensors 1324 to detect various signals. Suitable sensors can include, for example, ambient sound detectors, proximity sensors, accelerometers, light detectors, cameras, and temperature sensors. [0226] To identify the detected words, the ambient sound detector can attempt to match the words to a stored library of words. [0228] Accelerometers on head-mounted device 1304 can detect the user's head movements. [0230] Sensors 1324 can include a camera which can capture image based content of the outside world.” Sensor including ambient sound detector (i.e. microphone), camera capturing outside world (i.e. outward-facing camera), and accelerometers (i.e. inertial measurement unit)).  
Please see motivation for claim 24 above.

Claim(s) 38 is rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (US 20170147905 A1, hereinafter Huang) in view of Kokkinos (“Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory”) and Pong et al. (“Trace norm regularization: Reformulations, algorithms, and multi-task learning”).
Regarding claim 38,
Huang and Kokkinos teach the method of claim 28. 
	Huang and Kokkinos do not explicitly disclose
wherein the trained multitask neural network is based on an average gradient norm of the plurality of tasks and the determined training rates.
However, Pong teaches
wherein the trained multitask neural network is based on an average gradient norm of the plurality of tasks and the determined training rates (pg. 2; We focus on linear predictors fℓ(x) = wℓ^T x, where wℓ is the weight vector for the ℓth task. The convex multi-task learning formulation based on the trace norm regularization can be formulated as the following optimization problem: And pg. 11; Thus the method uses a weighted average of all past gradients instead of the most recent gradient.).
Huang, Kokkinos, and Pong are analogous arts because they are directed towards learning gradients for multitask learning.
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the multi-task learning of Huang and Kokkinos with the computation of gradients of Pong.
Doing so would allow for solving convex but non-smooth optimization problems. The proposed method overcomes non-smoothness and nonconvex optimization problems where convergence to a global minimum is not guaranteed (pg. 2).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Aizono (US-20190066131-A1) – discloses an average gradient norm.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HENRY K NGUYEN whose telephone number is (571)272-0217. The examiner can normally be reached Mon - Fri 7:00am-4:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen, can be reached at (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/H.N./Examiner, Art Unit 2121

/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121