DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
Acknowledgement is made of Applicant’s claim amendments on 09/21/2021. The claim amendments are entered. Presently, claims 1-37 remain pending. Claims 1, 9-10, 20, 24, 28, and 34 have been amended.
Response to Arguments
Applicant’s arguments with respect to claim(s) 1, 20, 24, and 28 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Objections
Claim 24 is objected to because of the following informalities: Claim 24 recites the limitation “wherein the relative training rates are determined so that each of the plurality of tasks completes over a similar length of training time”. It appears that the limitation is missing a semicolon and should recite “wherein the relative training rates are determined so that each of the plurality of tasks completes over a similar length of training time;”. Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-8, 13-23 and 28-32 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (US 20170147905 A1, hereinafter Huang) in view of Kendall et al. (“Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics”).
Regarding Claim 1,
Huang discloses a system for training a multitask network comprising (Huang fig. 5A and 8 & [0109] recites “FIG. 8 depicts a simplified block diagram of a computing system comprising an FCN to perform end-to-end multi-task object detection, according to various embodiments of the present invention.” Computing system comprising a Fully Convolutional Neural Network (FCN) with methods depicted in Fig. 5A (i.e. system for training a multitask network)): 
non-transitory memory configured to store: 
executable instructions (Huang [0112] recites “Embodiments of the present invention may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed. It shall be noted that the one or more non-transitory computer-readable media shall include volatile and non-volatile memory.” Non-transitory computer-readable media including memory with instructions (i.e. non-transitory memory storing executable instructions)), and 
a multitask network for determining outputs associated with a plurality of tasks (Huang [0109] recites “FIG. 8 depicts a simplified block diagram of a computing system comprising an FCN to perform end-to-end multi-task object detection, according to various embodiments of the present invention.” FCN to perform end-to-end multi-task object detection (i.e. multitask network for determining outputs from tasks)); and 
a hardware processor in communication with the non-transitory memory, the hardware processor programmed by the executable instructions to (Huang [0112] recites, in part, “…one or more non-transitory computer-readable media shall include volatile and non-volatile memory… alternative implementations are possible, including a hardware implementation or a software/hardware implementation. Hardware-implemented functions may be realized using ASIC(s), programmable arrays, digital signal processing circuitry, or the like... With these implementation alternatives in mind, it is to be understood that the figures and accompanying description provide the functional information…” Hardware-implemented functions realized using ASIC(s), programmable arrays, DSP circuitry, or the like (i.e. hardware processor programmed by instructions) with non-transitory CRM including memory (i.e. in communication with non-transitory memory)): 
receive a training image associated with a plurality of reference task outputs for the plurality of tasks (Huang [0063] recites “In training, an input patch may be considered a "positive patch" if it comprises an object centered in the center at a specific scale. These patches comprise mainly negative samples around the positive samples. In embodiments, in order to fully explore the negative samples in the whole dataset, patches are cropped at random scale from training images, and resized to the same size and fed to the network.” Patches from training images (i.e. receive training image) with negative and positive samples (i.e. reference task outputs)); 
for each task of the plurality of tasks and during training time, 
determine a gradient norm of a single-task loss of (1) a task output for the task determined using the multitask network with the training image as input, and (2) a corresponding reference task output for the task associated with the training image, adjusted by a task weight for the task, with respect to a plurality of network weights of the multitask network (Huang [0058] and [0063] recites, in part, “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction. [0063] In training, an input patch may be considered a “positive patch” if it comprises an object centered in the center at a specific scale. These patches comprise mainly negative samples around the positive samples. In embodiments, in order to fully explore the negative samples in the whole dataset, patches are cropped at random scale from training images, and resized to the same size and fed to the network...The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning.” Gradient descent learning on samples using loss and output gradients (i.e. gradient norm of tasks loss of a task output), positive and negative samples (i.e. reference tasks), and scaled by number of contributing pixels (i.e. reference task output adjusted by task weights)); and 
determine a relative training rate for the task based on the single-task loss for the task (Huang [0063] recites “The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global…” Learning rate (i.e. training rate) and scaling of the loss and gradients to be comparable in multi-task learning (i.e. based on the single-task loss)); 
determine a gradient loss function comprising a difference between (1) the determined gradient norm for each task and (2) a corresponding target gradient norm determined based on (a) an average gradient norm of the plurality of tasks, (b) the relative training rate for the task, and (c) a hyperparameter (Huang [0054], [0058], [0062] and [0063] recite, in part, “a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. [0058] After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction. [0062] …that combining may be performed in any other manner. In embodiments, the balance between classification and regression tasks is controlled by the parameter λloc. For example, the regression target d* may be normalized by dividing by the standard object height… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations.” Loss function (i.e. gradient loss function), gradient descent learning on samples (i.e. a difference between gradient norm for each task and target gradient norm), normalized by dividing by the standard object height (i.e. average gradient norm), learning rate (i.e. training rate), and parameter λloc (i.e. hyperparameter)); 
determine a gradient of the gradient loss function with respect to a task weight for each task of the plurality of tasks (Huang [0054], [0060] and [0063] recite, in part, “a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch outputs the bounding box regression loss... [0060] …and combining the classification loss (Eq. 1) and bounding box regression loss (Eq. 2) with the masks… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Combined loss functions (i.e. gradient of the gradient loss function)); and 
determine an updated task weight for each of the plurality of tasks using the gradient of the gradient loss function with respect to the task weight (Huang [0063] recites “The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by…” Momentum term weight (i.e. updated task weight)).
Huang does not explicitly disclose
determine a relative training rate for the task based on the single-task loss for the task so each of the plurality of tasks are trained at a similar training rate, wherein the relative training rate is associated with the task weight that is configured at each step such that each task completes its training over a similar length of training time;
determine an updated task weight for each of the plurality of tasks using the gradient of the gradient loss function with respect to the task weight, wherein the updated task weights are an improvement over the task weights such that the updated task weights result in each of the plurality of tasks completing over a more similar length of training time than using the task weights.
Application No.: 16/169,840 Filing Date: October 24, 2018
However, Kendall teaches
determine a relative training rate for the task based on the single-task loss for the task so each of the plurality of tasks are trained at a similar training rate (Pg. 6, section 3; “This last objective can be seen as learning the relative weights of the losses for each output. Large scale values σ2 will decrease the contribution of L2(W), whereas small scale σ2 will increase its contribution. The scale is regulated by the last term in the equation. The objective is penalised when setting σ2 too large (with the last term contributing a constant value log C – with C classes – to the loss). The multi-task…” Equation 11 represents the training rate.), wherein the relative training rate is associated with the task weight that is configured at each step such that each task completes its training over a similar length of training time (Pg. 11; “Figure 4: Training plots showing convergence of homoscedastic noise and task loss for an array of initialisation choices for the homoscedastic uncertainty terms for all three tasks. The left plot shows that the loss converges to the same minimum from varying initialisation choices. The centre plot shows the homoscedastic noise value optimises to the same solution from a variety of initialisations. The plots on the right show a zoomed in view of the homoscedastic noise plot, showing the initialisation and convergence over a few hundred training iterations. Despite the network taking 10,000+ iterations for the training loss to converge, the task uncertainty converges very rapidly after only 100 iterations.” The three tasks finish at the same time (100 iterations).);
determine an updated task weight for each of the plurality of tasks using the gradient of the gradient loss function with respect to the task weight (Pg. 1, section 1; “We interpret homoscedastic uncertainty as task-dependant weighting and show how to derive a principled multi-task loss function which can learn to balance various regression and classification losses. Our method can learn to balance these weightings optimally, resulting in superior performance, compared with learning each task individually.”), wherein the updated task weights are an improvement over the task weights such that the updated task weights result in each of the plurality of tasks completing over a more similar length of training time than using the task weights (Pg. 11; “Figure 4: Training plots showing convergence of homoscedastic noise and task loss for an array of initialisation choices for the homoscedastic uncertainty terms for all three tasks. The left plot shows that the loss converges to the same minimum from varying initialisation choices. The centre plot shows the homoscedastic noise value optimises to the same solution from a variety of initialisations. The plots on the right show a zoomed in view of the homoscedastic noise plot, showing the initialisation and convergence over a few hundred training iterations. Despite the network taking 10,000+ iterations for the training loss to converge, the task uncertainty converges very rapidly after only 100 iterations.” The three tasks finish at the same time (100 iterations).).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the multi-task learning of Huang with the learning of multiple objectives using homoscedastic uncertainty of Kendall.
Doing so would allow for finding optimal weights for tasks. Balancing the losses helps find the optimal weighting for each task, which leads to improved performance (pg. 1; “We interpret homoscedastic uncertainty as task-dependent weighting and show how to derive a principled multi-task loss function which can learn to balance various regression and classification losses. Our method can learn to balance these weightings optimally, resulting in superior performance, compared with learning each task individually.”).
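For illustration only, the gradient loss function recited in claim 1 can be sketched as follows. The sketch rests on assumptions that the claim language does not itself fix: the relative training rate is taken as each task's loss ratio (current loss over initial loss) divided by the mean ratio, and the target gradient norm is taken as the average gradient norm multiplied by the relative training rate raised to the hyperparameter alpha. The function names and the absolute-value form of the difference are hypothetical choices and are not drawn from Huang or Kendall.

```python
def relative_training_rates(losses, initial_losses):
    """Assumed definition: loss ratio of each task divided by the mean loss ratio."""
    ratios = [cur / init for cur, init in zip(losses, initial_losses)]
    mean_ratio = sum(ratios) / len(ratios)
    return [r / mean_ratio for r in ratios]


def gradient_loss(grad_norms, rates, alpha):
    """Sum over tasks of |gradient norm - target gradient norm|.

    grad_norms: per-task gradient norms of the weighted single-task losses.
    rates:      per-task relative training rates.
    alpha:      hyperparameter controlling how strongly rates shift the targets.
    """
    avg_norm = sum(grad_norms) / len(grad_norms)
    # Assumed target: average gradient norm scaled by rate ** alpha.
    targets = [avg_norm * r ** alpha for r in rates]
    loss = sum(abs(g - t) for g, t in zip(grad_norms, targets))
    return loss, targets


# Two hypothetical tasks: the first has halved its loss, the second has not improved.
rates = relative_training_rates([0.5, 2.0], [1.0, 2.0])
loss, targets = gradient_loss([3.0, 1.0], rates, alpha=1.0)
```

Under these assumptions, a task that has trained more slowly (a larger relative training rate) receives a larger target gradient norm, which is one way the per-task training rates could be pulled toward one another.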
Regarding claim 2,
Huang and Kendall disclose the system of claim 1, wherein the hardware processor is further programmed by the executable instructions to: 
determine the single-task loss of (1) the task output for each task determined using the multitask network with the training image as input, and (2) the corresponding task output for the task associated with the training image (Huang [0032]-[0033] and [0053] recites, in part, “FIG. 1 illustrates an exemplary object detection pipeline for a convolutional network…100 receives input image 112 or image pyramid that is fed to network 104. After several layers of convolution and pooling, feature map 106 is upsampled and convolution layers are applied to obtain final output 108. [0053] In embodiments, independent branch 360 outputs the confidence score ŷ (per pixel in the output map) of being a target object. Given the ground truth label y*∈ {0,1}, the classification loss can be defined…” Input image (i.e. training image as input)).  

Regarding claim 3,
Huang and Kendall disclose the system of claim 2, wherein the non-transitory memory is configured to further store: a plurality of loss functions associated with the plurality of tasks (Huang [0053]-[0054] recite, in part, “In embodiments, like Fast R-CNN, the network has two sibling output branches. In embodiments, independent branch 360 outputs the confidence score ŷ (per pixel in the output map) of being a target object. Given the ground truth label y*∈ {0,1}, the classification loss can be defined as follows: Lcls (ŷ, y*)=∥ŷ−y∥2  (Eq. 1) [0054] In embodiments, a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss, denoted as Lloc, for example, to minimize the L2 loss between predicted location offsets d^=(d^tx, d^ty, d^bx, d^by) and targets d*=(d*tx, d*ty, d*bx, d*by), as formulized by: Lloc(d^,d*)=Σi∈{tx,ty,bx,by} ∥d^i- d*i∥2 (Eq. 2)”. Classification loss, regression loss, L2 loss (i.e. plurality of loss functions)). 

Regarding claim 4,
Huang and Kendall disclose the system of claim 3, wherein to determine the single-task loss, the hardware processor is further programmed by the executable instructions to: 
determine the single-task loss of (1) the task output for each task determined using the multitask network with the training image as input, and (2) the corresponding task output for the task associated with the training image, using a loss function of the plurality of loss functions associated with the task (Huang [0032]-[0033] and [0053]-[0054] recite, in part, “FIG. 1 illustrates an exemplary object detection pipeline for a convolutional network…100 receives input image 112 or image pyramid that is fed to network 104. After several layers of convolution and pooling, feature map 106 is upsampled and convolution layers are applied to obtain final output 108. [0053] In embodiments, independent branch 360 outputs the confidence score ŷ (per pixel in the output map) of being a target object. Given the ground truth label y*∈ {0,1}, the classification loss can be defined… [0054] In embodiments, a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss, denoted as Lloc, for example, to minimize the L2 loss between predicted location offsets d^=(d^tx, d^ty, d^bx, d^by) and targets d*=(d*tx, d*ty, d*bx, d*by), as formulized by: Lloc(d^,d*)=Σi∈{tx,ty,bx,by} ∥d^i- d*i∥2 (Eq. 2)”).  


Regarding claim 5,
Huang and Kendall disclose the system of claim 1, wherein the hardware processor is further programmed by the executable instructions to: 
determine a multitask loss function comprising the single-task loss adjusted by the task weight for each task (Huang [0033] and [0053] recites, in part, “a single convolutional network simultaneously outputs multiple predicted bounding boxes 120 and class confidences. In embodiments, except for a non-maximum suppression (NMS) step, components of object detection are modeled as an FCN, such that it becomes unnecessary to engage in region proposal generation. [0053] …independent branch 360 outputs the confidence score ŷ (per pixel in the output map) of being a target object. Given the ground truth label y*∈ {0,1}, the classification loss can be defined as follows: Lcls (ŷ, y*) = ∥ŷ−y∥2  (Eq. 1).” Multiple predicted bounding boxes and class confidences each with classification loss (i.e. multitask loss function)); 
determine a gradient of the multitask loss function with respect to all network weights of the multitask network (Huang [0058] and [0063] recite, in part, “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction… In embodiments, in the forward propagation phase, the classification loss (Eq. 1) of output pixels is sorted in descending order, and… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Loss and output gradients (i.e. a gradient of the multitask loss function)); and 
determine updated network weights of the multitask network based on the gradient of the multitask loss function (Huang [0057] and [0063] recite, in part, “… the loss weight is set to 0, and for each pixel labeled non-positive in the output coordinate space, an ignore flag fign is set to 1 if a pixel with positive label within rnear=2 pixel length exists. [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Momentum term weight (i.e. updated task weight)).  


Regarding claim 6,
Huang and Kendall disclose the system of claim 1, wherein the gradient norm of the single-task loss adjusted by the task weight is a L2 norm of the single-task loss adjusted by the task weight (Huang [0053]-[0054] recite, in part, “In embodiments, like Fast R-CNN, the network has two sibling output branches. In embodiments, independent branch 360 outputs the confidence score ŷ (per pixel in the output map) of being a target object. Given the ground truth label y*∈ {0,1}, the classification loss can be defined as follows: Lcls (ŷ, y*)=∥ŷ−y∥2  (Eq. 1) [0054] In embodiments, a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss, denoted as Lloc, for example, to minimize the L2 loss between predicted location offsets d^=(d^tx, d^ty, d^bx, d^by) and targets d*=(d*tx, d*ty, d*bx, d*by), as formulized by: Lloc(d^,d*)=Σi∈{tx,ty,bx,by} ∥d^i- d*i∥2 (Eq. 2)”. L2 Loss such as Lcls with L2 norm notation ∥ŷ−y∥2 (i.e. L2 norm)).  

Regarding claim 7,
Huang and Kendall disclose the system of claim 1, wherein the gradient loss function is a L1 loss function (Huang [0053]-[0054] recite “In embodiments, like Fast R-CNN, the network has two sibling output branches. In embodiments, independent branch 360 outputs the confidence score ŷ (per pixel in the output map) of being a target object. Given the ground truth label y*∈ {0,1}, the classification loss can be defined as follows: Lcls (ŷ, y*)=∥ŷ−y∥2  (Eq. 1) [0054] In embodiments, a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss, denoted as Lloc, for example, to minimize the L2 loss between predicted location offsets d^=(d^tx, d^ty, d^bx, d^by) and targets d*=(d*tx, d*ty, d*bx, d*by), as formulized by: Lloc(d^,d*)=Σi∈{tx,ty,bx,by} ∥d^i- d*i∥2 (Eq. 2)” Loss functions such as hinge loss and cross-entropy loss (i.e. L1 loss function)).

Regarding claim 8,
Huang and Kendall disclose the system of claim 1, wherein the hardware processor is further programmed by the executable instructions to: determine an average of the gradient norms of the plurality of tasks as the average gradient norm (Huang [0058] and [0062]-[0063] recites “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction. [0062] …that combining may be performed in any other manner. In embodiments, the balance between classification and regression tasks is controlled by the parameter λloc. For example, the regression target d* may be normalized by dividing by the standard object height… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used” Normalized by dividing by the standard object height (i.e. average of the gradient norms)).
Regarding claim 13,
Huang discloses the system of claim 1, wherein to determine the gradient of the gradient loss function, the hardware processor is further programmed by the executable instructions to: 
determine the gradient of the gradient loss function with respect to the task weight for each task of the plurality of tasks while keeping the target gradient norm for the task constant (Huang [0058], [0060] and [0063] recite “After negative…” Combined loss functions (i.e. gradient of the gradient loss function)).  
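For illustration only, differentiating a gradient loss with respect to each task weight while holding the target gradient norm constant can be sketched as follows. The sketch assumes an absolute-value (L1) gradient loss and assumes that the weighted gradient norm of a task equals its positive task weight times the gradient norm of the unweighted single-task loss. All names and the plain gradient-descent update are hypothetical and are not drawn from Huang or Kendall.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def task_weight_gradients(task_weights, unweighted_norms, targets):
    """d/dw_i of sum_i |w_i * g_i - t_i|, treating each target t_i as a constant.

    Because t_i is held constant, no derivative flows through the target,
    so the result is sign(w_i * g_i - t_i) * g_i for each task.
    """
    return [
        sign(w * g - t) * g
        for w, g, t in zip(task_weights, unweighted_norms, targets)
    ]


def update_task_weights(task_weights, grads, step_size):
    """One plain gradient-descent step on the task weights (assumed update rule)."""
    return [w - step_size * d for w, d in zip(task_weights, grads)]


# Hypothetical values: task 0 is above its target norm, task 1 is below it.
grads = task_weight_gradients([1.0, 1.0], [3.0, 1.0], [2.0, 2.0])
new_weights = update_task_weights([1.0, 1.0], grads, step_size=0.1)
```

In this sketch the weight of the task whose gradient norm exceeds its target decreases and the weight of the other task increases.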

Regarding claim 14,
Huang discloses the system of claim 1, wherein the hardware processor is further programmed by the executable instructions to: normalize the updated weights for the plurality of tasks (Huang [0062]-[0063] recites “…that combining may be performed in any other manner. In embodiments, the balance between classification and regression tasks is controlled by the parameter λloc. For example, the regression target d* may be normalized by dividing by the standard object height… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.”).  
Regarding claim 15,
Huang and Kendall disclose the system of claim 14, wherein to normalize the updated weights for the plurality of tasks, the hardware processor is further programmed by the executable instructions to: normalize the updated weights for the plurality of tasks to a number of the plurality of tasks (Huang [0062]-[0063] recite “One of skill in the art will also appreciate that combining may be performed in any other manner. In embodiments, the balance between classification and regression tasks is controlled by the parameter λloc. For example, the regression target d* may be normalized by dividing by the standard object height, which is 50/4 in ground truth map, and setting λloc=3. [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.”).  
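For illustration only, one reading of normalizing the updated weights "to a number of the plurality of tasks" is to rescale the task weights so that they sum to the number of tasks. That reading is an assumption, and the function name below is hypothetical.

```python
def normalize_task_weights(task_weights):
    """Rescale task weights so they sum to the number of tasks (assumed reading)."""
    total = sum(task_weights)
    if total <= 0:
        raise ValueError("task weights must have a positive sum")
    scale = len(task_weights) / total
    return [w * scale for w in task_weights]


# Hypothetical updated weights for two tasks.
normalized = normalize_task_weights([0.7, 1.1])
```

The rescaling preserves the ratio between task weights, so only their overall magnitude changes.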

Regarding claim 16,
Huang and Kendall disclose the system of claim 1, wherein the plurality of tasks comprises a regression task, a classification task, or a combination thereof (Huang [0044] and [0062] recite, in part, “FIG. 3 illustrates a network architecture according to various embodiments of the present disclosure. Network architecture 300 in example in FIG. 3 is derived from the VGG 19 model used for image classification. [0062] In embodiments, the balance between classification and regression tasks is controlled by the parameter λloc.” Classification and regression tasks.).

Regarding claim 17,
Huang and Kendall disclose the system of claim 16, wherein the classification task comprises perception, face recognition, visual search, gesture recognition, semantic segmentation, object detection, room layout estimation, cuboid detection, lighting detection, simultaneous localization and mapping, relocalization, speech processing, speech recognition, natural language processing, or a combination thereof (Huang [0030] and [0104] recites, in part, “Object detection often involves multi-task learning, such as landmark localization, pose estimation, and semantic segmentation. [0104] Training and Testing. As with face detection, two models—one with and one without landmark localization—are trained on the KITTI object detection training set. Since KITTI does not provide landmarks for cars, 8 landmarks shown in FIG. 5 are annotated for 7790 cars…” Object detection, landmark localization (i.e. simultaneous localization and mapping), and semantic segmentation).

Regarding claim 18,
Huang and Kendall disclose the system of claim 1, wherein the multitask network comprises a plurality of shared layers and an output layer comprising a plurality of task specific filters (Huang fig. 3 and [0044] & [0046] recite “Network architecture 300 in example in FIG. 3 is derived from the VGG 19 model used for image classification… network architecture 300 comprises 16 convolution layers, 12 convolution layers labeled Conv1_1 304 through Conv4_4 330; and 3 pooling layers 340-344. [0046] Upsampling is performed by bi-linear filtering, for example, to generate a 4×4 matrix from a 2×2 matrix patch using linear interpolation. In embodiments, to…” Convolutional layers with pooling layers (i.e. shared layers) and bi-linear filtering to two independent branches for detection and localization (i.e. output layer with task specific filters)).  

Regarding claim 19,
Huang and Kendall disclose the system of claim 18, wherein the output layer of the multitask network comprises an affine transformation layer (Huang [0046] and [0048] recite “Upsampling is performed by bi-linear filtering, for example, to generate a 4×4 matrix from a 2×2 matrix patch using linear interpolation. In embodiments, to compute a final score, the upsampled feature map is input to two independent branches 360-362. In FIG. 3, the first branch begins with Conv5_1_det 352, a convolution layer for detection, and the second branch begins with Conv5_1_loc 356, a convolution layer for localization. One of ordinary skill in the art will appreciate that computations in the independent branches may be performed simultaneously. [0048] Multi-Level Feature Fusion. In embodiments, features from different convolution layers are combined to enhance the performance of certain tasks, such as edge detection and segmentation. Part-level features focus on local details of objects to find discriminative appearance parts, whereas object-level or high-level features usually have a larger receptive field in order to recognize objects. A larger receptive field also…” Upsampling to generate 4x4 matrix from 2x2 matrix using linear interpolation through bi-linear filtering (i.e. affine transformation layer) for computing a final score).  
Regarding claim 20,
Huang discloses a method for training a multitask network comprising: 
under control of a hardware processor 
receiving a training datum of a plurality of training data each associated with a plurality of reference task outputs for the plurality of tasks (Huang [0032] and [0063] recites “FIG. 1 illustrates an exemplary object detection pipeline for a convolutional network according to various embodiments of the present disclosure. In embodiments, pipeline 100 receives input image 112 or image pyramid that is fed to network 104. After several layers of convolution and pooling, feature map 106 is upsampled and convolution layers are applied to obtain final output 108. [0063] In training, an input patch may be considered a “positive patch” if it comprises an object centered in the center at a specific scale. These patches comprise mainly negative samples around the positive samples. In embodiments, in order to fully explore the negative samples in the whole dataset, patches are cropped at random scale from training images, and resized to the same size and fed to the network.” Input patch (i.e. training datum) from received images (i.e. training data) with positive and negative samples (i.e. reference task outputs)); 
for each task of the plurality of tasks during a training, 
determining a gradient norm of a single-task loss adjusted by a task weight for the task, with respect to a plurality of network weights of the multitask network, the single-task loss being of (1) a task output for the task determined using a multitask network with the training datum as input, and (2) a corresponding reference task output for the task associated with the training datum (Huang [0058] and [0063] recite “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction. [0063] In training, an input patch may be considered a “positive patch” if it comprises an object centered in the center at a specific scale. These patches comprise mainly negative samples around the positive samples. In embodiments, in order to fully explore the negative samples in the whole dataset, patches are cropped at random scale from training images, and resized to the same size and fed to the network... The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning.” Gradient descent learning on samples using loss and output gradients (i.e. gradient norm of tasks loss of a task output), positive and negative samples (i.e. reference tasks) and scaled by number of contributing pixels (i.e. reference task output adjusted by task weights)) [0032]-[0039]); and 
determining a gradient loss function comprising a difference between (1) the determined gradient norm for each task and (2) a corresponding target gradient norm determined based on (a) an average gradient norm of the plurality of tasks, and (b) the relative training rate for the task (Huang [0054], [0060] and [0063] recites “…a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss… [0060] …combining the classification loss (Eq. 1) and bounding box regression loss (Eq. 2) with the masks… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Loss function (i.e. gradient loss function), gradient descent learning on samples (i.e. a difference between gradient norm for each task and target gradient norm), normalized by dividing by the standard object height (i.e. average gradient norm), learning rate (i.e. training rate), and parameter λloc (i.e. hyperparameter)); and 
determining an updated task weight for each of the plurality of tasks using a gradient of a gradient loss function with respect to the task weight (Huang [0063] recites “The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Momentum term weight (i.e. updated task weight)).
Huang does not explicitly disclose
determining a relative training rate for the task based on the single- task loss for the task; 
wherein the relative training rates are determined so that each of the plurality of tasks completes over a similar length of training time;
However, Kendall teaches
determining a relative training rate for the task based on the single-task loss for the task (Pg. 6, section 3; “This last objective can be seen as learning the relative weights of the losses for each output. Large scale values σ2 will decrease the contribution of L2(W), whereas small scale σ2 will increase its contribution. The scale is regulated by the last term in the equation. The objective is penalised when setting σ2 too large (with the last term contributing a constant value log C – with C classes – to the loss). The multi-task objective with homoscedastic task uncertainty now becomes: L(W, σ1, σ2, ..., σi) = Σi (1/(2σi²)) Li(W) + log σi² (11) over all tasks indexed by i. Again, we…” Equation 11 represents the training rate.); 
wherein the relative training rates are determined so that each of the plurality of tasks completes over a similar length of training time (Pg. 11; “Figure 4: Training plots showing convergence of homoscedastic noise and task loss for an array of initialisation choices for the homoscedastic uncertainty terms for all three tasks. The left plot shows that the loss converges to the same minimum from varying initialisation choices. The centre plot shows the homoscedastic noise value optimises to the same solution from a variety of initialisations. The plots on the right show a zoomed in view of the homoscedastic noise plot, showing the initialisation and convergence over a few hundred training iterations. Despite the network taking 10,000+ iterations for the training loss to converge, the task uncertainty converges very rapidly after only 100 iterations.” The three tasks finish at the same time (100 iterations).);
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the multi-task learning of Huang with the learning of multiple objectives using homoscedastic uncertainty of Kendall.
Doing so would allow for finding optimal weights for tasks. Balancing loss helps find the optimal weighting for each task which leads to improved performance (pg. 1; We interpret homoscedastic uncertainty as task-dependent weighting and show how to derive a principled multi-task loss function which can learn to balance various regression and classification losses. Our method can learn to balance these weightings optimally, resulting in superior performance, compared with learning each task individually.)
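As a minimal sketch of Equation 11 of Kendall as quoted above, the multi-task objective may be computed as shown below. The sketch assumes that the sum runs over all tasks i, that both the scaled loss term and the log term are inside the sum, and that each σi is a scalar. The function name and arguments are illustrative assumptions and do not appear in Kendall.

```python
import math


def uncertainty_weighted_loss(task_losses, sigmas):
    """Sum over tasks of L_i / (2 * sigma_i^2) + log(sigma_i^2).

    task_losses: per-task loss values L_i(W) (hypothetical inputs).
    sigmas: per-task scale values sigma_i, each strictly positive.
    """
    if len(task_losses) != len(sigmas):
        raise ValueError("one sigma is required per task loss")
    total = 0.0
    for loss, sigma in zip(task_losses, sigmas):
        variance = sigma ** 2
        # A larger sigma shrinks the loss term; the log term penalises
        # letting sigma grow without bound.
        total += loss / (2.0 * variance) + math.log(variance)
    return total
```

With all σi equal to 1 the log terms vanish and each loss is simply halved, so two losses of 2.0 and 4.0 yield 3.0.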
Regarding claim 21,
Huang and Kendall disclose the method of claim 20, wherein the corresponding target gradient norm is determined based on (a) an average gradient norm of the plurality of tasks, (b) the relative training rate for the task, and (c) a hyperparameter (Huang [0058] and [0062]-[0063] recites “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction. [0062] …that combining may be performed in any other manner. In embodiments, the balance between classification and regression tasks is controlled by the parameter λloc. For example, the regression target d* may be normalized by dividing by the standard object height… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning.” Loss function (i.e. gradient loss function), gradient descent learning on samples (i.e. a difference between gradient norm for each task and target gradient norm), normalized by dividing by the standard object height (i.e. average gradient norm), learning rate (i.e. training rate), and parameter λloc (i.e. hyperparameter)).

Regarding claim 22,
Huang and Kendall disclose the method of claim 20, further comprising: determining the gradient of the gradient loss function with respect to a task weight for each task of the plurality of tasks (Huang [0058], [0060] and [0063] recites “After negative mining, the badly predicted samples are relatively more likely to…” Combined loss functions (i.e. gradient of the gradient loss function)). 
 
Regarding claim 23,
Huang and Kendall disclose the method of claim 20, wherein the plurality of training data comprises a plurality of training images, and wherein the plurality of tasks comprises computer vision tasks (Huang [0029], [0030] and [0032] recites, in part, “More recent designs use deep CNNs to locate objects…these methods use shared computation of convolutions, which has been attracting increased attention due to its relatively efficient and accurate visual recognition. [0030] Object detection often involves multi-task learning, such as landmark localization, pose estimation, and semantic segmentation. [0032] FIG. 1 illustrates an exemplary object detection pipeline for a convolutional network according to various embodiments of the present disclosure. In embodiments, pipeline 100 receives input image 112 or image pyramid that is fed to network 104. After several layers of convolution and pooling, feature map 106 is upsampled and convolution layers are applied to obtain final output 108. In embodiments, output feature 108 map is converted to bounding boxes 120, and non-maximum suppression is applied to bounding boxes 120 that exceed a threshold.” Pipeline for receiving input images (i.e. training data comprising training images) and object detection with CNNs for visual recognition (i.e. computer vision)).   
Regarding Claim 28,
Huang teaches A method for training a multitask neural network for determining outputs associated with a plurality of tasks, the method comprising: under control of a hardware processor, and during a training: 
receiving a training sample set associated with a plurality of reference task outputs for the plurality of tasks (Huang [0063] recites “In training, an input patch may be considered a “positive patch” if it comprises an object centered in the center at a specific scale. These patches comprise mainly negative samples around the positive samples. In embodiments, in order to fully explore the negative samples in the whole dataset, patches are cropped at random scale from training images, and resized to the same size and fed to the network.” Input patch from training images with positive and negative samples (i.e. training sample set associated with reference tasks outputs)); 
calculating a multitask loss function based at least partly on a weighted combination of a plurality of single task loss functions associated with the plurality of tasks, wherein a plurality of weights associated with the plurality of single task loss functions in the multitask loss function can vary at each training timestep (Huang [0054] and [0063] recites, in part, “…a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In…” Loss function used in both face and car detection tasks (i.e. multitask loss function), loss functions such as L2 loss, hinge loss, and cross-entropy loss (i.e. combination of single task loss functions), weight decay and loss and output gradients scaled (i.e. weights with varying loss), iterations (i.e. steps)); 
determining the plurality of weights for associated single task loss functions such that each task of the plurality of tasks is trained at a similar training rate, (Huang [0063] recites, in part, “In training, an input patch may be considered a “positive patch” if it comprises an object centered in the center at a specific scale. These patches comprise mainly negative samples around the positive samples. In embodiments, in order to fully explore the negative samples in the whole dataset, patches are cropped at random scale from training images, and resized to the same size and fed to the network… The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Loss and output gradients be scaled so that both loss and output gradients are comparable in multi-task learning using learning rate and weight decay (i.e. weights and loss at similar scale and rate)); and 
outputting a trained multitask neural network based at least in part on the training (Huang [0067] and [0100] recites, in part, “the final output refine branch uses the classification score map and landmark localization maps as input to refine the…” Final output branch and trained model (i.e. trained multitask neural network)).
Huang does not explicitly disclose
wherein the training rate is determined so that each of the plurality of tasks completes over a similar length of training time;
However, Kendall teaches
wherein the training rate (Pg. 6, section 3; “This last objective can be seen as learning the relative weights of the losses for each output. Large scale values σ2 will decrease the contribution of L2(W), whereas small scale σ2 will increase its contribution. The scale is regulated by the last term in the equation. The objective is penalised when setting σ2 too large (with the last term contributing a constant value log C – with C classes – to the loss). The multi-task objective with homoscedastic task uncertainty now becomes: L(W, σ1, σ2, ..., σi) = Σi (1/(2σi²)) Li(W) + log σi² (11) over all tasks indexed by i. Again, we write Li(W) = ||yi − fW(x)||² for regression losses yi, and Li(W) = − log Softmax(yi, fW(x)) for classification losses.” Equation 11 represents the training rate.) is determined so that each of the plurality of tasks completes over a similar length of training time (Pg. 11; “Figure 4: Training plots showing convergence of homoscedastic noise and task loss for an array of initialisation choices for the homoscedastic uncertainty terms for all three tasks. The left plot shows that the…” The three tasks finish at the same time (100 iterations).);
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the multi-task learning of Huang with the learning of multiple objectives using homoscedastic uncertainty of Kendall.
Doing so would allow for finding optimal weights for tasks. Balancing loss helps find the optimal weighting for each task which leads to improved performance (pg. 1; We interpret homoscedastic uncertainty as task-dependent weighting and show how to derive a principled multi-task loss function which can learn to balance various regression and classification losses. Our method can learn to balance these weightings optimally, resulting in superior performance, compared with learning each task individually.)

Regarding claim 29,
Huang and Kendall disclose the method of claim 28, wherein the tasks comprise computer vision tasks, speech recognition tasks, natural language processing tasks, or medical diagnostic tasks (Huang [0029]-[0030] recites “More recent designs use deep CNNs to locate objects… these methods use shared…” Visual recognition and object detection (i.e. computer vision tasks)).  

Regarding claim 30,
Huang and Kendall disclose the method of claim 28, wherein the multitask loss function is a linear combination of the weights and the single task loss functions (Huang [0048], [0054] and [0057] recite, in part, “…features from different convolution layers are combined to enhance the performance of certain tasks, such as edge detection and segmentation. Part-level features focus on local details of objects to find discriminative appearance parts, whereas object-level or high-level features usually have a larger receptive field in order to recognize objects. A larger receptive field also comprises context information that may aid in predicting more accurate results. [0054] … a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss… [0057] …the loss weight is set to 0, and for each pixel labeled non-positive in the output coordinate space, an ignore flag fign is set to 1 if a pixel with positive label within rnear=2 pixel length exists.” Convolution layers combined to enhance performance of certain tasks (i.e. combination), using loss functions such as L2 loss, hinge loss and cross-entropy loss (i.e. combination of single task loss), and loss weights for each pixel (i.e. weights)).  

Regarding claim 31,
Huang and Kendall disclose the method of claim 28, wherein determining the weights for each of the single task loss functions comprises penalizing the multitask neural network when backpropagated gradients from a first task of the plurality of tasks are substantially different from backpropagated gradients from a second task of the plurality of tasks (Huang [0058] and [0067] recites “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction. In embodiments, negative mining may be performed efficiently by using information about previous decision boundaries (also referred to as online bootstraping). [0067] … the final output refine branch uses the classification score map and landmark localization maps as input to refine the detection results. In embodiments, to further increase detection performance, a high level spatial model may be used to learn the constraints of landmark confidence and bounding box scores.” Gradient descent learning on samples (i.e. backpropagated gradients), negative mining with badly predicted samples (i.e. penalizing network), using classification and localization maps to refine (i.e. a first task and a second task)).

Regarding claim 32,
Huang and Kendall disclose the method of claim 28, wherein determining the weights for each of the single task loss functions comprises decreasing a first weight for a first task of the plurality of tasks relative to a second weight for a second task of the plurality of tasks when a first training rate for the first task exceeds a second training rate for the second task (Kendall pg. 6 “This last objective can be seen as learning the relative weights of the losses for each output. Large scale values σ2 will decrease the contribution of L2(W), whereas small scale σ2 will increase its contribution. The scale is regulated by the last term in the equation.”).

Claims 9-12 and 33-37 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (US 20170147905 A1, hereinafter Huang) in view of Kendall et al. (“Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics”) and Natarajan (US 20130013275 A1).
Regarding claim 9,
Huang and Kendall disclose the system of claim 1 (Huang [0109] recites, in part, “FIG. 8 depicts a simplified block diagram of a computing system comprising an FCN to perform end-to-end multi-task object detection…”). 
However, Huang does not disclose wherein the corresponding target gradient norm is determined based on (a) an average gradient norm of the plurality of tasks, (b) an inverse of the relative training rate for the task, and (c) a hyperparameter.
Natarajan teaches wherein the corresponding target gradient norm is determined based on (a) an average gradient norm of the plurality of tasks, (b) an inverse of the relative training rate for the task, and (c) a hyperparameter (Natarajan [0035], [0100], [0105], and [0107] recite, in part, “Thus, if μ denotes the mean parameter, then from the formula (3) μ =b'(θ), where b'(*) is an invertible mean-value mapping (i.e., a mapping between the canonical parameter and the mean parameter is an invertible change of the parameters), and its inverse b'-1(*) is termed a canonical link function. [0100] Alternatively, the special form (e.g., the formula (12) of the likelihood function in these cases (i.e., the Normal, Gamma and Inverse Gaussian distributions) suggests that existing GLM macros (i.e., a rule or pattern that specifies how a certain input sequence should be mapped to an output sequence according to GLM) can be used to handle joint modeling of the mean and dispersion. The mean and dispersion sub-models in the formulas (52)-(53) are equivalent to loss functions used in the GLM. Therefore, starting from an initial estimate of Φ, the computing system 1600 alternates between solving a GLM model for μ with Φ held fixed, followed by the GLM for Φ with μ held fixed (the latter GLM will use standard macros that are available for performing Gamma regression in the Normal and Inverse Gaussian cases). [0105] Every function in the function space can be represented as a linear combination of basis functions, just as every vector in a vector space can be represented as a linear combination of basis vectors. [0107] …the computing system 1600 chooses the basis functions that maximally correlate with a corresponding steepest-descent gradient direction (i.e., a gradient method in which a choice of a direction is where a function f decreases most quickly, which is the direction opposite to a gradient of the function f) of the deviance loss function.” Mean parameter μ (i.e. average gradient norm), invertible mean-value mapping (i.e. inverse of the relative training rate), and Φ (i.e. hyperparameter)).
Natarajan and Huang are both directed to problems involving joint modeling and optimization for a variety of data sets. In view of the teachings of Natarajan, it would have been obvious to one of ordinary skill in the art to apply the teachings of Natarajan to Huang before the effective filing date of the claimed invention in order to perform more advanced model training by incorporating joint modeling of the mean and the dispersion for a variety of data sets thereby improving Huang (cf. Natarajan [0006]-[0007] recites, in part, the following: 
“Traditionally, GLM (General Linear Model), which is widely used for mean regression modeling, has been used for conditional response distributions from the exponential dispersion family. The GLM is a statistical linear model for a suitable transformation of the mean, term the link transformation. The GLM may be represented as g(Y)=XB+U, where Y is a vector with series of response measurements, g(.) is the link function that is chosen appropriately for the assumed response distribution, X is a design matrix, B is a vector including parameters to be estimated, and U is a vector including errors and noises. The design matrix refers to a matrix of explanatory variables (one or zeroes, or reals), that represents a specific statistical or experimental model). However, a traditional methodology such as the GLM cannot perform joint modeling of the mean and the dispersion without inventing additional art, particularly in the case when the covariates in the data are complex, and must be simplified and grouped in a preprocessing step that considerably detracts from the quality of resulting model.

Therefore, it is highly desirable to provide a system and method to perform joint modeling of a mean and dispersion suitable for a wide variety of data sets without requiring any preprocessing and grouping of the sample data.”
).


Regarding claim 10,
The Huang/Kendall/Natarajan Combination teaches the system of claim 9, wherein the hardware processor is further programmed by the executable instructions to: 
determine the average gradient norm of the plurality of tasks multiplied by the inverse relative training rate for the task to a power corresponding of the hyperparameter as the corresponding target gradient norm (Natarajan [0035], [0062], [0100], [0105] and [0107] recite, in part, “Thus, if μ denotes the mean parameter, then from the formula (3) μ =b'(θ), where b'(*) is an invertible mean-value mapping (i.e., a mapping between the canonical parameter and the mean parameter is an invertible change of the parameters), and its inverse b'-1(*) is termed a canonical link function. A link function serves to link a random or stochastic component of a model, a probability distribution of a response variable, to a systematic component of a model (e.g., a linear predictor). [0062] The form c(y, Φ) in the formula (11), which is exact only for the Normal, Gamma and Inverse Gaussian distributions, also has the same form with a saddlepoint density approximation in a leading-order term for Φ→0 for other conditional response distributions from the exponential dispersion family. [0100] Alternatively, the special form (e.g., the formula (12) of the likelihood function in these cases (i.e., the Normal, Gamma and Inverse Gaussian distributions) suggests that existing GLM macros (i.e., a rule or pattern that specifies how a certain input sequence should be mapped to an output sequence according to GLM) can be used to handle joint modeling of the mean and dispersion. The mean and dispersion sub-models in the formulas (52)-(53) are equivalent to loss functions used in the GLM. Therefore, starting from an initial estimate of Φ, the computing system 1600 alternates between solving a GLM model for…” Examiner interprets μ (i.e. average gradient), inverse b'-1(*) as inverse and corresponding steepest-descent gradient direction as average gradient norm multiplied by inverse relative training rate, and Φ as hyperparameter.).  
Please see motivation for claim 9 above.
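
As a minimal sketch of the computation recited in claim 10, the target gradient norm for each task may be formed as the average gradient norm multiplied by the inverse relative training rate raised to the power of the hyperparameter. The function name, the argument names, and the use of plain Python lists are illustrative assumptions and are not taken from the claims or from any cited reference.

```python
def target_gradient_norms(grad_norms, inverse_rates, alpha):
    """Return one target gradient norm per task.

    grad_norms: per-task gradient norms (hypothetical inputs).
    inverse_rates: per-task inverse relative training rates.
    alpha: the hyperparameter used as the exponent.
    """
    if len(grad_norms) != len(inverse_rates):
        raise ValueError("one inverse rate is required per task")
    # Average gradient norm across all tasks.
    average = sum(grad_norms) / len(grad_norms)
    # Target for each task: average * (inverse rate) ** alpha.
    return [average * (rate ** alpha) for rate in inverse_rates]
```

Under these assumptions, setting alpha to 0 gives every task the same target (the average gradient norm), while larger values of alpha spread the targets further apart.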

Regarding claim 11,
The Huang/Kendall/Natarajan Combination teaches The system of claim 9, wherein to determine the relative training rate for the task based on the single-task loss for the task, the hardware processor is further programmed by the executable instructions to: 
determine the inverse of the relative training rate for the task based on a loss ratio of the single-task loss for the task and another single-task loss for the task (Natarajan [0118] and [0135] recite, in part, “Where μk{I}, Φk{I} denote the mean and dispersion estimates respectively at the k'th stage from training data for the I'th fold, and L{-I}(μk{I}, Φk{I}) denotes a loss function for test data in the I'th fold at the k'th stage. [0135] The table 3 describes for various response distributions… when a "correct" loss function is used for a model fit. For all three response distributions (i.e., Normal, Gamma and Inverse Gaussian), a choice k=1 yields the best model fit, and this is also the simplest basis function that is consistent with an assumed piecewise-constant variation in the synthetic data.” Loss ratios depicted in Table 3 for different distributions and tasks).
Please see motivation for claim 9.
  
Regarding claim 12,
The Huang/Kendall/Natarajan Combination teaches the system of claim 11, wherein to determine the inverse of the relative rate for the task, the hardware processor is further programmed by the executable instructions to: 
determine a ratio of the loss ratio of the task and an average of loss ratios of the plurality of tasks as the inverse of the relative training rate (Natarajan [0103], [0118], [0122] and [0135] recites, in part, “…the computing system 1600 obtains a deviance loss function (e.g., a formula (30)) [0118] Where μk{I}, Φk{I} denote the mean and dispersion estimates respectively at the k'th stage from training data for the I'th fold, and L{-I}(μk{I}, Φk{I}) denotes a loss function for test data in the I'th fold at the k'th stage. The number of cross-validation folds NCV is typically 5 or 10. Other criteria such as the 1-SE rule, in which the K is the smallest number of stages for which the cross-validation loss is within 1 standard error of a minimum cross-validation loss, can also be used as…” Examiner interprets cross-validation as ratio of loss ratios and cross-validating across stages as average of loss ratios).  
Please see motivation for claim 9.
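
As a minimal sketch of the computation recited in claims 11 and 12, the inverse relative training rate for each task may be formed as that task's loss ratio divided by the average of the loss ratios across tasks. The sketch assumes that the "another single-task loss" of claim 11 is a reference loss for the same task, such as a loss recorded earlier in training; that choice, along with the function and argument names, is an assumption made for illustration only.

```python
def inverse_relative_training_rates(current_losses, reference_losses):
    """Return one inverse relative training rate per task.

    current_losses: current single-task loss per task (hypothetical inputs).
    reference_losses: another single-task loss per task, assumed nonzero.
    """
    if len(current_losses) != len(reference_losses):
        raise ValueError("one reference loss is required per task")
    # Loss ratio for each task.
    ratios = [cur / ref for cur, ref in zip(current_losses, reference_losses)]
    # Average of the loss ratios across all tasks.
    mean_ratio = sum(ratios) / len(ratios)
    # Ratio of each task's loss ratio to the average loss ratio.
    return [ratio / mean_ratio for ratio in ratios]
```

Under these assumptions, a task whose loss has fallen less than average receives a value above 1, and a task whose loss has fallen more than average receives a value below 1.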

Regarding claim 33,
The Huang/Kendall/Natarajan Combination teaches the method of claim 28, wherein determining the weights for each of the single task loss functions comprises: 
evaluating a gradient norm of a weighted single-task loss function for each task of the plurality of tasks with respect to the weights at a training time (Huang [0057]-[0058] and [0063] recite, in part, “… the loss weight is set to 0, and for each pixel labeled non-positive in the output coordinate space, an ignore flag fign is set to 1 if a pixel with positive label within rnear=2 pixel length exists. [0058] After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Gradient descent learning on samples using loss and output gradients (i.e. gradient norm of weighted tasks loss of for each task output)); 
evaluating an average gradient norm across all tasks at the training time (Huang [0057]-[0058] and [0063] recite, in part, “… the loss weight is set to 0, and for each pixel labeled non-positive in the output coordinate space, an ignore flag fign is set to 1 if a pixel with positive label within rnear=2 pixel length exists. [0058] After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in…” Examiner interprets normalized by dividing by the standard object height (i.e. average gradient norm));
However, Huang does not disclose calculating a relative inverse training rate for each task of the plurality of tasks; and calculating a gradient loss function based at least partly on differences between the gradient norms of each of the weighted single-task loss functions and the average gradient norm multiplied by a function of the relative inverse training rate. 
Natarajan teaches calculating a relative inverse training rate for each task of the plurality of tasks (Natarajan [0035], [0100], [0105], and [0107] recite, in part, “Thus, if μ denotes the mean parameter, then from the formula (3) μ =b'(θ), where b'(*) is an invertible mean-value mapping (i.e., a mapping between the canonical parameter and the mean parameter is an invertible change of the parameters), and its inverse b'-1(*) is termed a canonical link function. [0100] Alternatively, the special form (e.g., the formula (12) of the likelihood function in these cases (i.e., the Normal, Gamma and Inverse Gaussian distributions) suggests that existing GLM macros (i.e., a rule or pattern that specifies how a certain input sequence should be mapped to an output sequence according to GLM) can be used to handle joint modeling of the mean and dispersion. The mean and dispersion sub-models in the formulas (52)-(53) are equivalent to loss functions used in the GLM. Therefore, starting from an initial estimate of Φ, the computing system 1600 alternates between solving a GLM model for μ with Φ held fixed, followed by the GLM for Φ with μ held fixed (the latter GLM will use standard…” Examiner interprets mean parameter μ (i.e. average gradient norm) and invertible mean-value mapping (i.e. inverse of the relative training rate)); and  
calculating a gradient loss function based at least partly on differences between the gradient norms of each of the weighted single-task loss functions and the average gradient norm multiplied by a function of the relative inverse training rate (Natarajan [0035], [0100], [0105], and [0107] recite, in part, “Thus, if μ denotes the mean parameter, then from the formula (3) μ =b'(θ), where b'(*) is an invertible mean-value mapping (i.e., a mapping between the canonical parameter and the mean parameter is an invertible change of the parameters), and its inverse b'-1(*) is termed a canonical link function. [0100] Alternatively, the special form (e.g., the formula (12) of the likelihood function in these cases (i.e., the Normal, Gamma and Inverse Gaussian distributions) suggests that existing GLM macros (i.e., a rule or pattern that specifies how a certain input sequence should be mapped to an output sequence according to GLM) can be used to handle joint modeling of the mean and dispersion. The mean and dispersion sub-models in the formulas (52)-(53) are equivalent to loss…” Examiner interprets deviance loss function (i.e. gradient loss function) and function f (i.e. a function of the relative inverse training rate)).  
Natarajan and Huang are both directed to problems involving joint modeling and optimization for a variety of data sets. In view of the teachings of Natarajan, it would have been obvious to one of ordinary skill in the art to apply the teachings of Natarajan to Huang before the effective filing date of the claimed invention in order to perform more advanced model training by incorporating joint modeling of the mean and the dispersion for a variety of data sets thereby improving Huang (cf. Natarajan [0006]-[0007] recites, in part, the following: 
“Traditionally, GLM (General Linear Model), which is widely used for mean regression modeling, has been used for conditional response distributions from the exponential dispersion family. The GLM is a statistical linear model for a suitable transformation of the mean, term the link transformation. The GLM may be represented as g(Y)=XB+U, where Y is a vector with series of response measurements, g(.) is the link function that is chosen…

Therefore, it is highly desirable to provide a system and method to perform joint modeling of a mean and dispersion suitable for a wide variety of data sets without requiring any preprocessing and grouping of the sample data.”).

Regarding claim 34,
The Huang/Kendall/Natarajan Combination teaches the method of claim 33, wherein the gradient loss function comprises an L1 loss function (Huang [0053]-[0054] recite “In embodiments, like Fast R-CNN, the network has two sibling output branches. In embodiments, independent branch 360 outputs the confidence score ŷ (per pixel in the output map) of being a target object. Given the ground truth label y*∈ {0,1}, the classification loss can be defined as follows: Lcls (ŷ, y*)=∥ŷ−y∥2  (Eq. 1) [0054] In embodiments, a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. In embodiments, independent branch 360 outputs the bounding-box regression loss, denoted as Lloc, for example, to minimize the L2 loss between predicted location offsets d^=d^tx, d^ty, d^tx, d^ty) and targets d*=(d*tx, d*ty, d*tx, d*ty), as formulized by: Lloc(d^,d*)=Σi∈{tx,ty,bx,by} ∥d^i- d*i∥2 (Eq. 2)” Loss functions such as hinge loss and cross-entropy loss (i.e. L1 loss function)).
Please see motivation for claim 33 above.
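
For illustration only, the quantities recited in claims 33 and 34 can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application or from any cited reference: the function names are hypothetical, the inverse training rate of a task is assumed to be the ratio of its current loss to its initial loss, and the "relative" rate is assumed to be that ratio divided by its mean over all tasks.

```python
def relative_inverse_training_rates(initial_losses, current_losses):
    """Assumed definition: loss ratio L_i(t) / L_i(0), normalized by its mean over tasks."""
    ratios = [cur / init for cur, init in zip(current_losses, initial_losses)]
    mean_ratio = sum(ratios) / len(ratios)
    return [r / mean_ratio for r in ratios]


def l1_gradient_loss(grad_norms, rel_inv_rates, f=lambda r: r):
    """Sum of absolute differences between each task's gradient norm and its target.

    The target for a task is the average gradient norm multiplied by f(rate),
    where f is any function of the relative inverse training rate
    (identity by default; the default is an assumption of this sketch).
    """
    avg_norm = sum(grad_norms) / len(grad_norms)
    return sum(abs(g - avg_norm * f(r)) for g, r in zip(grad_norms, rel_inv_rates))


# Hypothetical two-task example: task 0 has halved its loss, task 1 has not improved.
rates = relative_inverse_training_rates([2.0, 4.0], [1.0, 4.0])
loss = l1_gradient_loss([1.0, 3.0], rates)
```

In this sketch a task that has trained more slowly receives a rate above 1 and therefore a larger target gradient norm; the loss is zero only when every gradient norm already equals its target.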

Regarding claim 35,
The Huang/Kendall/Natarajan Combination teaches the method of claim 33, wherein the function of the relative inverse training rate comprises a power law function (Natarajan [0031], [0037], and [0062] recites, in part, “[0031] The present invention thus provides: (1) An easy incorporation of relevant nonlinear and low-order covariate interaction effects in regression functions by representing the regression functions as piecewise-constant, additive and non-linear function (nonlinear effects implies that a covariate enters into a regression function not only as a linear term Xi, but also as nonlinear terms such as Xi2 etc. [0037] A convolution property of the exponential dispersion family yields a relationship for statistical parameters of a distribution of homogeneous sample aggregates from an underlying distribution ED(μ, Φ). [0062] The form c(y, Φ) in the formula (11), which is exact only for the Normal, Gamma and Inverse Gaussian distributions, also has the same form with a saddlepoint density approximation in a leading-order term for Φ →0 for other conditional response distributions from the exponential dispersion family.” Examiner interprets the form c(y, Φ) for the inverse Gaussian distribution as the function of the relative inverse training rate and nonlinear term such as Xi2 as a power law function.).  
Please see motivation for claim 33 above.

Regarding claim 36,
The Huang/Kendall/Natarajan Combination teaches the method of claim 35, wherein the power law function has a power law exponent in a range from -1 to 3 (Natarajan [0037] and [0003] recite, in part, “A convolution property of the exponential…” Examiner interprets exponential dispersion as a power law exponent and the dispersion measured by range between a largest value and a smallest value as having a range of -1 to 3.).
Please see motivation for claim 33 above.

Regarding claim 37,
The Huang/Kendall/Natarajan Combination teaches the method of claim 35, wherein the power law function has a power law exponent that varies during the training (Natarajan [0075] and [0129] recites, in part, “A loss function for joint regression modeling (parameters in a regression model are estimated by minimizing this loss function) is the empirical negative log-likelihood for a conditional response variable from the ED(μ, Φ) family over the training data records {yi, xi, zi}i=1n. [0129] The synthetic data sets in one implementation comprises of 1,000 samples each for training and validation. A covariate set is 6-dimensional, x={x1, x2,…, x6}, where x1, x2, x3 are continuous-valued and uniformly sampled in an interval (0, 1), and x3, x4, x5 are categorical-valued with 4 levels denoted {1, 2, 3, 4} respectively which are sampled with equal probability. A response is given by y=Twp(μ(x), Φ(x)) where p=0, 2, 3, Tw denotes i2 used with the ED(μ, Φ) family as varying during training.).
Please see motivation for claim 33 above.
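
For illustration only, the limitations of claims 35 through 37 can be read as a power law f(r) = r^α applied to the relative inverse training rate r, with the exponent α confined to the range from -1 to 3 and allowed to change during training. The sketch below assumes that reading; the function names and the linear schedule are hypothetical and are not drawn from the application or from Natarajan.

```python
def power_law(rel_inv_rate, alpha):
    """Power-law function f(r) = r ** alpha for a positive relative inverse training rate."""
    if rel_inv_rate <= 0:
        raise ValueError("relative inverse training rate must be positive")
    return rel_inv_rate ** alpha


def alpha_schedule(step, total_steps, alpha_start=3.0, alpha_end=-1.0):
    """Hypothetical linear schedule for the exponent, kept between alpha_end and alpha_start."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp training progress to [0, 1]
    return alpha_start + frac * (alpha_end - alpha_start)


# Exponent halfway through a hypothetical 100-step run, applied to a rate of 2.0.
alpha = alpha_schedule(50, 100)
scaled = power_law(2.0, alpha)
```

With a positive exponent, tasks whose rate exceeds 1 receive larger targets; a negative exponent reverses that ordering, which is why the sign of α matters when the range includes -1.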

Claims 24-27 are rejected under 35 U.S.C. 103 as being unpatentable over Huang et al. (US 20170147905 A1, hereinafter Huang) in view of Kendall et al. (“Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics”) and Hoellwarth (US 20100079356 A1).
Regarding claim 24,
Huang discloses non-transitory memory configured to store: 
executable instructions (Huang [0112] recites, in part, “Embodiments of the present invention may be encoded upon one or more non-transitory computer-readable media with instructions for one or more processors or processing units to cause steps to be performed. It shall be noted that the one or more non-transitory computer-readable media shall include volatile and non-volatile memory.”), and 
a multitask network for determining outputs associated with a plurality of tasks, wherein the multitask network is trained using, for each task of the plurality of tasks (Huang fig. 8 and [0109] recites, in part, “FIG. 8 depicts a simplified block diagram of a computing system comprising an FCN to perform end-to-end multi-task object detection…” Fully Convolutional Network (FCN) to perform end-to-end multi-task object detections (i.e. multitask network for determining outputs associated with tasks)):  
a gradient norm of a single-task loss, of (1) a task output for a task of the plurality of tasks determined using the multitask network with a training image as input, and (2) a corresponding reference task output for the task associated with the training image, adjusted by a task weight for the task, with respect to a plurality of network weights of the multitask network (Huang [0058] and [0063] recites, in part, “After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction… [0063] In training, an input patch may be considered a “positive patch” if it comprises an object centered in the center at a specific scale. These patches comprise mainly negative samples around the positive samples. In embodiments, in order to fully explore the negative samples in the whole dataset, patches are cropped at random scale from training images, and resized to the same size and fed to the network...The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning.” Gradient descent learning on samples using loss and output gradients (i.e. gradient norm of tasks loss of a task output), positive and negative samples (i.e. reference tasks), and scaled by number of contributing pixels (i.e. reference task output adjusted by task weights)), 
a relative training rate for the task determined based on the single-task loss for the task (Huang [0063] recites “The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations.” Learning rate (i.e. training rate)), 
a gradient loss function comprising a difference between (1) the determined gradient norm for the task and (2) a corresponding target gradient norm determined based on (a) an average gradient norm of the plurality of tasks, (b) the relative training rate for the task, and (c) a hyperparameter (Huang [0054], [0058] and [0062]-[0063] recites, in part, “…a loss functions, such as L2 loss, hinge loss, and cross-entropy loss may be use in both face and car detection tasks. [0058] After negative mining, the badly predicted samples are relatively more likely to be selected, such that gradient descent learning on those samples reduces noise and, thus, leads to more robust prediction… [0062] …that combining may be performed in any other manner. In embodiments, the balance between classification and regression tasks is controlled by the parameter λloc. For example, the regression target d* may be normalized by dividing by the standard object height… [0063] The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning.” Loss function (i.e. gradient loss function), gradient descent learning on samples (i.e. a difference between gradient norm for each task and target gradient norm), normalized by dividing by the standard object height (i.e. average gradient norm), learning rate (i.e. training rate), and parameter λloc (i.e. hyperparameter)), and 
an updated task weight for the task using a gradient of the gradient loss function with respect to the task weight for the task (Huang [0063] recites “The loss and output gradients should be scaled by the number of contributing pixels, so that both loss and output gradients are comparable in multi-task learning. In embodiments, the global learning rate starts with 0.001 and is reduced by a factor of 10 every 100,000 iterations. A momentum term weight of 0.9 and a weight decay factor of 0.0005 are used.” Momentum term weight (i.e. updated task weight)); 
However, Huang does not disclose:
wherein the relative training rates are determined so that each of the plurality of tasks completes over a similar length of training time; and
a head mounted display system comprising: a display; a sensor; and a hardware processor in communication with the non-transitory memory and the display, the hardware processor programmed by the executable instructions to: receive a sensor datum captured by the sensor; determine a task output for each task of the plurality of tasks using the multitask network with the sensor datum as input; and cause the display to show information related to the determined task outputs to a user of the augmented reality device.  

Hoellwarth teaches
a head mounted display system comprising (Hoellwarth fig. 1 element 100 and [0047] recites “Head-mounted display system 100 can include a variety of features, which can be provided by one or both devices of the system when they are connected and in communications with one another.”): 
a display (Hoellwarth [0049] recites “The image based content may for example be viewed on the display of the head mounted display system.”); 
a sensor (Hoellwarth [0049] recites “As yet another example, the head-mounted system 100 can utilize a proximity sensor on one or both of the head mounted device and portable electronic device to detect and identify the relationship between the two devices or to detect and identify things in the outside environment.”); and 
a hardware processor in communication with the non-transitory memory and the display, the hardware processor programmed by the executable instructions to (Hoellwarth [0047] recites, in part, “Head-mounted display system 100 can include a variety of features, which can be provided by one or both devices of the system when they are connected and in communications with one another. For example, each device may include one or more of the following components: processors, display screen, controls (e.g., buttons, switches, touch pads, and/or screens), camera, receiver, antenna, microphone, speaker, batteries, optical subassembly, sensors, memory, communication systems, input/output ("I/O") systems, connectivity systems, cooling systems, connectors, and/or the like. If activated, these components may be configured to work together or separately depending on the needs of the system.” Processor connected and in communication with memory as part of head-mounted display system (i.e. hardware processor in communication with memory)): 
receive a sensor datum captured by the sensor (Hoellwarth fig. 17A element 1708 and [0254] recites “At step 1708, the head-mounted display system can determine whether a user input has been detected from sensors.” Detected input from sensors (i.e. sensor datum captured by sensor)); 
determine a task output for each task of the plurality of tasks using the multitask network with the sensor datum as input (Hoellwarth fig. 17A and [0254] recites “For example, accelerometers on the head-mounted display system can detect if the user has made any head movements. Based on the detection of a particular head movement, the head-mounted display system can determine if the head movement is an indication that the user would like to view image based content from the outside world.” Image viewing based on content from outside world (i.e. task output)); and 
cause the display to show information related to the determined task outputs to a user of the head mounted display system (Hoellwarth fig. 17A element 1706 and [0254] & [0258] recites “If, at step 1708, the head-mounted display system determines that a user input has been received from the sensors, process 1700 moves to step 1706. [0258] At step 1706, a PIP image frame overlaid on at least one of displayed left and right image frames (e.g., PIP mode) can be displayed.” PIP Image frame overlaid on left and right image frames (i.e. display to show information related to task outputs) from user input received by sensors of head-mounted display (i.e. user of the augmented reality device)).  

Kendall teaches
wherein the relative training rates (Pg. 6, section 3: “This last objective can be seen as learning the relative weights of the losses for each output. Large scale values σ2 will decrease the contribution of L2(W), whereas small scale σ2 will increase its contribution. The scale is regulated by the last term in the equation. The objective is penalised when setting σ2 too large (with the last term contributing a constant value log C – with C classes – to the loss). The multi-task objective with homoscedastic task uncertainty now becomes: L(W, σ1, σ2, ..., σi) = Σi [1/(2σi^2)] Li(W) + log σi^2 (11) over all tasks indexed by i. Again, we write Li(W) = ||yi − fW(x)||2 for regression losses yi, and Li(W) = −log Softmax(yi, fW(x)) for classification losses.” Equation 11 represents the training rate.) are determined so that each of the plurality of tasks completes over a similar length of training time (Pg. 11: “Figure 4: Training plots showing convergence of homoscedastic noise and task loss for an array of initialisation choices for the homoscedastic uncertainty terms for all three tasks. The left plot shows that the loss converges to the same minimum from varying initialisation choices. The centre plot…” The three tasks finish at the same time (100 iterations)).
It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the multi-task learning of Huang with the learning of multiple objectives using homoscedastic uncertainty of Kendall.
Doing so would allow for finding optimal weights for tasks. Balancing loss helps find the optimal weighting for each task which leads to improved performance (pg. 1; We interpret homoscedastic uncertainty as task-dependent weighting and show how to derive a principled multi-task loss function which can learn to balance various regression and classification losses. Our method can learn to balance these weightings optimally, resulting in superior performance, compared with learning each task individually.)
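
For illustration only, the multi-task objective quoted from Kendall (Equation 11) appears to have the form L(W, σ1, ..., σi) = Σi [1/(2σi^2)] Li(W) + log σi^2. The sketch below assumes that reconstruction of the quoted equation; the function name and the sample values are hypothetical.

```python
import math


def uncertainty_weighted_loss(task_losses, sigmas):
    """Sum over tasks of L_i / (2 * sigma_i**2) + log(sigma_i**2).

    task_losses: per-task loss values L_i(W)
    sigmas: per-task scale values sigma_i (must be positive)
    """
    total = 0.0
    for loss, sigma in zip(task_losses, sigmas):
        variance = sigma ** 2
        total += loss / (2.0 * variance) + math.log(variance)
    return total


# Hypothetical two-task example: a larger sigma down-weights the second task's loss
# but adds a log penalty, which regulates how large sigma can grow.
combined = uncertainty_weighted_loss([2.0, 4.0], [1.0, 2.0])
```

The sketch weights task losses by a per-task scale; it does not by itself compute a gradient norm or a training rate.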

Regarding claim 25,
The Huang/Kendall/Hoellwarth Combination teaches the system of claim 24, wherein the plurality of tasks comprises a plurality of perceptual tasks (Huang [0030], [0072] and [0074] recite, in part, “Object detection often involves multi-task learning, such as landmark localization, pose estimation, and semantic segmentation. [0072] Neural network-based face detector refers to those face detection system using…” Face detection with landmark localization (i.e. perceptual tasks)).
Please see motivation for claim 24 above.

Regarding claim 26,
The Huang/Kendall/Hoellwarth Combination teaches the system of claim 25, wherein the plurality of perceptual tasks comprises the face recognition, visual search, gesture identification, semantic segmentation, object detection, lighting detection, simultaneous localization and mapping, relocalization, or a combination thereof (Huang [0030], [0072] and [0074] recite, in part, “Object detection often involves multi-task learning, such as landmark localization, pose estimation, and semantic segmentation. [0072] Neural network-based face detector refers to those face detection system using neural network before the recent break-through results of CNNs for image classification… While the systems and methods presented herein have a similar detection pipeline, embodiments use modern CNNs as detectors. [0074] Deep Dense Face Detector (DDFD)… is a face detection system based on convolutional neural networks… Although DDFD is a complete detection pipeline, it is not an end-to-end framework … In contrast, embodiments of the present disclosure can be optimized…” Face detection (i.e. face recognition), object detection, landmark localization, and semantic segmentation).
Please see motivation for claim 24 above.

Regarding claim 27,
The Huang/Kendall/Hoellwarth Combination teaches the system of claim 24, wherein the sensor comprises an inertial measurement unit, an outward-facing camera, a depth sensing camera, a microphone, an eye imaging camera, or a combination thereof (Hoellwarth [0225]-[0226], [0228] and [0230] recites, in part, “Head-mounted device 1304 can include one or more sensors 1324 to detect various signals. Suitable sensors can include, for example, ambient sound detectors, proximity sensors, accelerometers, light detectors, cameras, and temperature sensors. [0226] To identify the detected words, the ambient sound detector can attempt to match the words to a stored library of words. [0228] Accelerometers on head-mounted device 1304 can detect the user's head movements. [0230] Sensors 1324 can include a camera which can capture image based content of the outside world.” Sensor including ambient sound detector (i.e. microphone), camera capturing outside world (i.e. outward-facing camera), and accelerometers (i.e. inertial measurement unit)).  
Please see motivation for claim 24 above.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 

Hwang et al. (U.S. 20180060722 A1) teaches training a Convolutional Neural Network (CNN) with images and video based on weakly supervised learning.


Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HENRY K NGUYEN whose telephone number is (571)272-0217. The examiner can normally be reached Mon - Fri 7:00am-4:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen, can be reached at (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.






/H.N./Examiner, Art Unit 2121


/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121