DETAILED ACTION
This action is in response to the application filed 11/22/2019 which claims foreign priority to JP2018-226721 filed 12/03/2018. Claims 1-20 are pending and have been considered.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 02/20/2020 and 03/11/2020 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 

Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitations use a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitations are:
training data obtaining unit configured to obtain in claim 1
error map obtaining unit configured to obtain in claim 1
training unit configured to train in claim 1
Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitations to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitations recite sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-5, 8-10, 12, 14-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Li et al. ("Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network", hereinafter "Li").

Regarding claim 1, Li discloses A training apparatus for training a neural network [Abstract], the neural network being configured to, when an input image is inputted, output a detection result of a first type and a detection result of a second type for each position of the input image (“Our heterogeneous multi-task framework consists of two types of tasks: 1) a pose regression task, where the aim is to predict the locations of human body joints in an image; 2) a set of body-part detection tasks, where the goal is to classify whether a window in the image contains the specific body part.” [pg. 483, § 3. Heterogeneous Multi-task learning, ¶1; note: the body-part detection tasks would correspond to a detection result of a first type and the pose regression task would correspond to a detection result of a second type.]), the training apparatus (“We train and evaluate our network on a Dell T3400 with GTX 770 4G. Training the network takes 1 to 2 days, while the evaluation for 4000 images takes 5-6 seconds.” [pg. 486, § 4.2 Experiment setup, ¶1; note: Examiner is interpreting the following units under 112(f) and thus interpreting a processor to perform the corresponding functions.]) comprising: 
a training data obtaining unit configured to obtain a training image to be input to the neural network for training (“Our network structure is shown in Figure 3. The input is an RGB image with human. The first 6 hidden layers are shared by both regression and detection tasks. In the shared layers, we only use convolutional layers and pooling layers to ensure the activation of neurons are affected by only local patterns in the input.” [pg. 485, left col, ¶3]); 
an error map obtaining unit configured to obtain an error map indicating a detection error for a detection result of the first type, for each position of the training image (“Finally, calculating the binary indicator yp,l for each window l, results in a binary indicator map for part p. Figure 2 shows an example converting the upper-arm annotation into an indicator map. Note that we allow multiple body parts to appear in the same window, and also allow one body part to appear in several windows. For each detection task for part p and window l, we minimize the cross-entropy error function 
[Equation (3): cross-entropy error function; image media_image1.png not reproduced]
where yp,l is the ground-truth label, and ŷp,l is the corresponding detection probability from the classifier.” [pg. 484, § 3.2 Body part detection, ¶2-3; Examiner is interpreting equation 3 to be equivalent to an “error map”.]); and 
a training unit configured to train the neural network using the detection result of the first type and the detection result of the second type that are obtained by inputting the training image to the neural network, and the error map (“We jointly train the regression and detection networks with the global cost function in (4). We use backpropagation [16] to update the weights. Given a training image, predictions for both tasks are calculated, and the corresponding gradients are back-propagated through the network.” [pg. 485, § 3.5 Training, ¶1; note: “Our global cost function is the linear combination of the regression cost function for all joints and the detection cost function for all parts and windows, over all training images” [pg. 484, § 3.3 Global cost function; Global cost function includes equation (3)]]).
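For illustration only, the per-window cross-entropy error cited above can be sketched in code. The sketch assumes the standard binary cross-entropy form, since equation (3) of Li is not reproduced in this action; the function name, the list-of-lists layout, and the clamping constant are hypothetical and are not taken from Li.

```python
import math


def cross_entropy_error_map(labels, probs, eps=1e-12):
    """Per-window binary cross-entropy for one body part (illustrative only).

    labels: 2-D list of ground-truth indicators, each 0 or 1 (one per window).
    probs:  2-D list of predicted detection probabilities in [0, 1].
    Returns a 2-D list of the same shape with one error value per window.
    """
    error_map = []
    for label_row, prob_row in zip(labels, probs):
        row = []
        for y, y_hat in zip(label_row, prob_row):
            # Clamp the probability so that log(0) is never evaluated.
            y_hat = min(max(y_hat, eps), 1.0 - eps)
            row.append(-(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)))
        error_map.append(row)
    return error_map
```

Applied to a binary indicator map and a same-sized map of detection probabilities, the function returns one error value per window, which is one way to read the cited equation as a position-wise map of detection errors.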

Regarding claim 2, Li discloses The training apparatus according to claim 1, wherein the detection result of the second type can be generated from the detection result of the first type (“Low level feature sharing: We allow the detection tasks and regression tasks to share the same learned feature representation. This is motivated by the following two reasons. First, features learned for the detection task should also be helpful for identifying parts or joints in the regression task. Second, feature sharing will reduce the number of parameters and encourage the network to generalize on a larger range of samples.” [pg. 484, 3.4 Network Structure, first bullet; See further: pg. 484, Figure 1. discloses “Next, a convolutional neural network (CNN) extracts shared features from the cropped image, and the shared features are the inputs to the joint point regression tasks and the body-part detection tasks. The CNN, regression, and detection tasks are learned simultaneously, resulting in a shared feature representation that is good for all tasks.”]).

Regarding claim 3, Li discloses The training apparatus according to claim 1, wherein the error map indicates a position of an underdetection region or a misdetection region caused by a detection error in the detection result of the first type (“Evaluation on the whole Buffy test set includes errors due to mis-detection of the upper body. To investigate the pose estimation performance alone, we also present results on the subset of the Buffy test set where the upper body detector predicts the correct bounding box. In this case, HMLPE achieves slightly better results than [28] (0.7% better on lower arms and 0.5% better on upper arms).” [pg. 486, 4.3 Evaluation on Buffy Set, ¶4]).

Regarding claim 4, Li discloses The training apparatus according to claim 1, wherein the neural network includes an input layer to which the image is inputted, an intermediate layer in which processing is performed, a first output layer for outputting the detection result of the first type, and a second output layer that branches from the intermediate layer and is for outputting the detection result of the second type (“
[Figure 3: network architecture for pose estimation; image media_image2.png not reproduced]
” [pg. 485, Figure 3; note: The first 6 hidden layers are shared by both regression and detection tasks, examiner is interpreting hidden layers to be equivalent to an intermediate layer.]).

Regarding claim 5, Li discloses The training apparatus according to claim 1, wherein the training data obtaining unit is further configured to obtain first supervisory data indicating a detection result of the first type that is prepared in advance for the training image (“We collect training data from several data sets, including Buffy Stickmen [7], ETHZ Stickmen [4], Leed Sport Pose (LSP [13]), Synchronic Activities Stickmen (SA [6]), Frames Labeled In Cinema (FLIC [19]), We Are Family(WAF) [5]. For Buffy, LSP, FLIC we only use their respective training sets, while we use the whole ETHZ, SA, and WAF datasets for training. In total, we have collected 8427 images for training. We represent the human body with a set of joints, and use the segments between those joints to represent body parts. For data sets with only stick labels, we use the nearest end of stick or average of nearest ends as the joint point. We define 8 joints (nose, neck, left and right shoulders, left and right elbows, and left and right wrists), and 7 body parts (head, left and right shoulder, left and right upper arms, and left and right lower arms)” [pg. 485-486, § 4.1 Training Data, ¶1-2; Examiner is interpreting supervisory data to be training data.]), and 
the error map obtaining unit is further configured to generate the error map based on an error between the first supervisory data and the detection result of the first type obtained by inputting the training image to the neural network (“For each detection task for part p and window l, we minimize the cross-entropy error function 
[Equation (3): cross-entropy error function; image media_image1.png not reproduced]
where yp,l is the ground-truth label, and ŷp,l is the corresponding detection probability from the classifier.” [pg. 484, § 3.2 Body part detection, ¶3; ground truth label would correspond to training data and detection probability would be a detection result of the first type (i.e. body part).]).

Regarding claim 8, Li discloses The training apparatus according to claim 1, wherein the training data obtaining unit is further configured to obtain first supervisory data of the first type and second supervisory data of the second type, which are prepared in advance for the training image (“We represent the human body with a set of joints, and use the segments between those joints to represent body parts. For data sets with only stick labels, we use the nearest end of stick or average of nearest ends as the joint point. We define 8 joints (nose, neck, left and right shoulders, left and right elbows, and left and right wrists), and 7 body parts (head, left and right shoulder, left and right upper arms, and left and right lower arms). Since Buffy, ETHZ, SA, WAF only provide the upper-end and lower-end of the head, we use the middle point as the nose position. We illustrate our parts and joints definition in Figure 2.” [pg. 486, § 4.1 Training data, ¶2]).

Regarding claim 9, Li discloses The training apparatus according to claim 8, wherein the training unit is further configured to train the neural network based on an error between the detection result of the first type and the first supervisory data, and an error between the detection result of the second type and the second supervisory data (“Our global cost function is the linear combination of the regression cost function for all joints and the detection cost function for all parts and windows, over all training images 
[Equation (4): global cost function; image media_image3.png not reproduced]
 where λr and λd are the weights for regression and detection tasks, respectively, and the superscript (t) indicates the index of the training image” [pg. 484, § 3.3 Global cost function; note: the global cost function is a combination of eq (3) (error loss for body detection) and eq (1) (error loss for joint regression).]).
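As a rough illustration of the quoted description, the global cost can be sketched as a weighted sum of per-image regression and detection costs. This is a minimal sketch under assumptions: the exact form and any normalization of equation (4) are not reproduced in this action, and the function name and the list-of-lists input layout are hypothetical.

```python
def global_cost(regression_costs, detection_costs, lambda_r=1.0, lambda_d=1.0):
    """Weighted linear combination of task costs over all training images.

    regression_costs: one list per training image, holding a cost per joint.
    detection_costs:  one list per training image, holding a cost per
                      (body part, window) pair.
    lambda_r, lambda_d: weights for the regression and detection tasks.
    """
    total = 0.0
    for per_joint, per_window in zip(regression_costs, detection_costs):
        # Sum each task's costs for this image, then apply the task weight.
        total += lambda_r * sum(per_joint) + lambda_d * sum(per_window)
    return total
```

Setting lambda_d to zero in this sketch leaves only the regression term, and setting lambda_r to zero leaves only the detection term.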

Regarding claim 10, Li discloses The training apparatus according to claim 9, wherein the training unit is further configured to use a detection error in the detection result of the first type to weight, for each position of the training image, the error between the detection result of the second type and the second supervisory data (“Next we study the effect of multi-task training, i.e., the joint learning of the regression and detection tasks. We set different values for the weights of the regression and detection tasks. All parameters except the weights on the cost function are kept the same. We show training and testing error in Figure 5 and in Table 3. Firstly, the network with only the regression task performs poorly on both the training and testing sets. Even using tiny weights on the detection tasks help to improve the convergence, leading to a significant performance increase. Within a certain range, increasing weights on the detection tasks leads to lower errors on the test set. For larger weights on the detection tasks, the performance decreases. This is reasonable since the gradient will be dominated by detection task in this case.” [pg. 487, § 4.5 Effect of multi-task training, ¶1-2]).

Regarding claim 12, Li discloses The training apparatus according to claim 8, wherein the detection result of the second type and the detection result of the first type indicate different information with respect to a detection target of the same type (“Our heterogeneous multi-task framework consists of two types of tasks: 1) a pose regression task, where the aim is to predict the locations of human body joints in an image; 2) a set of body-part detection tasks, where the goal is to classify whether a window in the image contains the specific body part. In the following, we assume that a bounding box around the human has already been provided, e.g., using an upper body detector” [pg. 483, § 3. Heterogeneous Multi-task Learning; Examiner is interpreting detection target of the same type to be equivalent to a human in the image.]).

Regarding claim 14, Li discloses The training apparatus according to claim 1, wherein the neural network is configured to output the detection result of the first type and the detection result of the second type for each position of the input image as an estimation map (“The detection task is to determine whether a local window contains the specific body part, while the regression task is to predict the coordinates of the joint position. Hence, the features extracted from the lower layers should not be translation invariant, i.e., the positions of the features should be preserved in the feature map” [pg. 484-485, § 3.4 Network Structure, 2nd bullet; See further Figure 2. and Figure 3.]).

Regarding claim 15, Li discloses The training apparatus according to claim 14, wherein the error map obtaining unit is further configured to generate the error map for the detection result of the first type based on first supervisory data and the estimation map representing the detection result of the first type (“Finally, calculating the binary indicator yp,l for each window l, results in a binary indicator map for part p. Figure 2 shows an example converting the upper-arm annotation into an indicator map. Note that we allow multiple body parts to appear in the same window, and also allow one body part to appear in several windows. For each detection task for part p and window l, we minimize the cross-entropy error function 
[Equation (3): cross-entropy error function; image media_image1.png not reproduced]
where yp,l is the ground-truth label, and ŷp,l is the corresponding detection probability from the classifier.” [pg. 484, § 3.2 Body part detection, ¶3]).

Regarding claim 16, Li discloses The training apparatus according to claim 1, wherein the detection result of the first type is a region of a predetermined object, and the detection result of the second type is a region of a specific portion of the predetermined object (“Our heterogeneous multi-task framework consists of two types of tasks: 1) a pose regression task, where the aim is to predict the locations of human body joints in an image; 2) a set of body-part detection tasks, where the goal is to classify whether a window in the image contains the specific body part.” [pg. 483, § 3. Heterogeneous Multi-task learning, ¶1, Body part detection would correspond to a region of a predetermined object (i.e. arm) and joint detection would correspond to a region of a specific portion (i.e. wrist, elbow, etc.)]).

Regarding claim 17, Li discloses A processing apparatus for outputting an estimation map, the estimation map indicating a detection result for each position of an input image (See pg. 483, Figure 2.), the processing apparatus comprising:
(“
[Figure 3: network architecture for pose estimation; image media_image2.png not reproduced]
” [pg. 485, Figure 3.; note: Body part detection would correspond to an output of a first type and joint point regression would correspond to an output of a second type.]), the training apparatus comprising: 
a training data obtaining unit configured to obtain a training image to be input to the neural network for training (“Our network structure is shown in Figure 3. The input is an RGB image with human. The first 6 hidden layers are shared by both regression and detection tasks. In the shared layers, we only use convolutional layers and pooling layers to ensure the activation of neurons are affected by only local patterns in the input.” [pg. 485, left col, ¶3]); 
an error map obtaining unit configured to obtain an error map indicating a detection error for a detection result of the first type, for each position of the training image (“Finally, calculating the binary indicator yp,l for each window l, results in a binary indicator map for part p. Figure 2 shows an example converting the upper-arm annotation into an indicator map. Note that we allow multiple body parts to appear in the same window, and also allow one body part to appear in several windows. For each detection task for part p and window l, we minimize the cross-entropy error function 
[Equation (3): cross-entropy error function; image media_image1.png not reproduced]
where yp,l is the ground-truth label, and ŷp,l is the corresponding detection probability from the classifier.” [pg. 484, § 3.2 Body part detection, ¶2-3; Examiner is interpreting equation 3 to be equivalent to an “error map”.]); and 
a training unit configured to train the neural network using the detection result of the first type and the detection result of the second type that are obtained by inputting the training image to the neural network, and the error map (“We jointly train the regression and detection networks with the global cost function in (4). We use backpropagation [16] to update the weights. Given a training image, predictions for both tasks are calculated, and the corresponding gradients are back-propagated through the network.” [pg. 485, § 3.5 Training, ¶1; note: “Our global cost function is the linear combination of the regression cost function for all joints and the detection cost function for all parts and windows, over all training images” [pg. 484, § 3.3 Global cost function; Global cost function includes equation (3)]]); 
and a generation unit configured to generate the estimation map by inputting input images to the neural network (“The detection task is to determine whether a local window contains the specific body part, while the regression task is to predict the coordinates of the joint position. Hence, the features extracted from the lower layers should not be translation invariant, i.e., the positions of the features should be preserved in the feature map” [pg. 484-485, § 3.4 Network Structure, 2nd bullet; See further Figure 2. and Figure 3.]).

Regarding claim 18, Li discloses A neural network for outputting, as an estimation map, a detection result for each position of an input image, the neural network comprising:
an input layer to which the input image is inputted (“Network architecture for pose estimation: The input layer is 112×112 RGB image.” [pg. 485, Figure 3.]); 
an intermediate layer in which processing is performed (“The shared CNN consists of 3 convolutional layers, each followed by a max-pooling layer” [pg. 485, Figure 3.]); and 
an output layer configured to output the detection result (“For our HMLPE, the pose regression task predicts 8 joint positions (16 outputs in total), and the detection task has 7 body parts” [pg. 486, § 4.2. Experiment setup, ¶1; See further Figure 1. and Figure 3.]), 
wherein the neural network is trained such that a different detection result that can be generated from the detection result is obtained from the intermediate layer (“
[Figure 3: network architecture for pose estimation; image media_image2.png not reproduced]
” [pg. 485, Figure 3.]).  

Regarding claim 19, Li discloses A method of training a neural network [Abstract] being configured to, when an input image is inputted, output a detection result of a first type and a detection result of a second type for each position of the input image (“Our heterogeneous multi-task framework consists of two types of tasks: 1) a pose regression task, where the aim is to predict the locations of human body joints in an image; 2) a set of body-part detection tasks, where the goal is to classify whether a window in the image contains the specific body part.” [pg. 483, § 3. Heterogeneous Multi-task learning, ¶1; note: the body-part detection tasks would correspond to a detection result of a first type and the pose regression task would correspond to a detection result of a second type.]), the training apparatus (“We train and evaluate our network on a Dell T3400 with GTX 770 4G. Training the network takes 1 to 2 days, while the evaluation for 4000 images takes 5-6 seconds.” [pg. 486, § 4.2 Experiment setup, ¶1; note: Examiner is interpreting the following units under 112(f) and thus interpreting a processor to perform the corresponding functions.]) the method comprising: 
obtaining a training image to be input to the neural network for training (“Our network structure is shown in Figure 3. The input is an RGB image with human. The first 6 hidden layers are shared by both regression and detection tasks. In the shared layers, we only use convolutional layers and pooling layers to ensure the activation of neurons are affected by only local patterns in the input.” [pg. 485, left col, ¶3]); 
obtaining an error map indicating a detection error for a detection result of the first type, for each position of the training image (“Finally, calculating the binary indicator yp,l for each window l, results in a binary indicator map for part p. Figure 2 shows an example converting the upper-arm annotation into an indicator map. Note that we allow multiple body parts to appear in the same window, and also allow one body part to appear in several windows. For each detection task for part p and window l, we minimize the cross-entropy error function 
[Equation (3): cross-entropy error function; image media_image1.png not reproduced]
where yp,l is the ground-truth label, and ŷp,l is the corresponding detection probability from the classifier.” [pg. 484, § 3.2 Body part detection, ¶2-3; Examiner is interpreting equation 3 to be equivalent to an “error map”.]); and 
training the neural network using the detection result of the first type and the detection result of the second type that are obtained by inputting the training image to the neural network, and the error map (“We jointly train the regression and detection networks with the global cost function in (4). We use backpropagation [16] to update the weights. Given a training image, predictions for both tasks are calculated, and the corresponding gradients are back-propagated through the network.” [pg. 485, § 3.5 Training, ¶1; note: “Our global cost function is the linear combination of the regression cost function for all joints and the detection cost function for all parts and windows, over all training images” [pg. 484, § 3.3 Global cost function; Global cost function includes equation (3)]]).

Regarding claim 20, Li discloses A non-transitory computer-readable medium storing a program which, when executed by a computer comprising a processor and a memory (“We train and evaluate our network on a Dell T3400 with GTX 770 4G. Training the network takes 1 to 2 days, while the evaluation for 4000 images takes 5-6 seconds.” [pg. 486, § 4.2 Experiment setup, ¶1]), causes the computer to perform:
obtaining a training image to be input to the neural network for training (“Our network structure is shown in Figure 3. The input is an RGB image with human. The first 6 hidden layers are shared by both regression and detection tasks. In the shared layers, we only use convolutional layers and pooling layers to ensure the activation of neurons are affected by only local patterns in the input.” [pg. 485, left col, ¶3]); 
obtaining an error map indicating a detection error for a detection result of the first type, for each position of the training image (“Finally, calculating the binary indicator yp,l for each window l, results in a binary indicator map for part p. Figure 2 shows an example converting the upper-arm annotation into an indicator map. Note that we allow multiple body parts to appear in the same window, and also allow one body part to appear in several windows. For each detection task for part p and window l, we minimize the cross-entropy error function 
[Equation (3): cross-entropy error function; image media_image1.png not reproduced]
where yp,l is the ground-truth label, and ŷp,l is the corresponding detection probability from the classifier.” [pg. 484, § 3.2 Body part detection, ¶2-3; Examiner is interpreting equation 3 to be equivalent to an “error map”.]); and 
training the neural network using the detection result of the first type and the detection result of the second type that are obtained by inputting the training image to the neural network, and the error map (“We jointly train the regression and detection networks with the global cost function in (4). We use backpropagation [16] to update the weights. Given a training image, predictions for both tasks are calculated, and the corresponding gradients are back-propagated through the network.” [pg. 485, § 3.5 Training, ¶1; note: “Our global cost function is the linear combination of the regression cost function for all joints and the detection cost function for all parts and windows, over all training images” [pg. 484, § 3.3 Global cost function; Global cost function includes equation (3)]]).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 6, 7, 11, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Li in view of Zhang et al. ("Facial Landmark Detection by Deep Multi-task Learning", hereinafter "Zhang").

Regarding claim 6, Li teaches The training apparatus according to claim 5, where Li further teaches wherein the error map obtaining unit is further configured to generate the error map based on an error between the first supervisory data and the detection result of the first type (“Finally, calculating the binary indicator yp,l for each window l, results in a binary indicator map for part p. Figure 2 shows an example converting the upper-arm annotation into an indicator map. Note that we allow multiple body parts to appear in the same window, and also allow one body part to appear in several windows. For each detection task for part p and window l, we minimize the cross-entropy error function 
[Equation image media_image1.png: equation 3 of Li, not reproduced]
 where yp,l is the ground-truth label, and yˆp,l is the corresponding detection probability from the classifier.” [pg. 484, § 3.2 Body part detection, ¶2-3; Examiner is interpreting equation 3 to be equivalent to an “error map”.]) obtained by inputting the training image to a neural network before training (“We pre-train the network using the training data discussed in the previous section, in order to obtain an initial network. Then, we use the initial network as the starting point for training the network using the training data of a specific dataset, either Buffy or FLIC. The initial network serves as a prior to help regularize the network.” [pg. 486, 4.2. Experiment setup, ¶1]), and 
the training unit is further configured to use a detection result of the first type and a detection result of the second type obtained by inputting the training image to a neural network after the training (“We jointly train the regression and detection networks with the global cost function in (4). We use backpropagation [16] to update the weights. Given a training image, predictions for both tasks are calculated, and the corresponding gradients are back-propagated through the network.” [pg. 485, § 3.5. Training, ¶1]), and
However, Li fails to explicitly teach the error map to perform further training of the neural network after the training.
Zhang teaches the error map to perform further training of the neural network after the training (“Note that xl is the shared representation between the main task r, and related tasks A. Eq.(4) and Eq.(3) can be trained jointly. The former learns the shared space and the latter optimizes the tasks with respect to this space, and then the errors of the tasks can be propagated back to refine the space. We iterate this learning procedure until convergence.” [pg. 98, § 3.1 Problem Formulation, ¶4]).
Li and Zhang are both in the same field of endeavor of multi-task learning. Li discloses multi-task learning for human pose estimation. Zhang discloses a multi-task facial landmark detection model. It would have been obvious to a person of ordinary skill in the art before the effective filing date to modify the training steps of Li by using the errors of the detection tasks to perform further training of the network as taught by Zhang. One would have been motivated to make this modification in order to improve the detection accuracy with further training by optimizing the main task jointly with related tasks until convergence. [pg. 95, ¶2 and Fig. 1, Zhang]
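For illustration only, the per-window cross-entropy error quoted from Li above, which the Examiner interprets as an "error map", can be sketched as follows. This is a minimal sketch that assumes the standard binary cross-entropy form; equation 3 of Li is not reproduced in this action, and the function name, the nested-list layout of the maps, and the clipping constant are hypothetical choices made for the example.

```python
import math


def cross_entropy_error_map(indicator_map, prob_map, eps=1e-12):
    """Return one binary cross-entropy value per window.

    indicator_map: ground-truth labels y (0 or 1) for each window of one part.
    prob_map: detection probabilities y_hat for the same windows.
    Assumed form: -(y*log(y_hat) + (1-y)*log(1-y_hat)); not taken from Li.
    """
    error_map = []
    for y_row, p_row in zip(indicator_map, prob_map):
        row = []
        for y, p in zip(y_row, p_row):
            # Clip the probability so that log() is always defined.
            p = min(max(p, eps), 1.0 - eps)
            row.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
        error_map.append(row)
    return error_map


# Hypothetical 2x2 grid of windows for a single body part.
indicator = [[0, 1], [0, 0]]
probs = [[0.1, 0.9], [0.5, 0.2]]
error_map = cross_entropy_error_map(indicator, probs)
```

Under this assumed form, a window whose detection probability is far from its ground-truth label receives a larger entry than a window that is detected confidently and correctly.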

Regarding claim 7, Li teaches The training apparatus according to claim 5, where Li further teaches wherein the error map obtaining unit is further configured to, based on an error between the first supervisory data and the detection result of the first type obtained by inputting the training image to a neural network before training (“We pre-train the network using the training data discussed in the previous section, in order to obtain an initial network. Then, we use the initial network as the starting point for training the network using the training data of a specific dataset, either Buffy or FLIC. The initial network serves as a prior to help regularize the network.” [pg. 486, 4.2. Experiment setup, ¶1]), and 
(“We use Percentage of Correct Part (PCP) to measure the accuracy of pose estimation. As pointed out in [11], the previous PCP evaluation measure does not compute PCP correctly. We use the evaluation tool provided by [11] to calculate the corrected PCP, where an estimated body part with end points (e1, e2) is considered as correct if 
[Equation image media_image4.png: not reproduced]
 where (g1, g2) and L are ground truth position and length of the part, and α is the parameter for PCP. We use the standard value of α = 0.5” [pg. 486, § 4.3 Evaluation on Buffy Set, ¶2]),
generate the error map (“Finally, calculating the binary indicator yp,l for each window l, results in a binary indicator map for part p. Figure 2 shows an example converting the upper-arm annotation into an indicator map. Note that we allow multiple body parts to appear in the same window, and also allow one body part to appear in several windows. For each detection task for part p and window l, we minimize the cross-entropy error function 
[Equation image media_image1.png: equation 3 of Li, not reproduced]
 where yp,l is the ground-truth label, and yˆp,l is the corresponding detection probability from the classifier.” [pg. 484, § 3.2 Body part detection, ¶2-3; Examiner is interpreting equation 3 to be equivalent to an “error map”.])

Zhang teaches used in further training of the neural network after the training (“Note that xl is the shared representation between the main task r, and related tasks A. Eq.(4) and Eq.(3) can be trained jointly. The former learns the shared space and the latter optimizes the tasks with respect to this space, and then the errors of the tasks can be propagated back to refine the space. We iterate this learning procedure until convergence.” [pg. 98, § 3.1 Problem Formulation, ¶4]).
Li and Zhang are both in the same field of endeavor of multi-task learning. Li discloses multi-task learning for human pose estimation. Zhang discloses a multi-task facial landmark detection model. It would have been obvious to a person of ordinary skill in the art before the effective filing date to modify the training steps of Li by using the errors of the detection tasks to perform further training of the network as taught by Zhang. One would have been motivated to make this modification in order to improve the detection accuracy with further training by optimizing the main task jointly with related tasks until convergence. [pg. 95, ¶2 and Fig. 1, Zhang]
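For illustration only, the Percentage of Correct Part (PCP) measure quoted from Li in the rejection of claim 7 can be sketched as follows. The correctness condition itself appears only as an image in this action, so the sketch assumes a commonly used form in which both estimated end points must lie within α·L of their ground-truth end points; that condition, the function names, and the sample coordinates are assumptions, not text from Li.

```python
import math


def is_part_correct(e1, e2, g1, g2, alpha=0.5):
    """Assumed PCP test: each end point is within alpha * L of ground truth.

    e1, e2: estimated end points of the body part, as (x, y) tuples.
    g1, g2: ground-truth end points; L is the ground-truth part length.
    """
    length = math.dist(g1, g2)
    threshold = alpha * length
    return math.dist(e1, g1) <= threshold and math.dist(e2, g2) <= threshold


def pcp(estimates, ground_truths, alpha=0.5):
    """Fraction of parts judged correct under the assumed condition."""
    correct = sum(
        is_part_correct(e1, e2, g1, g2, alpha)
        for (e1, e2), (g1, g2) in zip(estimates, ground_truths)
    )
    return correct / len(ground_truths)


# Hypothetical parts: the first estimate is close, the second is far off.
truth = [((0.0, 0.0), (10.0, 0.0)), ((0.0, 0.0), (10.0, 0.0))]
guess = [((1.0, 1.0), (9.0, 0.0)), ((0.0, 6.0), (10.0, 0.0))]
score = pcp(guess, truth)
```

With α = 0.5 and a ground-truth length of 10, the threshold is 5, so the first hypothetical part passes and the second fails because one end point is 6 units away.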

Regarding claim 11, Li teaches The training apparatus according to claim 9; however, Li fails to explicitly teach wherein the detection result of the second type indicates a detection error for the detection result of the first type, and the training unit uses the error map as the second supervisory data.
Zhang teaches wherein the detection result of the second type indicates a detection error for the detection result of the first type (“To verify the effectiveness of the task-wise early stopping, we train the proposed TCDCN with and without this technique and compare the landmark detection rates in Figure 7(a), which shows that without task-wise early stopping, the accuracy is much lower. Figure 7(b) plots the main task’s loss errors of the training set and the validation set within 2,600 iterations. Without early stopping, the training error converges slowly and exhibits substantial oscillations. However, convergence rates of both the training and validation sets are fast and stable when using the proposed early stopping scheme. In Figure 7(b), we also point out when and which task has been halted during the training procedure. For example, ‘wearing glasses’ and ‘gender’ are stopped at the 250th and 350th iterations, and ‘pose’ lasts to the 750th iteration, which matches our expectation that ‘pose’ has the largest beneficit to landmark detection, compared to the other related tasks” [pg. 103, § 4.2 The Benefits of Task-Wise Early Stopping, ¶1]), and the training unit uses the error map as the second supervisory data (“Now we introduce a criterion to automatically determine when to stop learning an auxiliary task. Let $E^{a}_{val}$ and $E^{a}_{tr}$ be the values of the loss function of task a on the validation set and training set, respectively. We stop the task if its measure exceeds a threshold $\epsilon$ as below… The first term in Eq.(5) represents the tendency of the training error. If the training error drops rapidly within a period of length k, the value of the first term is small, indicating that training can be continued as the task is still valuable; otherwise, the first term is large, then the task is more likely to be stopped. The second term measures the generalization error compared to the training error. The $\lambda_a$ is the importance coefficient of a-th task’s error, which can be learned through gradient descent. Its magnitude reveals that more important task tends to have longer impact.” [pg. 100, Task-Wise Early Stopping, ¶2]).
Li and Zhang are both in the same field of endeavor of multi-task learning. Li discloses multi-task learning for human pose estimation. Zhang discloses a multi-task facial landmark detection model. It would have been obvious to a person of ordinary skill in the art before the effective filing date to modify the training steps of Li by using the errors of the detection tasks to perform further training of the network as taught by Zhang. One would have been motivated to make this modification in order to improve the detection accuracy with further training by optimizing the main task jointly with related tasks until convergence. [pg. 95, ¶2 and Fig. 1, Zhang]

Regarding claim 13, Li discloses The training apparatus according to claim 8; however, Li fails to explicitly teach wherein the training data obtaining unit is further configured to generate the second supervisory data using the first supervisory data.
Zhang teaches wherein the training data obtaining unit is further configured to generate the second supervisory data using the first supervisory data (“The training dataset we use is identical to [21], consisting of 10,000 outdoor face images from the web. Each image is annotated with bounding box and five landmarks, i.e. centers of the eyes, nose, corners of the mouth, as depicted in Figure 1. We augmented the training samples by small jittering, including translation, in-plane rotation, and zooming. The ground truths of the related tasks are labeled manually. This dataset, known as Multi-Task Facial Landmark (MTFL) dataset, and the landmark detector will be released for research usage.” [pg. 101, § Model Training]).
Li and Zhang are both in the same field of endeavor of multi-task learning. Li discloses multi-task learning for human pose estimation. Zhang discloses a multi-task facial landmark detection model. It would have been obvious to a person of ordinary skill in the art before the effective filing date to modify the training steps of Li by using the errors of the detection tasks to perform further training of the network as taught by Zhang. One would have been motivated to make this modification in order to improve the detection accuracy with further training by optimizing the main task jointly with related tasks until convergence. [pg. 95, ¶2 and Fig. 1, Zhang]

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Ranjan et al. ("HyperFace: A Deep Multi-task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition") discloses multi-task learning using deep CNNs.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL H HOANG whose telephone number is (571)272-8491.  The examiner can normally be reached on Mon-Fri 8:30AM-4:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/M.H.H./Examiner, Art Unit 2122




/ERIC NILSSON/Primary Examiner, Art Unit 2122