DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Drawings
The drawings are objected to because a typographical error exists in the label for numeral 62 of Fig. 6, where “Second obtaining module” (emphasis added).  Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
Applicant is reminded of the proper language and format for an abstract of the disclosure.
The abstract should be in narrative form and generally limited to a single paragraph on a separate sheet within the range of 50 to 150 words in length. The abstract should describe the disclosure sufficiently to assist readers in deciding whether there is a need for consulting the full patent text for details.
The language should be clear and concise and should not repeat information given in the title. It should avoid using phrases which can be implied, such as, “The disclosure concerns,” “The disclosure defined by this invention,” “The disclosure describes,” etc.  In addition, the form and legal phraseology often used in patent claims, such as “means” and “said,” should be avoided.

The abstract of the disclosure is objected to because the abstract contains language which repeats information given in the title and uses phrases which can be implied, e.g. “the present disclosure provides…” Furthermore, the abstract should be in narrative form and should describe the disclosure sufficiently to assist readers in deciding whether there is a need for consulting the full patent text for details.  Correction is required.  See MPEP § 608.01(b).

The title of the invention is not descriptive.  A new title is required that is clearly indicative of the invention to which the claims are directed. 
The following title is suggested: “Image Recognition Method, Apparatus, and Computer Readable Storage Medium for Neural Network based Body Joint Prediction”.


CLAIM INTERPRETATION

The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
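Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.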
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: “first obtaining module”, “second obtaining module”, “input module”, “predicted location obtaining module”, “third obtaining module”, “fourth obtaining module”, “training module”, “input unit”, “first obtaining unit”, “comparison unit”, “updating unit”, and “triggering unit” in claims 8-13.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 6-8, and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Vajda et al. (US 2019/0172223, effectively filed 3 December 2017), herein Vajda, in view of Park et al. (“3D Human Pose Estimation Using Convolutional Neural Networks with 2D Pose Information”), herein Park.
Regarding claim 1, Vajda discloses an image recognition method, applied to an electronic device, and comprising: 
obtaining a to-be-recognized feature of a to-be-recognized image (see Vajda [0080], [0083], and [0087], where a plurality of regional feature maps are generated based on a features map extracted from the input image), the to-be-recognized image comprising a target person (see Vajda Fig. 2 and [0080], where the input image comprises pixels corresponding to a person); 
obtaining a preset body model feature of a body frame image, the body model feature comprising locations of joints in the body frame image (see Vajda [0126]-[0128], where a pose model comprises keypoints which correspond to predefined body joints; see also Vajda Fig. 7); 
inputting the to-be-recognized feature and the body model feature to a pre-constructed joint prediction model (see Vajda [0094], where the target regional feature map is processed using a third neural network to generate keypoint mask associated with each detected person; see Vajda [0126]-[0128], where the pose model is used to correct the keypoint predictions), 
the joint prediction model being obtained through training a neural network (see Vajda [0095], where the keypoint head is trained using ground truth keypoints by minimizing the cross-entropy loss over a softmax output); and 
obtaining predicted locations of joints of the target person in the to-be-recognized image based on the joint prediction model (see Vajda [0126]-[0128], where keypoint predictions from the keypoint head are obtained).
Vajda does not explicitly disclose that the neural network is trained using minimum respective differences between real locations of joints in a sample image and predicted locations of the corresponding joints as a training objective.
Park teaches, in a related and pertinent 3D human pose estimation method using convolutional neural networks (see Park Abstract), that minimizing a cross entropy loss function is based on the Euclidean distance between a ground truth position and estimated positions of a joint (see Park sect. 3.1 Structure of the Baseline CNN, and Eq. (1)-(3)). 
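For illustration only, the kind of training objective discussed above (a difference between real and predicted joint locations, measured as a Euclidean distance) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Vajda, Park, or the application: the function name `mean_joint_distance`, the 2D `(x, y)` joint format, and the sample coordinates are all hypothetical, and Park's Eq. (1)-(3) may differ in form.

```python
import math


def mean_joint_distance(predicted, ground_truth):
    """Mean Euclidean distance between corresponding joints.

    Both arguments are equal-length lists of (x, y) tuples; joint i in
    `predicted` is compared with joint i in `ground_truth`.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("joint lists must have the same length")
    if not predicted:
        raise ValueError("at least one joint is required")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


# Hypothetical coordinates: the first joint is off by a 3-4-5 triangle,
# the second joint is predicted exactly.
predicted = [(3.0, 4.0), (10.0, 10.0)]
ground_truth = [(0.0, 0.0), (10.0, 10.0)]
loss = mean_joint_distance(predicted, ground_truth)  # (5.0 + 0.0) / 2 = 2.5
```

A training procedure that treats this value as its objective would adjust the network so that the value decreases across the sample images.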
At the time of filing, one of ordinary skill in the art would have found it obvious to apply the 

Regarding claim 2, please see the above rejection of claim 1. Vajda and Park disclose the image recognition method according to claim 1, further comprising: 
obtaining a body posture of the target person in the to-be-recognized image outputted by the joint prediction model, the body posture comprising joint locations in the to-be-recognized image (see Vajda [0134]-[0142], where a candidate pose is selected to represent the body depicted in the image).

Regarding claim 6, please see the above rejection of claim 1. Vajda and Park disclose the image recognition method according to claim 1, the obtaining a to-be-recognized feature of a to-be-recognized image comprising: 
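obtaining successive multi-frame video images in a video and using each frame of the video image as the to-be-recognized image respectively, to obtain the to-be-recognized feature of the to-be-recognized image (see Vajda [0082], where video frame images are accessed and the frames of the video are processed; and see Vajda [0115], where the video may be a live video).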

Regarding claim 7, please see the above rejection of claim 6. Vajda and Park disclose the image recognition method according to claim 6, further comprising: 
obtaining tracking information of the target person based on predicted locations of joints in the multi-frame video images (see Vajda [0082], where video frame images are accessed and the frames of the video are processed; see Vajda Fig. 9 and [0143], where keypoints are tracked upon the people in the video frame image).

Regarding claim 8, it recites an apparatus performing the method of claim 1. Vajda and Park teach an apparatus performing the method of claim 1 (see Vajda [0165]-[0166], where a computer system is disclosed to perform the disclosed method). Please see above for detailed claim analysis, with the exception to the following further limitations:
Please see the above rejection for claim 1, as the rationale to combine the teachings of Vajda and Park is similar, mutatis mutandis.

Regarding claim 14, it recites a non-transitory computer readable medium for performing the method of claim 1. Vajda and Park teach a non-transitory computer readable medium for performing the method of claim 1 (see Vajda [0168]-[0169], where memory, such as RAM, storing instructions for a processor to perform the disclosed method is taught). Please see above for detailed claim analysis, with the exception to the following further limitations:
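Please see the above rejection for claim 1, as the rationale to combine the teachings of Vajda and Park is similar, mutatis mutandis.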

Regarding claim 15, see above rejection for claim 14. It is a computer readable storage medium claim reciting similar subject matter as claim 2. Please see above claim 2 for detailed claim analysis as the limitations of claim 15 are similarly rejected.

Claims 3-5, 9-13, and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Vajda and Park as applied to claims 1, 8, and 15 above, and further in view of Fu et al. (“ORGM: Occlusion Relational Graphical Model for Human Pose Estimation”), herein Fu.
Regarding claim 3, please see the above rejection of claim 1. Vajda and Park disclose the image recognition method according to claim 1, further comprising: 
obtaining a plurality of positive sample images to obtain a plurality of sample images, the positive sample image comprising a person (see Vajda [0101], where the training data set includes ground truths or labels that indicate known bounding boxes for object instance of interest, e.g. people); 
obtaining training features of the sample images respectively (see Vajda [0124], where feature maps and regional feature maps are generated for each training image); and 
using the training features of the sample images and the body model feature as inputs of the neural network, to obtain the joint prediction model through training (see Vajda [0095], where the ground truth keypoints are used to train the keypoint head neural network to generate a keypoint mask associated with each detected person; see Vajda [0126]-[0128], where a training dataset of training poses is used to learn a transformation function for determining the pose model).
Vajda and Park do not explicitly disclose obtaining a plurality of negative sample images and the negative sample image comprising no person.
Fu teaches in a related and pertinent human pose estimation method (see Fu Abstract), where non-person images are used as negative training samples (see Fu sect. IV. B. Implementation Details).
At the time of filing, one of ordinary skill in the art would have found it obvious to apply the teachings of Fu to the teachings of Vajda and Park, such that the training of the neural networks of Vajda and Park would include negative training samples which are non-person images. This modification is rationalized as an application of a known technique to a known device ready for improvement to yield predictable results. In this instance, Vajda and Park disclose a base method for determining a candidate pose of an imaged person based on a neural network architecture which is trained using a training data set that includes ground truths or labels that include people. Fu teaches a known technique of training a neural network, where negative training samples which are non-person images are used. One of ordinary skill in the art would have recognized that applying Fu’s technique would allow the method of Vajda and Park to implement neural network training using a training data set which includes non-person images as negative training samples, predictably leading to more robustly trained neural network models. 

Regarding claim 4, please see the above rejection of claim 3. Vajda, Park, and Fu disclose the image recognition method according to claim 3, the using the training features of sample images and the body model feature as inputs of the neural network, to obtain the joint prediction model through training comprising: 
inputting the training features of the sample images and the body model feature to the neural network (see Vajda [0095], the ground truth keypoints are used to train the keypoint head neural network to generate keypoint mask associated with each detected person; see Vajda [0126]-[0128], where a training dataset of training poses are used to learn a transformation function for determining the pose model); 
obtaining predicted locations of joints in sample images outputted by the neural network (see Vajda [0095], the ground truth keypoints are used to train the keypoint head neural network to generate keypoint mask associated with each detected person); 
comparing the predicted locations of the joints in the sample images with real locations of the joints in the corresponding sample images respectively, to obtain comparison results (see Vajda [0095], where the keypoint head is trained using ground truth keypoints by minimizing the cross-entropy loss over a softmax output; see Park sect. 3.1 Structure of the Baseline CNN, and Eq. (1)-(3), where minimizing a cross entropy loss function is based on the Euclidean distance between a ground truth position and estimate positions of a joint);
updating a body extraction parameter and an alignment parameter based on the comparison results, the body extraction parameter being used to extract the person from a background environment in the sample image (see Vajda [0102], where the training dataset is used to train a neural network to learn to generate refined segmentation masks that indicate the object of interest and that are used to compute errors to update the network) and the alignment parameter being used to represent respective correspondences between locations of joints in the body model feature and the predicted locations of the joints in the sample images (see Vajda [0102], where the training dataset is used to train a neural network to generate a one-hot mask for each keypoint of interest, and the masks are used to compute errors to update the network); and 
inputting the training features of the sample images and the body model feature to the neural network, until the comparison result satisfies a termination condition, to obtain the joint prediction model (see Vajda [0095], where the keypoint head is trained using ground truth keypoints by minimizing the cross-entropy loss over a softmax output; see Park sect. 3.1 Structure of the Baseline CNN, and Eq. (1)-(3), where minimizing a cross entropy loss function is based on the Euclidean distance between a ground truth position and estimate positions of a joint).
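For illustration only, the sequence recited in claim 4 (input, predict, compare with real locations, update parameters, repeat until a termination condition is satisfied) can be sketched as a small iterative loop. This is a minimal sketch under stated assumptions and is not the method of Vajda, Park, Fu, or the application: a single 2D offset stands in for the trainable parameters, and the function name `train_offset`, the learning rate, the tolerance, and the sample coordinates are all hypothetical.

```python
def train_offset(samples, learning_rate=0.1, tolerance=1e-6, max_iters=10_000):
    """Fit a 2D offset (dx, dy) that maps model joints onto real joints.

    `samples` is a list of (model_joint, real_joint) pairs of (x, y) tuples.
    The offset is a hypothetical stand-in for trainable network parameters.
    Returns ((dx, dy), last_loss, iterations_used).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)
    dx, dy = 0.0, 0.0
    loss = float("inf")
    for iteration in range(max_iters):
        # Compare predicted and real joint locations (mean squared distance).
        loss = grad_x = grad_y = 0.0
        for (mx, my), (rx, ry) in samples:
            ex, ey = (mx + dx) - rx, (my + dy) - ry
            loss += ex * ex + ey * ey
            grad_x += 2 * ex
            grad_y += 2 * ey
        loss /= n
        # Termination condition: stop once the comparison result is small.
        if loss < tolerance:
            return (dx, dy), loss, iteration
        # Update the parameters from the comparison result (gradient step).
        dx -= learning_rate * grad_x / n
        dy -= learning_rate * grad_y / n
    return (dx, dy), loss, max_iters


# Hypothetical data: every real joint is the model joint shifted by (2, -1).
samples = [((0.0, 0.0), (2.0, -1.0)), ((1.0, 1.0), (3.0, 0.0))]
offset, final_loss, steps = train_offset(samples)
```

In this toy setting the loop stops as soon as the mean squared distance falls below `tolerance`, which plays the role of the termination condition; a real network would replace the offset with its weights and the hand-written gradient with backpropagation.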

Regarding claim 5, please see the above rejection of claim 4. Vajda, Park, and Fu disclose the image recognition method according to claim 4, the updating a body extraction parameter and an alignment parameter based on the comparison results comprising: 
updating, in response to the comparison results comprise a first comparison result, the body extraction parameter based on the first comparison result (see Vajda [0102], where the training dataset is used to train a neural network to learn to generate refined segmentation masks that indicate the object of interest and that are used to compute errors to update the network), 
wherein the first comparison result comprises a comparison result between a predicted location of a first joint of the joints in the sample image and a real location of the first joint; and the predicted location of the first joint is located in a background environment area of the corresponding sample image (see Vajda [0102], where the generated segmentation mask is compared with a ground truth segmentation mask, and computed errors suggest that the generated mask is not at a real location of the object of interest and is in the background, where the segmented mask of the object of interest reads upon the broadest reasonable interpretation for first joint location); and 
updating, in response to the comparison results comprise a second comparison result, the alignment parameter based on the second comparison result (see Vajda [0102], where the training dataset is used to train a neural network to generate a one-hot mask for each keypoint of interest, and the masks are used to compute errors to update the network), 
wherein the second comparison result comprises a comparison result between a predicted location of a second joint of the joints in the sample image and a real location of the second joint, the predicted location of the second joint is located in a location area of the person in the corresponding sample image, and the predicted location of the second joint is different from the real location of the second joint (see Vajda [0102], where the generated mask is compared with the corresponding ground truth mask, and computed errors suggest that the generated mask location is different from the true location).

Regarding claim 9, please see the above rejection of claim 8. Vajda, Park, and Fu disclose the image recognition apparatus according to claim 8, further comprising: a third obtaining module, configured to obtain a plurality of positive sample images and a plurality of negative sample images respectively, to obtain a plurality of sample images, the positive sample image comprising a person, and the negative sample image comprising no person (see Vajda [0101], where the training data set includes ground truths or labels that indicate known bounding boxes for object instance of interest, e.g. people; see Fu sect. IV. B. Implementation Details, where non-person images are used as negative training samples).
Please see the above rejection for claim 3, as the rationale to combine the teachings of Vajda, Park, and Fu is similar, mutatis mutandis.

Regarding claim 10, please see the above rejection of claim 9. Vajda, Park, and Fu disclose the image recognition apparatus according to claim 9, further comprising: a fourth obtaining module, configured to obtain training features of the sample images respectively (see Vajda [0124], where feature maps and regional feature maps are generated for each training image); and a training module, configured to use the training features of the sample images and the body model feature as an input of the neural network, to obtain the joint prediction model through training (see Vajda [0095], the ground truth keypoints are used to train the keypoint head neural network to generate keypoint mask associated with each detected person; see Vajda [0126]-[0128], where a training dataset of training poses are used to learn a transformation function for determining the pose model).

Regarding claim 11, please see the above rejection of claim 10. Vajda, Park, and Fu disclose the image recognition apparatus according to claim 10, wherein the training module comprises: 
an input unit, configured to input the training features of the sample images and the body model feature to the neural network (see Vajda [0095], the ground truth keypoints are used to train the keypoint head neural network to generate keypoint mask associated with each detected person; see Vajda [0126]-[0128], where a training dataset of training poses are used to learn a transformation function for determining the pose model); 
a first obtaining unit, configured to obtain predicted locations of joints in sample images outputted by the neural network (see Vajda [0095], the ground truth keypoints are used to train the keypoint head neural network to generate keypoint mask associated with each detected person); 
a comparison unit, configured to compare the predicted locations of the joints in the sample images with real locations of the joints in the corresponding sample images respectively, to obtain comparison results (see Vajda [0095], where the keypoint head is trained using ground truth keypoints by minimizing the cross-entropy loss over a softmax output; see Park sect. 3.1 Structure of the Baseline CNN, and Eq. (1)-(3), where minimizing a cross entropy loss function is based on the Euclidean distance between a ground truth position and estimate positions of a joint); and 
an updating unit, configured to update a body extraction parameter and an alignment parameter based on the comparison results, the body extraction parameter being used to extract the person from a background environment in the sample image (see Vajda [0102], where the training dataset is used to train a neural network to learn to generate refined segmentation masks that indicate the object of interest and that are used to compute errors to update the network); 
the alignment parameter being used to represent respective correspondences between locations of joints in the body model feature and the predicted locations of the joints in the sample images (see Vajda [0102], where the training dataset is used to train a neural network to generate a one-hot mask for each keypoint of interest, and the masks are used to compute errors to update the network).

Regarding claim 12, please see the above rejection of claim 11. Vajda, Park, and Fu disclose the image recognition apparatus according to claim 11, wherein the training module comprises: a triggering unit, configured to trigger the input unit if the comparison results do not meet a termination condition; and obtain the joint prediction model if the comparison results meet the termination condition (see Vajda [0095], where the keypoint head is trained using ground truth keypoints by minimizing the cross-entropy loss over a softmax output; see Park sect. 3.1 Structure of the Baseline CNN, and Eq. (1)-(3), where minimizing a cross entropy loss function is based on the Euclidean distance between a ground truth position and estimate positions of a joint; where the combined teachings suggest that the termination condition is met when the cross-entropy loss is minimized and training continues until that condition is met).

Regarding claim 13, please see the above rejection of claim 11. Vajda, Park, and Fu disclose the image recognition apparatus according to claim 11, wherein the first obtaining module is further configured to: obtain successive multi-frame video images in a video and using each frame of the video image as the to-be-recognized image respectively, to obtain the to-be-recognized feature of the to-be-recognized image (see Vajda [0082], where video frame images are accessed and the frames of the video are processed; and see Vajda [0115], where the video may be a live video).

Regarding claim 16, see above rejection for claim 15. It is a computer readable storage medium claim reciting similar subject matter as claim 3. Please see above claim 3 for detailed claim analysis as the limitations of claim 16 are similarly rejected.
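Please see the above rejection for claim 3, as the rationale to combine the teachings of Vajda, Park, and Fu is similar, mutatis mutandis.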

Regarding claim 17, see above rejection for claim 16. It is a computer readable storage medium claim reciting similar subject matter as claim 4. Please see above claim 4 for detailed claim analysis as the limitations of claim 17 are similarly rejected.

Regarding claim 18, see above rejection for claim 17. It is a computer readable storage medium claim reciting similar subject matter as claim 5. Please see above claim 5 for detailed claim analysis as the limitations of claim 18 are similarly rejected.

Regarding claim 19, see above rejection for claim 18. It is a computer readable storage medium claim reciting similar subject matter as claim 6. Please see above claim 6 for detailed claim analysis as the limitations of claim 19 are similarly rejected.

Regarding claim 20, see above rejection for claim 19. It is a computer readable storage medium claim reciting similar subject matter as claim 7. Please see above claim 7 for detailed claim analysis as the limitations of claim 20 are similarly rejected.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TIMOTHY WING HO CHOI whose telephone number is (571)270-3814.  The examiner can normally be reached on 9:00 AM to 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, VINCENT RUDOLPH can be reached on (571) 272-8243.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/TIMOTHY CHOI/Examiner, Art Unit 2661

/VINCENT RUDOLPH/Supervisory Patent Examiner, Art Unit 2661