Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
Applicant’s response to the last office action, filed January 5, 2022, has been entered and made of record. Claims 1-20 have been amended. Claims 1-20 are pending in this application.

Response to Arguments
Applicant’s arguments with respect to claims 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.


The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1, 3-4, 7-9, 11-12, and 15-17 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Kaihao et al, (“Facial Expression Recognition Based on Deep Evolutional Spatial-Temporal Networks”, IEEE Transactions on Image Processing, Vol. 26, No. 9, September 2017).

In regards to claim 1, Kaihao discloses an image recognition method, performed by a terminal, and comprising:
obtaining a target video comprising a target object, (see at least: Fig.1, obtaining sequence facial images);
extracting a target video frame image from the target video, (see at least: Fig. 1, lower part, under MSCNN, extracting the input 64x64@1, “target video frame image”, from the input sequence facial images, “the target video”);
generating a key point video frame sequence comprised of a plurality of key point video frames according to key point information of the target object and a plurality of video frames in the target video, (see at least: upper part of Fig. 1 under “PHRNN”, generating facial landmarks video frame sequence, “keypoint video frame sequence”, according to the input consecutive frames, “plurality of video frames in the target video”. Further, Page 4198, right-hand-column, under section B, extracting facial landmarks based on position
extracting dynamic timing feature information of the key point video frame sequence by using an RNN model, (see at least: Abstract, and Page 4193, right-hand-column, 3rd paragraph, extracting temporal features based on facial landmarks from motion over time based on Part-based Hierarchical Recurrent Neural Network (PHRNN), which is effective to capture the dynamic variation of the facial physical structure);
extracting static structural feature information of the target video frame image describing the structure of the target object by using a convolutional neural network model, (see at least: Page 4194, left-hand-column, 2nd paragraph, using a Multi-Signal Convolutional Neural Network (MSCNN) to extract spatial features from still frames. Further, right-hand-column, under section III, “a spatial network based on MSCNN is constructed to extract static features from still frames”, [i.e., extracting static structural feature information of the target video frame image describing the spatial feature of the target face, according to the spatial network based on MSCNN, “structure of the target object”]);
recognizing an attribute type corresponding to a motion or an expression of the target object presented in the target video according to the dynamic timing feature information of the key point video frame sequence and the static structural feature information of the target video frame image, (see at least: Page 4197, under section C, “model fusion”, estimating facial expressions, “see equation 14”, based on the captured dynamic features and the extracted static features. Further, Page 4202, under section 3, “confusion matrix”, the confusion matrix in tables V, VI, VII, for spatial temporal networks, shows the performance of the model fusion MSCNN-PHRNN for detecting the expression

In regards to claim 3, Kaihao further discloses wherein extracting static structural feature information of the target video frame image comprises:
inputting the target video frame image into the convolutional neural network model, (see at least: Fig. 1, lower part, the input frame image 64x64@1 is cropped and then input to the MSCNN); and
extracting the static structural feature information of the target video frame image through convolution processing and pooling processing of the convolutional neural network model, (see at least: Pages 4194-4195, section III, a spatial network based on MSCNN is constructed to extract static features from still frames. Further, Fig. 1, “lower part”, the pooling part of MSCNN: (“Conv+Pool+Norm 28x28@10”, “Conv+Pool+Norm 12x12@20”, “Conv+Pool+Norm 4x4@40”), corresponds to the pooling processing of the convolutional neural network model).
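
As an illustration of the convolution processing and pooling processing mapped above, the following minimal Python sketch traces how the feature-map sizes quoted from Fig. 1 (64x64, 28x28, 12x12, 4x4) could arise. The kernel sizes (9, 5, 5), unit stride, absence of padding, and 2x2 non-overlapping pooling are assumptions chosen only so that the arithmetic reproduces those sizes; they are not taken from the reference, and the function names are hypothetical.

```python
def conv_pool_size(size, kernel, pool=2):
    """Side length after a stride-1, unpadded convolution followed by
    non-overlapping pooling with window `pool`."""
    conv_out = size - kernel + 1      # unpadded convolution shrinks the map
    return conv_out // pool           # pooling divides the side length


def trace_sizes(input_size, kernels):
    """Return the side length of the feature map after each Conv+Pool stage."""
    sizes = [input_size]
    for kernel in kernels:
        sizes.append(conv_pool_size(sizes[-1], kernel))
    return sizes


# Assumed kernel sizes (not stated in the reference).
ASSUMED_KERNELS = (9, 5, 5)

print(trace_sizes(64, ASSUMED_KERNELS))  # [64, 28, 12, 4]
```

Under these assumptions, each Conv+Pool stage reduces the spatial size while the channel count (@10, @20, @40) grows, which is the usual trade-off in a convolutional feature extractor.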

In regards to claim 4, Kaihao further discloses wherein recognizing the attribute type comprises:
recognizing, according to a classifier in the RNN model, matching degrees between the dynamic timing feature information of the key point video frame sequence and a plurality of attribute type features in the RNN model, and associating the matching degrees obtained through the dynamic timing feature information of the key point video frame sequence with label information corresponding to the plurality of attribute type 
recognizing, according to a classifier in the convolutional neural network model, matching degrees between the static structural feature information of the target video frame image and a plurality of attribute type features in the convolutional neural network model, and associating the matching degrees obtained through the static structural feature information of the target video frame image with label information corresponding to the plurality of attribute type features in the convolutional neural network model, to obtain a second label information set, (see at least: Fig. 1, and Pages 4199-4200, and sections IV.A, and IV.C, under “Experiments”, and Fig. 6, the deep spatial network (MSCNN) takes the detected facial images as input, and implicitly recognizes the matching degree between the static structural feature information of the target video frame image and the plurality of facial expression types from the database of the MSCNN, using the plurality of pooling stages of the MSCNN to obtain a second label information set, as shown in Fig. 1, and Tables I, II, III, “MSCNN”); and
fusing the first label information set and the second label information set, to obtain the attribute type corresponding to the target object in the target video, (see at least: Fig. 1, and Page 4197, “Fusing Model”, and Tables I, II, and III, “PHRNN+MSCNN”, the temporal network (PHRNN) and spatial network (MSCNN) are combined).

In regards to claim 7, Kaihao further discloses the method, further comprising:
obtaining a first sample image and a second sample image, (see at least: Fig. 2, Page 4197, input a pair of facial images, which correspond to the first and second sample images);
extracting static structural feature information of the first sample image; and extracting static structural feature information of the second sample image, (see at least: Fig. 2, Page 4197, extracting the static facial information from the input images, using MSCNN, [i.e., implicitly extracting static structural feature information from the first and second sample images, using the MSCNN]); and 
determining a model loss value according to the static structural feature information of the first sample image and the static structural feature information of the second sample image, (see at least: Fig. 2, Page 4197, the recognition and verification signals correspond to two different loss functions, which can be combined to update all weights of the model).

In regards to claim 8, Kaihao further discloses the method, wherein determining the model loss values comprises:
generating a first recognition loss value of the first sample image, (see at least: Fig. 2, Page 4197, Feature 1, corresponds to the “first recognition loss value of the first sample image”);
generating a second recognition loss value of the second sample image, (see at least: Fig. 2, Page 4197, Feature 2, corresponds to the “second recognition loss value of the second sample image”);
generating a verification loss value according to the static structural feature information of the first sample image, and the static structural feature information of the second sample image, (see at least: Fig. 2, Page 4197, the distance metric (VeLoss) and Softmax layer (ReLoss), represent the verification loss value, which is implicitly generated based on the static structural feature information of the first sample image, and the static structural feature information of the second sample image according to the MSCNN using the pair of facial images); and
generating the model loss value according to the first recognition loss value of the first sample image, the second recognition loss value of the second sample image, and the verification loss value, (see at least: Fig. 2, Page 4197, the recognition and verification signals correspond to two different loss functions, which can be combined to update all weights of the model, where the “FiLoss”, as shown in Fig. 2, corresponds to the model loss value).
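
For readers less familiar with this kind of training signal, the following Python sketch shows one way two recognition losses and one verification loss could be combined into a single model loss. It is an illustration only: the cross-entropy form of the recognition loss, the contrastive form of the verification loss, the `margin` and `ve_weight` parameters, and all function names are assumptions, and are not the formulas of the reference or of the claims.

```python
import math


def recognition_loss(probs, label):
    """Cross-entropy for one sample: negative log-probability of the true label."""
    return -math.log(probs[label])


def verification_loss(feat1, feat2, same_label, margin=1.0):
    """Contrastive-style loss on the Euclidean distance between two feature
    vectors (assumed form): pull same-label pairs together, push
    different-label pairs at least `margin` apart."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat1, feat2)))
    if same_label:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2


def model_loss(re_loss_1, re_loss_2, ve_loss, ve_weight=1.0):
    """Weighted sum of both recognition losses and the verification loss."""
    return re_loss_1 + re_loss_2 + ve_weight * ve_loss


# Usage with made-up softmax outputs and features for a pair of sample images.
re1 = recognition_loss([0.7, 0.2, 0.1], label=0)
re2 = recognition_loss([0.1, 0.8, 0.1], label=1)
ve = verification_loss([0.2, 0.4], [0.9, 0.1], same_label=False)
print(model_loss(re1, re2, ve))
```

The point of the sketch is the structure: each sample image contributes its own recognition term, while the verification term depends on the features of both images at once.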

Regarding claim 9, claim 9 recites substantially similar limitations as set forth in claim 1. As such, claim 9 is rejected for at least similar rationale.
The Examiner further acknowledged the following additional limitation(s): “an image recognition apparatus, comprising: a memory and a processor coupled to the memory, the processor being configured”. However, Phan et al discloses the “image recognition apparatus, comprising: a memory and a processor coupled to the memory”, (Phan et al, see at least: col. 1, lines 41-42, “apparatus”, col. 4, lines 39-43, “processor 102”, “memory”).

Regarding claim 11, claim 11 recites substantially similar limitations as set forth in claim 3. As such, claim 11 is rejected for at least similar rationale.

Regarding claim 12, claim 12 recites substantially similar limitations as set forth in claim 4. As such, claim 12 is rejected for at least similar rationale.

Regarding claim 15, claim 15 recites substantially similar limitations as set forth in claim 7. As such, claim 15 is rejected for at least similar rationale.

Regarding claim 16, claim 16 recites substantially similar limitations as set forth in claim 8. As such, claim 16 is rejected for at least similar rationale.

Regarding claim 17, claim 17 recites substantially similar limitations as set forth in claim 1. As such, claim 17 is rejected for at least similar rationale.
The Examiner further acknowledged the following additional limitation(s): “a non-transitory computer-readable storage medium, storing a computer-readable instruction, the computer-readable instruction, when executed by one or more processors, causing the one or more processors to perform the following operations”. However, Phan et al discloses the “non-transitory computer-readable storage medium, storing a computer-readable instruction, the computer-readable instruction, when executed by one or more processors, causing the one or more processors to perform the following operations”, (Phan et al, see at least: col. 13, lines 50-56, one or more machine-readable instructions).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2, 10, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Kaihao et al, (“Facial Expression Recognition Based on Deep Evolutional Spatial-Temporal Networks”, IEEE Transactions on Image Processing, Vol. 26, No. 9, September 2017), in view of Tao et al, (CN 105469065, “based on English machine translation”).

In regards to claim 2, Kaihao et al discloses the limitations of claim 1.
Furthermore, Kaihao et al discloses wherein extracting dynamic timing feature information of the key point video frame sequence comprises: inputting the unit key point rd paragraph, we propose a PHRNN model to extract dynamic geometry information), and connecting the dynamic timing feature information to obtain the dynamic timing feature information of the key point video frame sequence, (see at least: Pages 4195-4196, under section A.1), Local features are concatenated along the feature extraction cascade, while the global high-level features are formed in the upper layers based on the facial morphological variations and dynamically evolutional properties of expression. To model the neighboring landmarks, we combine the representations of eyebrows and eyes to obtain a new representation in the L2. Followed by two BRNNs in the L3 and L4, we obtain the features of the eyebrow-eye, nose and mouth. The representations of eyebrow-eye and nose are concatenated to obtain the upper half face while the representations of nose and mouth are concatenated to obtain the bottom half face in the L5, then fed into two BRNNs in the L6 and L7. We can obtain the representation of the whole face in the L8. The temporal dynamics of the whole face are fed into a BRNN in the L9 and a fully connected layer in the L10, [i.e., implicitly connecting the dynamic timing feature information using the BRNNs of PHRNN to obtain the dynamic timing feature information of the key point video frame sequence]).
Kaihao et al does not expressly disclose extracting key marker areas from the key point video frame sequence; obtaining unit key point video frame sequences according to the key marker areas in the key point video frames.


Kaihao et al and Tao et al are combinable because they are both concerned with object recognition. Therefore, it would have been obvious to a person of ordinary skill in the art to modify Kaihao et al to use the discrete emotion recognition method based on a recurrent neural network, as taught by Tao et al, in order to make full use of the dynamic information in the emotion expression process, so as to realize the accurate recognition of the emotions of the participants in the video, (Tao et al, Par. 0006).

The following prior art of record, Tamrakar et al, (US-PGPUB 2018/0239975), also discloses extracting key marker areas from each key point video frame in the key point video frame sequence; obtaining unit key point video frame sequences each being formed by same key marker areas in the key point video frames, (see at least: Fig. 24, and Par. 0089, identifying front face area 2402 that is identified in the captured image 2400, and identifying inside the identified front face area 2402, the facial features and landmarks, including eyes, noses, and mouths on the face, [i.e., extracting key marker areas from each key point video frame in the key point video frame sequence]. Further, tracked by dots 2408 and lines 2410 connecting the dots 2408. The dots 2408 may be annotated by experts or may be identified by the image processing device 2310, [i.e., obtaining unit key point video frame sequences each being formed by same key marker areas in the key point video frames based implicitly on tracking the facial features, landmarks, and the dots over the video frames]).

Regarding claim 10, claim 10 recites substantially similar limitations as set forth in claim 2. As such, claim 10 is rejected for at least similar rationale.

Regarding claim 18, claim 18 recites substantially similar limitations as set forth in claim 2. As such, claim 18 is rejected for at least similar rationale.

Allowable Subject Matter
Claims 5-6 and 13-14 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

With respect to claim 5, the prior art of record, alone or in reasonable combination, does not teach or suggest, the following limitation(s), (in consideration of the claim as a whole):  
“recognizing, according to a classifier in the RNN model, matching degrees between the fused feature information and a plurality of attribute type features in the RNN 

The relevant prior art of record, Kaihao, (“Facial Expression Recognition Based on Deep Evolutional Spatial-Temporal Networks”, IEEE Transactions on Image Processing, Vol. 26, No. 9, September 2017), discloses fusing the dynamic timing feature information of the key point video frame sequence and the static structural feature information of the target video frame image, to obtain fused feature information, (see at least: Fig. 1, and Page 4197, “Model Fusion”, the temporal network (PHRNN) and spatial network (MSCNN) are combined); but fails to teach or suggest, either alone or in combination with the other cited references, recognizing, according to a classifier in the RNN model, matching degrees between the fused feature information and a plurality of attribute type features in the RNN model, and associating the matching degrees obtained through the RNN model with label information corresponding to the plurality of attribute type features in the RNN model, to obtain a first label information set; recognizing, according to a classifier in the convolutional neural network model, matching degrees between the fused feature information and a plurality of attribute type features in the convolutional neural network model, and associating the matching degrees obtained through the convolutional neural network model with label information corresponding to the plurality of attribute type features in the convolutional neural network model, to obtain a second label information set; and fusing the first label information set and the second label information set, to obtain the attribute type corresponding to the target object in the target video.

A further prior art of record, Qing et al, (US-PGPUB 2019/0311188) discloses fusing the dynamic timing feature information of the key point video frame sequence and the static structural feature information of the target video frame image, to obtain fused feature information, (see at least: Fig. 3, and Par. 0012, 0022, and 0038); but fails to teach or suggest, either alone or in combination with the other cited references, the above limitations (as combined with the other claimed limitations).

With respect to claim 6, the prior art of record, alone or in reasonable combination, does not teach or suggest, the following limitation(s), (in consideration of the claim as a whole):  
“performing weighted averaging on the matching degrees associated with the first label information set and the second label information set, to obtain a target label information set; and extracting label information from the target label information set to obtain extracted label information, and using the extracted label information as the attribute type”.
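
To make the quoted limitation concrete, the following Python sketch shows one possible reading of weighted averaging over two label information sets. It is a hedged illustration only: representing each set as a mapping from label to matching degree, the `weight_first` parameter, and selecting the label with the highest averaged matching degree as the extracted label are all assumptions, and the function names are hypothetical.

```python
def fuse_label_sets(first, second, weight_first=0.5):
    """Weighted average of matching degrees for labels present in both sets."""
    weight_second = 1.0 - weight_first
    shared_labels = first.keys() & second.keys()
    return {
        label: weight_first * first[label] + weight_second * second[label]
        for label in shared_labels
    }


def pick_attribute_type(target_set):
    """Assumed extraction rule: take the label with the highest matching degree."""
    return max(target_set, key=target_set.get)


# Made-up matching degrees from a temporal model and a spatial model.
first_set = {"happy": 0.6, "sad": 0.3, "surprise": 0.1}
second_set = {"happy": 0.2, "sad": 0.7, "surprise": 0.1}

target_set = fuse_label_sets(first_set, second_set, weight_first=0.5)
print(pick_attribute_type(target_set))  # sad
```

With equal weights, "sad" averages to about 0.5 against about 0.4 for "happy", so the fused result can differ from the top label of either set taken alone.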


Other prior art listed on the attached form PTO-892 shows the aspect of performing weighted averaging on the matching degrees, but none, either alone or in combination, teaches or suggests all the claimed limitations.
-- Zhang et al, (US-PGPUB 2021/0256979), discloses calculating a weighted average value of the first matching degree and the second matching degree.
-- Liang et al, (US-PGPUB 2018/0204132), discloses calculating a weighted average of a matching degree with each dictionary.

Regarding claim 13, claim 13 recites substantially similar limitations as set forth in claim 5. As such, claim 13 is in condition for allowance, for at least similar reasons, as stated above.

Regarding claim 14, claim 14 recites substantially similar limitations as set forth in claim 6. As such, claim 14 is in condition for allowance, for at least similar reasons, as stated above.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.



Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to AMARA ABDI whose telephone number is (571)270-1670. The examiner can normally be reached 9:00am-5:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le, can be reached at (571)272-7332. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/AMARA ABDI/Primary Examiner, Art Unit 2668
03/18/2022