Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION


Response to Arguments

1.	Applicant’s arguments, see Applicant’s remarks filed on 08/19/2021, with respect to the rejection(s) of claims 1-37 under 35 U.S.C. § 103(a) have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made over Gould et al. (U.S. PAT. 10,460,175, hereinafter “Gould”) in view of Dominik Lorenz et al. ("Unsupervised Part-Based Disentangling of Object Shape and Appearance", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1 June 2019 (2019-06-01), pages 10955-10964, hereinafter “Dominik”; provided by Applicant’s IDS filed on 08/31/2021).

Information Disclosure Statement

2.	The information disclosure statement (IDS) submitted on 08/31/2021 has been considered by Examiner and made of record in the application file.

Claim Rejections - 35 USC §103


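3.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.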

4.	Claims 1-3, 8-10, 14-16, 20-22, 26-28 and 32-34 are rejected under 35 U.S.C. 103 as being unpatentable over Gould et al. (U.S. PAT. 10,460,175, hereinafter “Gould”) in view of Dominik Lorenz et al. ("Unsupervised Part-Based Disentangling of Object Shape and Appearance", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1 June 2019 (2019-06-01), pages 10955-10964, hereinafter “Dominik”; provided by Applicant’s IDS filed on 08/31/2021).

Consider claim 1, Gould teaches a processor comprising: one or more arithmetic logic units (ALUs) (col. 7, lines 18-28) to be configured to infer frames for a video using one or more neural networks (col. 2, line 51 through col. 3, line 7).
Gould does not explicitly teach that the one or more neural networks are trained using one or more temporal pose representations.
In the same field of endeavor, Dominik teaches one or more neural networks trained using one or more temporal pose representations (fig. 2, i.e., at least two temporal pose representations x and x°s are used to train the neural networks).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to train the one or more neural networks of Gould using one or more temporal pose representations, as taught by Dominik, in order to outperform the state of the art on unsupervised keypoint prediction and to compare favorably even against supervised approaches on the task of shape and appearance transfer.

Consider claim 2, Dominik further teaches wherein video frames used to train the one or more neural networks are used to generate the one or more temporal pose representations (fig. 2, page 10958, section 3.4, “Unsupervised Learning of Part-based Shape and Appearance”, i.e., at least two temporal pose representations x and x°s are used to train the neural networks) and a time-invariant appearance representation (page 10957, section 3.2, “Invariance and Equivariance”).

Consider claim 3, Dominik further teaches wherein color jittering and thin-plate-spline (TPS) warping enforce appearance invariance and localization properties for the one or more temporal pose representations (page 10959, section 3.5, “Implementation Details”, i.e., thin-plate-spline transformations are applied to the input image, and section 1, “Introduction”, i.e., object deformation leads to complicated "recoloring". Therefore, it is obvious that color jittering is applied in addition to TPS to achieve the time invariance).

Consider claim 8, the previous rejections of claim 1 apply mutatis mutandis to corresponding claim 8.

Consider claim 9, the previous rejections of claim 2 apply mutatis mutandis to corresponding claim 9.

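Consider claim 10, the previous rejections of claim 3 apply mutatis mutandis to corresponding claim 10.

Consider claim 14, the previous rejections of claim 1 apply mutatis mutandis to corresponding claim 14.
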
Consider claim 15, the previous rejections of claim 2 apply mutatis mutandis to corresponding claim 15.

Consider claim 16, the previous rejections of claim 3 apply mutatis mutandis to corresponding claim 16.

Consider claim 20, the previous rejections of claim 1 apply mutatis mutandis to corresponding claim 20.

Consider claim 21, the previous rejections of claim 2 apply mutatis mutandis to corresponding claim 21.

Consider claim 22, the previous rejections of claim 3 apply mutatis mutandis to corresponding claim 22.

Consider claim 26, the previous rejections of claim 1 apply mutatis mutandis to corresponding claim 26.

Consider claim 27, the previous rejections of claim 2 apply mutatis mutandis to corresponding claim 27.

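Consider claim 28, the previous rejections of claim 3 apply mutatis mutandis to corresponding claim 28.

Consider claim 32, the previous rejections of claim 1 apply mutatis mutandis to corresponding claim 32.
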
Consider claim 33, the previous rejections of claim 2 apply mutatis mutandis to corresponding claim 33.

Consider claim 34, the previous rejections of claim 3 apply mutatis mutandis to corresponding claim 34.

5.	Claims 4, 11, 17, 23, 29 and 35 are rejected under 35 U.S.C. 103 as being unpatentable over Gould in view of Dominik and further in view of Tomas Jakab et al. ("Unsupervised Learning of Object Landmarks through Conditional Image Generation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 20 June 2018 (2018-06-20), XP081432392; hereinafter, “Tomas”; provided by Applicant’s IDS filed on 08/31/2021).

Consider claim 4, Gould and Dominik in combination fail to teach wherein the one or more temporal pose representations comprise features represented by Gaussian heat maps and parameterized as a mean and a covariance.
However, Tomas teaches wherein the one or more temporal pose representations comprise features represented by Gaussian heat maps and parameterized as a mean and a covariance (pages 3-4, section 3.1, "Heatmaps bottleneck", i.e., Gaussian-like functions are used as heatmaps in the model Φ(x), which extracts keypoint-like structures from the image. The mean and the variance of the Gaussian distribution used can be derived from equation 2 of Tomas).
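For illustration only, and not taken from Tomas or from the claims, the following minimal Python sketch shows one way a mean and a covariance can be read off a Gaussian-like heat map as its first and second moments. The function names, the 64x64 grid, and the isotropic width sigma are assumptions introduced here.

```python
import numpy as np

def gaussian_heatmap(shape, mean_xy, sigma):
    """Render an isotropic Gaussian-like heat map (hypothetical helper)."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    sq_dist = (xs - mean_xy[0]) ** 2 + (ys - mean_xy[1]) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def heatmap_moments(heatmap):
    """Return the mean (x, y) and the 2x2 covariance of a non-negative heat map."""
    h = np.asarray(heatmap, dtype=float)
    weights = (h / h.sum()).ravel()            # normalize to a probability map
    ys, xs = np.mgrid[0:h.shape[0], 0:h.shape[1]]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    mean = weights @ coords                    # first moment
    centered = coords - mean
    cov = (centered * weights[:, None]).T @ centered  # second central moment
    return mean, cov

heat = gaussian_heatmap((64, 64), (20.0, 40.0), sigma=3.0)
mean, cov = heatmap_moments(heat)
```

Under these assumptions the recovered mean lies close to the rendered center and the covariance close to sigma squared times the identity matrix.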


Consider claim 11, the previous rejections of claim 4 apply mutatis mutandis to corresponding claim 11.

Consider claim 17, the previous rejections of claim 4 apply mutatis mutandis to corresponding claim 17.

Consider claim 23, the previous rejections of claim 4 apply mutatis mutandis to corresponding claim 23.

Consider claim 29, the previous rejections of claim 4 apply mutatis mutandis to corresponding claim 29.

Consider claim 35, the previous rejections of claim 4 apply mutatis mutandis to corresponding claim 35.

6.	Claims 6-7, 13, 19, 25, 31 and 37 are rejected under 35 U.S.C. 103 as being unpatentable over Gould in view of Dominik and further in view of Sambreka Prathmesh et al. ("Movie Frame Prediction Using Convolutional Long Short Term Memory", 2019 2nd International conference on Intelligent computing, Instrumentation and control technologies (ICICICT), IEEE, vol. 1, 5 July 2019 (2019-07-05), pages 1-5, XP033713411, DOI: 10.1109/ICICICT46008.2019.8993289, ISBN: 978-1-7281-0282-5; hereinafter, “Sambreka”).

Consider claim 6, Gould and Dominik in combination fail to teach wherein motion of the one or more temporal pose representations is modeled over time using a temporal encoder including one or more long short-term memory (LSTM) networks.
However, Sambreka teaches wherein motion of the one or more temporal pose representations is modeled over time using a temporal encoder including one or more long short-term memory (LSTM) networks (fig. 3, i.e., the information of a cropped object is first processed by a CNN. The output of the CNN is processed by two layers of LSTM networks. It is well known that CNNs are suitable for detecting features in images whereas LSTMs are used for sequence extrapolation in time).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Tomas into the combination of Gould and Dominik, in order to learn landmark detectors for visual objects (such as the eyes and the nose in a face) without any manual supervision.

Consider claim 7, Sambreka further teaches wherein the inferred frames are used to generate a video with a higher frame rate, fewer dropped frames, or additional content (fig. 3, i.e., the LSTM layer in the network extrapolates the features detected by the CNN for 10 image frames. The features of the LSTM layers can be extrapolated to predict missing parts of a time sequence. Therefore, it is obvious to use a neural network based on an LSTM cell to predict the pose or shape parameters for video frame prediction).

Consider claim 13, the previous rejections of claim 6 apply mutatis mutandis to corresponding claim 13.

Consider claim 19, the previous rejections of claim 6 apply mutatis mutandis to corresponding claim 19.

Consider claim 25, the previous rejections of claim 6 apply mutatis mutandis to corresponding claim 25.

Consider claim 31, the previous rejections of claim 6 apply mutatis mutandis to corresponding claim 31.

Consider claim 37, the previous rejections of claim 6 apply mutatis mutandis to corresponding claim 37.

Allowable Subject Matter

7.	Claims 5, 12, 18, 24, 30 and 36 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Consider claim 5, the prior arts made of record, alone or in combination, fail to clearly teach or fairly suggest wherein the one or more temporal pose representations and the time-invariant appearance representation are used to reconstruct an input video frame, and wherein a loss value resulting from comparing the input video frame with the reconstructed video frame is minimized by adjusting one or more network parameters of the one or more neural networks, in combination with other limitations, as specified in the independent claim 1 and dependent claim 2.

Consider claim 12, the prior arts made of record, alone or in combination, fail to clearly teach or fairly suggest wherein the one or more temporal pose representations and the time-invariant appearance representation are used to reconstruct an input video frame, and wherein a loss value resulting from comparing the input video frame with the reconstructed video frame is minimized by adjusting one or more network parameters of the one or more neural networks, in combination with other limitations, as specified in the independent claim 8 and dependent claim 9.

Consider claim 18, the prior arts made of record, alone or in combination, fail to clearly teach or fairly suggest wherein the one or more temporal pose representations and the time-invariant appearance representation are used to reconstruct an input video frame, and wherein a loss value resulting from comparing the input video frame with the reconstructed video frame is minimized by adjusting one or more network parameters of the one or more neural networks, in combination with other limitations, as specified in the independent claim 14 and dependent claim 15.

Consider claim 24, the prior arts made of record, alone or in combination, fail to clearly teach or fairly suggest wherein the one or more temporal pose representations and the time-invariant appearance representation are used to reconstruct an input video frame, and wherein a loss value resulting from comparing the input video frame with the reconstructed video frame is minimized by adjusting one or more network parameters of the one or more neural networks, in combination with other limitations, as specified in the independent claim 20 and dependent claim 21.

Consider claim 30, the prior arts made of record, alone or in combination, fail to clearly teach or fairly suggest wherein the one or more temporal pose representations and the time-invariant appearance representation are used to reconstruct an input video frame, and wherein a loss value resulting from comparing the input video frame with the reconstructed video frame is minimized by adjusting one or more network parameters of the one or more neural networks, in combination with other limitations, as specified in the independent claim 26 and dependent claim 27.

Consider claim 36, the prior arts made of record, alone or in combination, fail to clearly teach or fairly suggest wherein the one or more temporal pose representations and the time-invariant appearance representation are used to reconstruct an input video frame, and wherein a loss value resulting from comparing the input video frame with the reconstructed video frame is minimized by adjusting one or more network parameters of the one or more neural networks, in combination with other limitations, as specified in the independent claim 32 and dependent claim 33.

Conclusion


8.	Any response to this action should be mailed to:
Mail Stop_________ (Explanation, e.g., Amendment or After-final, etc.)
Commissioner for Patents
P.O. Box 1450
Alexandria, VA 22313-1450

Facsimile responses should be faxed to:
(571) 273-8300
Hand-delivered responses should be brought to:
Customer Service Window
Randolph Building
401 Dulany Street
Alexandria, VA 22313
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Tuan H. Nguyen, whose telephone number is (571) 272-8329. The examiner can normally be reached from 8:00 AM to 5:00 PM.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pan Yuwen, can be reached at (571) 272-7855. The fax phone number for the organization where this application or proceeding is assigned is (571) 273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR.
Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/TUAN H NGUYEN/
Primary Examiner, Art Unit 2649