DETAILED ACTION
Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
2.	Applicant’s amendments filed on November 03, 2021 have been entered. Claims 1, 3-4, and 8-19 have been amended. Claims 1-20 are pending in this application, with claims 1, 8 and 15 being independent.

Response to Arguments
3.	Applicant’s arguments, see page 12, filed November 03, 2021, with respect to the 101 rejections have been fully considered and are persuasive.  The amendments to the claims are sufficient to overcome the 101 rejection; thus the 101 rejections of these claims have been withdrawn.
4.	Applicant's arguments filed November 03, 2021, with respect to the 103 rejection have been fully considered but are moot in view of the new grounds of rejection. 
	Examiner notes that independent claims 1, 8 and 15 have been amended to include new limitations. Examiner finds these limitations to be unpatentable, as set forth in the detailed action below.
5.	On pages 10-12 of Applicant's Remarks, the Applicant argues that the dependent claims are not taught by the prior art, insomuch as they depend from claims that are not taught by the prior art. Examiner respectfully disagrees with these arguments, for the reasons discussed below.




Claim Rejections - 35 USC § 112
6.	The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

7.	Claims 1-7 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant, regards as the invention.
The Examiner finds the claim language informal and there are numerous issues.
Claim 1, lines 11-12, recites the limitation “the set of predicted 3D facial landmarks”. The limitation “predicted 3D facial landmarks” is previously introduced in claim 1. As such, the subsequent limitation either: (1) does not follow antecedent basis (i.e., “a set of predicted 3D facial landmarks”); or (2) is intended to be a new limitation which ambiguously conflicts with the previous limitation of claim 1. Accordingly, the metes and bounds of the claim are not clear. Therefore, claim 1 is rejected under 35 U.S.C. § 112(b), as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention. Claims 2-7 are also rejected under 35 U.S.C. § 112(b), based on their respective dependency from claim 1.


Claim Rejections - 35 USC § 103
8.	The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:
(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains.  Patentability shall not be negatived by the manner in which the invention was made.


9.	Claims 1, 4, 7-10, 13-16 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Moulton et al., (“Moulton”) [US-2002/0097380-A1] in view of Savchenkov et al., (“Savchenkov”) [US-2020/0234690-A1].
Regarding claim 1, Moulton discloses one or more computer storage media storing computer-useable instructions that, when used by one or more computing devices (Moulton- ¶0012 discloses database, a computer vision motion tracking system; ¶0074 discloses software processes; ¶0088 discloses computer system's memory), cause the one or more computing devices to perform operations comprising:
accessing an image of a head to animate with an audio signal of speech (Moulton- Fig. 1 and ¶0063 disclose Execute 2D image, the original actor screen image; ¶0023 discloses the image frames show sequential lip motion that is now visually synchronized to the new dub speech track; ¶0026 discloses acquire the motion dynamics of speech as mouth and lip motions. The dynamics of motion of the dub speaker are scaled to match the dynamic range of motion of the screen actor lips and jaw; ¶0034 discloses A 3D wireframe model of a human facial muzzle is overlayed and rectified to the source 2D footage on a frame by frame basis);
generating a plurality of animation frames, by, for each of the animation frames (Moulton- ¶0006 discloses analyzes screen actor speech to convert it into triphones and/or phonemes and then uses a time coded phoneme stream to identify corresponding visual facial motions of the jaw, lips, visible tongue and visible teeth. These single frame snapshots or multi-frame clips of facial motion ... then subsequently used to animate the original screen actor's ...):
generating, initial 3D facial landmarks extracted from the image (Moulton- ¶0022 discloses the viseme CV fixed reference control points are exactly registered to the original screen actor facial position for the eyes and nose position; ¶0037 discloses sets of points are tracked by the computer vision system to estimate their position on a frame to frame basis) and a sliding window of the audio signal (Moulton- ¶0064 discloses time synched to the speech tracks to be modified; ¶0065 discloses audio video tracks for the actor provides means to identify and select a set of visemes for the actor), a set of predicted 3D facial landmarks reflecting a corresponding portion of the head saying a portion of the speech in the sliding window (Moulton- ¶0011 discloses the actual motion path of a set of fixed reference points, while the mouth moves from one phoneme to another, is recorded and captured for the entire transformation between any set of different phonemes; ¶0014 discloses the estimated motion path for any reference point on the face, in conjunction with all the reference points and rates of relative motion change during any mouth shape transformation; ¶0017 discloses the points include key facial features, such as a constellation of points that outline the chin, the outside of the mouth and the inside edge of the lips; ¶0037 discloses the positions of fixed reference points on the face are estimated per frame by the computer vision system, and the motion paths these points take frame to frame during speech is recorded); and
generating the animation frame by transforming the image to fit the set of predicted 3D facial landmarks (Moulton- ¶0007 discloses the snapshot image states, or the short clip image sequences are used, as key frame facial speech motion image sets, respectively, and are interpolated for intermediate frames between key frames using optical morph techniques ... synthetically continuously animate facial motion by identifying the facial motions and interpolating between key frames of facial motion; ¶0011 discloses the actual motion path of a set of fixed reference points, while the mouth moves from one phoneme to another, is recorded and captured for the entire transformation between any set of different phonemes. As the mouth naturally speaks one speech phoneme and then alters its shape to speak another phoneme, the entire group of fixed reference points move and follow a particular relative course during any phoneme to phoneme transformation; ¶0022 discloses the viseme CV fixed reference control points are exactly registered to the original screen actor facial position for the eyes and nose position, to place them exactly to the head position, scaled to the correct size); and
compiling the plurality of animation frames with the audio signal into an animation of the head saying the speech (Moulton- ¶0023 discloses the image frames show sequential lip motion that is now visually synchronized to the new dub speech track; ¶0064 discloses an analysis stage, where the actor audio video track sound (FIG. 2, Block 900) is analyzed, the image frames 905 are analyzed ... and the actor head position and orientation (FIG. 2, Block 915) are analyzed and located in the image frames for all frames to be modified ... All annotations are synchronized or time synched to the speech tracks to be modified; ¶0065 discloses the set of visemes is derived by using the actor image frames which are uttering each phoneme which has a unique face and mouth shape; ¶0088 discloses the output actor screen image track that is synchronized to the new dub audio track can be initially stored in the computer system's memory).
Moulton does not explicitly disclose accessing a single image of a head to animate; generating, by a neural network evaluating initial 3D facial landmarks extracted from the single image; an audio feature vector encoding an audio chunk from a sliding window; generating the animation frame by warping the single image to fit the set of predicted 3D facial landmarks;
However, Savchenkov discloses
accessing a single image of a head to animate (Savchenkov- Fig. 1 and ¶0033 disclose the target image 125 may include at least a target face 140 […] The target face 140 may belong to the user 130 or a person other than the user 130);
generating, by a neural network evaluating initial 3D facial landmarks extracted from the single image (Savchenkov- ¶0008-0009 disclose the sequence of sets of mouth key points can be generated by a neural network […] The sequence of sets of facial key points can be generated by a neural network; Fig. 4 and ¶0040 disclose mouth key points and facial key points of a model face 400 […] a set of mouth key points includes 20 key points enumerated from 48 to 67 […] the facial key points may include the 20 mouth key points and additional facial landmarks around face shape, in regions of nose, eyes, and brows; Fig. 5 and ¶0043 disclose a three-dimensional (3D) face model 505; ¶0062-0063 disclose at each step, the neural network can use, as an input, a pre-determined number of sets of mouth key points generated for previous frames by the same neural network […] During the training the neural network, the mouth key points of training sets can be normalized on each frame independently using affine transformation; ¶0082 discloses the neural network for generating the sets of facial key points can be trained on a set of real videos recorded in a controlled environment and featuring a single actor speaking different predefined sentences. The single actor can be the same in all the videos);
an audio feature vector encoding an audio chunk from a sliding window (Savchenkov- ¶0058 discloses Each of the sets of acoustic (numerical) features 625 may be assigned a timestamp. Each of the sets of acoustic features may correspond to one of frames 345 in output video 340 […] the number of sets in the sequence of the sets of acoustic features can be determined based on a desired frame rate of the output video. Additionally, the number of sets can be also determined based on the desired duration of an audio representing the input text; ¶0060 discloses the vocoder 660 may apply a deterministic algorithm that decode set of speech parameters [audio feature vector] (a fundamental frequency (F0), a spectral envelope (SP), and aperiodicity (AP)) to produce an audio data (for example, a speech waveform as output). In other embodiments, the vocoder 660 may include a neural vocoder based on a neural network. The neural vocoder may decode a Mel-frequency cepstrum and produce the speech waveform; ¶0062 discloses the neural network can also use, as an input, sets of acoustic features corresponding to a fixed-length time window [an audio chunk from a window]. The time window may cover a pre-determined time before a current timestamp and a pre-determined time after the current timestamp [a sliding window]. The acoustic features can be presented in the form of Mel Frequency Cepstral Coefficients. The neural network may apply convolutions to the acoustic features and the key points to extract latent features and then to concatenate the latent features);
generating the animation frame by warping the single image (Savchenkov- ¶0048 discloses obtain a set of control points (facial landmarks), which can be robustly tracked through the scenario video […] The affine transformation can be further used to predict the location of additional control points [predicted 3D facial landmarks] in the target image 125. The sparse correspondence module 410 can be further configured to build a triangulation of the control points; ¶0052 discloses the image animation and refinement module 530 can be configured to animate a target image frame by frame [generating the animation frame]. For each set of facial key points in the sequence of sets of the scenario data, changes in positions of the control points can be determined. The changes in position of control points can be projected onto the target image 125 [the single image]. The module 530 can be further configured to build a warp field [warping]. The warp field can include a set of piecewise linear transformations induced by changes of each triangle in triangulation of the control points. The module 530 can be further configured to apply the warp field to the target image 125 and by doing so produce a frame of output video 340. Application of the warp field to an image can be performed relatively fast, which may allow the animation to be performed in real time);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moulton to incorporate the teachings of Savchenkov, and apply generating, by a neural network evaluating initial 3D facial landmarks extracted from the single image into Moulton’s system for accessing a single image of a head to animate with an audio signal of speech; generating a plurality of animation frames, by, for each of the animation frames: generating, by a neural network evaluating initial 3D facial landmarks extracted from the single image and an audio feature vector encoding an audio chunk from a sliding window of the audio signal, predicted 3D facial landmarks reflecting a corresponding portion of the head saying a portion of the speech in the sliding window; and generating the animation frame by warping the single image to fit the set of predicted 3D facial landmarks.
Doing so would enhance the systems and methods for text and audio-based real-time face reenactment.
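
For illustration only, the fixed-length time window of acoustic features quoted from Savchenkov ¶0062 (a pre-determined time before and after the current timestamp) can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from either reference or from the application: the function name `sliding_windows`, the parameters `before` and `after`, and the choice to clamp out-of-range indices to the nearest frame are all hypothetical.

```python
from typing import List, Sequence


def sliding_windows(
    features: Sequence[Sequence[float]], before: int, after: int
) -> List[List[Sequence[float]]]:
    """Return one window of per-frame acoustic features for every frame.

    The window for frame t covers frames t - before .. t + after.
    Indices outside the sequence are clamped to the nearest valid frame
    (an assumed padding choice, not taken from the cited references).
    """
    n = len(features)
    windows: List[List[Sequence[float]]] = []
    for t in range(n):
        window = [
            features[min(max(i, 0), n - 1)]
            for i in range(t - before, t + after + 1)
        ]
        windows.append(window)
    return windows


# Hypothetical one-dimensional features for four frames.
frames = [[0.0], [1.0], [2.0], [3.0]]
windows = sliding_windows(frames, before=1, after=1)
```

Each output frame then has exactly one window of `before + after + 1` feature sets, which mirrors the one-to-one correspondence between sets of acoustic features and frames described in the cited ¶0058.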

Regarding claim 4, Moulton in view of Savchenkov, discloses the one or more computer storage media of claim 1, and further discloses wherein the initial 3D facial landmarks represent a subset of body parts of the head, and wherein the animation of the head selectively animates and synchronizes motion of the subset of body parts based on the speech (Moulton- ¶0035 discloses tracks the (a) head position, (b) the facial motions for the jaw and (c) the lip motion during speech. (FIG. 1, Block 250). This develops a database of computer estimated positions of control reference points for the head as a whole and for the facial muzzle area including the jaw and lips; ¶0060 discloses the control points associated with the phoneme transition motion path of the lips, mouth and jaw).

Regarding claim 7, Moulton in view of Savchenkov, discloses the one or more computer storage media of claim 1, and further discloses wherein generating the set of predicted 3D facial landmarks is by a landmark predictor trained on a different identity than the head (Moulton- ¶0013 discloses If the training footage is of a screen actor, it permits a computer vision motion tracking system to learn the probable optical flow paths for each fixed reference point on the face for different types of mouth motions corresponding to phoneme transitions. These identified and recorded optical reference point flow paths during speech facial motions are recorded and averaged over numerous captured examples of the same motion transformations).

Regarding claim 8, Moulton discloses a computerized method (Moulton- ¶0008 discloses a method for accumulating an accurate database of learned motion paths of a speaker's face and mouth during speech, and applying it to directing facial animation during speech using visemes) comprising:
accessing a representation of a head to animate with an audio signal of speech (Moulton- Fig. 1 and ¶0063 discloses Execute 2D image, the original actor screen image; ¶0023 discloses the image frames show sequential lip motion that is now visually synchronized to the new dub speech track; ¶0026 discloses acquire the motion dynamics of speech as mouth and lip motions. The dynamics of motion of the dub speaker are scaled to match the dynamic range of motion of the screen actor lips and jaw; ¶0034 discloses a 3D wireframe model of a human facial muzzle is overlayed and rectified to the source 2D footage on a frame by frame basis);
generating, from the representation, a set of initial 3D facial landmarks of the head (Moulton- ¶0022 discloses the viseme CV fixed reference control points are exactly registered to the original screen actor facial position for the eyes and nose position, to place them exactly to the head position, scaled to the correct size; ¶0037 ...);
extracting, from each window of the audio signal, a portion of the audio signal in the window (Moulton- ¶0064 discloses time synched to the speech tracks to be modified; ¶0065 discloses audio video tracks for the actor provides means to identify and select a set of visemes for the actor);
generating, from the set of initial 3D facial landmarks, a corresponding set of a plurality of sets of predicted 3D facial landmarks reflecting a corresponding portion of the head saying a portion of the speech in the window (Moulton- ¶0011 discloses the actual motion path of a set of fixed reference points, while the mouth moves from one phoneme to another, is recorded and captured for the entire transformation between any set of different phonemes; ¶0014 discloses the estimated motion path for any reference point on the face, in conjunction with all the reference points and rates of relative motion change during any mouth shape transformation; ¶0017 discloses the points include key facial features, such as a constellation of points that outline the chin, the outside of the mouth and the inside edge of the lips; ¶0022 discloses the viseme CV fixed reference control points are exactly registered to the original screen actor facial position for the eyes and nose position; ¶0037 discloses the positions of fixed reference points on the face are estimated per frame by the computer vision system, and the motion paths these points take frame to frame during speech is recorded); and
generating, from the plurality of sets of predicted 3D facial landmarks, an animation of the head saying the speech (Moulton- ¶0023 discloses the image frames show sequential lip motion that is now visually synchronized to the new dub speech track; ¶0064 discloses an analysis stage, where the actor audio video track sound (FIG. 2, Block 900) is analyzed, the image frames 905 are analyzed ... and the actor head position and orientation (FIG. 2, Block 915) are analyzed and located in the image frames for all frames to be modified ...).
Moulton does not explicitly disclose accessing a single representation of a head to animate; generating, from the single representation, a set of initial 3D facial landmarks representing a rest pose of the head; extracting, from each window of the audio signal, an audio feature vector encoding a portion of the audio signal in the window; generating, from the audio feature vector for each window;
However, Savchenkov discloses
accessing a single representation of a head to animate (Savchenkov- Fig. 1 and ¶0033 disclose the target image 125 may include at least a target face 140 […] The target face 140 may belong to the user 130 or a person other than the user 130; Fig. 9 and ¶0085-0087 disclose a target face 140 to be animated and an input text 160. The mobile application may generate, using the face reenactment system 220, audio data (sound waveform) for the input text 160 and a video animating the target face 140 in the target image 125 […] A vocoder may generate an audio data to match the input text 160. The face reenactment system 220 may generate an animation of the target face 140 to match the audio data. In further embodiments, the mobile device may allow to enter an input text and/or audio via an audio or video recording);
generating, from the single representation, a set of initial facial landmarks representing a rest pose of the head (Savchenkov- Fig. 4 and ¶0040 disclose a set of mouth key points includes 20 key points enumerated from 48 to 67. The mouth key points 48-67 are located substantially around a mouth region of the model face 400. In some embodiments, the facial key points may include the 20 mouth key points and additional facial landmarks around face shape, in regions of nose, eyes, and brows. In example of FIG. 4, the number of facial key points is 78. The facial key points are enumerated from 0 to 77 […] The facial key points and mouth key points may correspond to particular facial landmarks (for example, a corner of a brow, a corner of an eye, a corner of mouth, a bottom of a chin, and so forth));
an audio feature vector encoding a portion of the audio signal in the window (Savchenkov- ¶0058 discloses Each of the sets of acoustic (numerical) features 625 may be assigned a timestamp. Each of the sets of acoustic features may correspond to one of frames 345 in output video 340 […] the number of sets in the sequence of the sets of acoustic features can be determined based on a desired frame rate of the output video. Additionally, the number of sets can be also determined based on the desired duration of an audio representing the input text; ¶0060 discloses the vocoder 660 may apply a deterministic algorithm that decode set of speech parameters [audio feature vector] (a fundamental frequency (F0), a spectral envelope (SP), and aperiodicity (AP)) to produce an audio data (for example, a speech waveform as output). In other embodiments, the vocoder 660 may include a neural vocoder based on a neural network. The neural vocoder may decode a Mel-frequency cepstrum and produce the speech waveform; ¶0062 discloses the neural network can also use, as an input, sets of acoustic features corresponding to a fixed-length time window [audio signal in the window]. The time window may cover a pre-determined time before a current timestamp and a pre-determined time after the current timestamp [window]. The acoustic features can be presented in the form of Mel Frequency Cepstral Coefficients. The neural network may apply convolutions to the acoustic features and the key points to extract latent features and then to concatenate the latent features);
generating, from the audio feature vector for each window (Savchenkov- ¶0058 discloses Each of the sets of acoustic (numerical) features 625 may be assigned a timestamp. Each of the sets of acoustic features may correspond to one of frames 345 [window] in output video 340 […] the number of sets in the sequence of the sets of acoustic features can be determined based on a desired frame rate of the output video. Additionally, the number of sets can be also determined based on the desired duration of an audio representing the input text; ¶0060 discloses the vocoder 660 may apply a deterministic algorithm that decode set of speech parameters (a fundamental frequency (F0), a spectral envelope (SP), and aperiodicity (AP)) to produce an audio data (for example, a speech waveform as output); ¶0062 discloses the neural network can also use, as an input, sets of acoustic features corresponding to a fixed-length time window [window]. The time window may cover a pre-determined time before a current timestamp and a pre-determined time after the current timestamp [window]);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moulton to incorporate the teachings of Savchenkov, and apply accessing a single representation of a head to animate into Moulton’s system for accessing a single representation of a head to animate with an audio signal of speech; generating, from the single representation, a set of initial 3D facial landmarks representing a rest pose of the head; extracting, from each window of the audio signal, an audio feature vector encoding a portion of the audio signal in the window; generating, from the audio feature vector for each window and the set of initial 3D facial landmarks, a corresponding set of a plurality of sets of predicted 3D facial landmarks reflecting a corresponding portion of the head saying a portion of the speech in the window.
Doing so would enhance the systems and methods for text and audio-based real-time face reenactment.

Regarding claim 9, Moulton in view of Savchenkov, discloses the computerized method of claim 8, and further discloses wherein the single representation of the head is an image, a 3D mesh, a 3D rig, or 2D layered artwork (Moulton- ¶0065 discloses a 3D wire frame mesh model of the actor's head is generated; Savchenkov- Fig. 1 and ¶0033 disclose the target image 125 [single representation of the head is an image] may include at least a target face 140 […] The target face 140 may belong to the user 130 or a person other than the user 130).

The same motivation that was utilized in the rejection of claim 8 applies equally to this claim.

Regarding claim 10, Moulton in view of Savchenkov, discloses the computerized method of claim 8, and further discloses wherein the single representation of the head is a single image (Moulton- ¶0065 discloses a 3D wire frame mesh model of the actor's head is generated; Savchenkov- Fig. 1 and ¶0033 disclose the target image 125 [single representation of the head is a single image] may include at least a target face 140 […] The target face 140 may belong to the user 130 or a person other than the user 130), and wherein generating the animation comprises warping the single image to the set of predicted 3D facial landmarks for each window (Savchenkov- ¶0048 discloses obtain a set of control points (facial landmarks), which can be robustly tracked through the scenario video […] The affine transformation can be further used to predict the location of additional control points [the set of predicted 3D facial landmarks] in the target image 125. The sparse correspondence module 410 can be further configured to build a triangulation of the control points; ¶0052 discloses the image animation and refinement module 530 can be configured to animate a target image frame by frame [generating the animation]. For each set of facial key points in the sequence of sets of the scenario data, changes in positions of the control points can be determined. The changes in position of control points can be projected onto the target image 125 [the single image]. The module 530 can be further configured to build a warp field [warping]. The warp field can include a set of piecewise linear transformations induced by changes of each triangle in triangulation of the control points. The module 530 can be further configured to apply the warp field to the target image 125 and by doing so produce a frame of output video 340. Application of the warp field to an image can be performed relatively fast, which may allow the animation to be performed in real time; ¶0058 discloses each of the sets of acoustic (numerical) features 625 may be assigned a timestamp. Each of the sets of acoustic features may correspond to one of frames 345 [window] in output video 340);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moulton to incorporate the teachings of Savchenkov, and apply warping the single image into Moulton’s system such that generating the animation comprises warping the single image to the set of predicted 3D facial landmarks for each window.
The same motivation that was utilized in the rejection of claim 8 applies equally to this claim.
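
For illustration only, the warp field quoted from Savchenkov ¶0052 (a set of piecewise linear transformations induced by changes of each triangle in a triangulation of control points) can be sketched for a single triangle as follows. This is a minimal sketch under stated assumptions and is not code from either reference or from the application: the helper names `barycentric` and `warp_point`, and the use of barycentric coordinates to express the per-triangle map, are hypothetical choices made for the example.

```python
from typing import Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


def barycentric(p: Point, tri: Triangle) -> Tuple[float, float, float]:
    """Barycentric weights of point p with respect to triangle tri."""
    (ax, ay), (bx, by), (cx, cy) = tri
    px, py = p
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError("degenerate triangle")
    w_a = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    w_b = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return w_a, w_b, 1.0 - w_a - w_b


def warp_point(p: Point, src: Triangle, dst: Triangle) -> Point:
    """Map p from the source triangle to the destination triangle.

    The weights computed in the source triangle are reused on the moved
    vertices, which is the affine map induced by the vertex displacement.
    """
    weights = barycentric(p, src)
    x = sum(w * v[0] for w, v in zip(weights, dst))
    y = sum(w * v[1] for w, v in zip(weights, dst))
    return (x, y)


# Hypothetical control points: the triangle is scaled by two about the origin.
src_tri: Triangle = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
dst_tri: Triangle = ((0.0, 0.0), (2.0, 0.0), (0.0, 2.0))
moved = warp_point((0.25, 0.25), src_tri, dst_tri)
```

Applying such a map to every pixel inside every triangle of the triangulation would yield one warped frame; a full warp would also need to locate the containing triangle for each pixel, which this sketch omits.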

The methods of claims 13-14 are similar in scope to the functions performed by the computer storage media of claims 4 and 7, and therefore claims 13-14 are rejected under the same rationale.

Regarding claim 15, Moulton discloses a computer system (Moulton- ¶0012 discloses a computer vision motion tracking system) comprising:
one or more hardware processors and memory configured to provide computer program instructions to the one or more hardware processors (Moulton- ¶0012 discloses a computer vision motion tracking system; ¶0074 discloses software processes; ¶0082 discloses a radar data processor; ¶0088 discloses computer system's memory);
a landmark predictor configured to use the one or more hardware processors (Moulton- ¶0018 discloses the computer vision tracks and estimates the position of the control points mapped to the mouth as they move in the target production footage) to generate, from a representation of a head to animate (Moulton- Fig. 1 and ¶0063 disclose execute 2D image, the original actor screen image; ¶0023 discloses the image frames show sequential lip motion that is now visually synchronized to the new dub speech track) and an audio signal of speech (Moulton- ¶0026 discloses the dynamics of motion of the dub speaker are scaled to match the dynamic range of motion of the screen actor lips and jaw), a plurality of sets of predicted 3D facial landmarks for the head (Moulton- ¶0011 discloses the actual motion path of a set of fixed reference points, while the mouth moves from one phoneme to another; ¶0014 discloses the estimated motion path for any reference point on the face, in conjunction with all the reference points and rates of relative motion change during any mouth shape transformation; ¶0017 discloses the points include key facial features, such as a constellation of points that outline the chin, the outside of the mouth and the inside edge of the lips; ¶0022 discloses the viseme CV fixed reference control points are exactly registered to the original screen actor facial position for the eyes and nose position; ¶0037 discloses the positions of fixed reference points on the face are estimated per frame); and
an animation compiler configured to use the one or more hardware processors (Moulton- ¶0018 discloses the computer vision tracks; ¶0082 discloses a radar data processor) to generate, based on the plurality of sets of predicted 3D facial landmarks, an animation of the head saying the speech (Moulton- ¶0023 discloses the image frames show sequential lip motion that is now visually synchronized to the new dub speech track; ¶0064 discloses an analysis stage, where the actor audio video track sound (FIG. 2, Block 900) is analyzed, the image frames 905 are analyzed ... and the actor head position and orientation (FIG. 2, Block 915) are analyzed and located in the image frames for all frames to be modified ... All annotations are synchronized or time synched to the speech tracks to be modified; ¶0065 discloses the set of visemes is derived by using the actor image frames which are uttering each phoneme which has a unique face and mouth shape; ¶0088 discloses the output actor screen image track that is synchronized to the new dub audio track).
Moulton does not explicitly disclose generate, from a single representation of a head to animate and audio encodings of successive windows of an audio signal of speech.
However, Savchenkov discloses
generate, from a single representation of a head to animate (Savchenkov- Fig. 1 and ¶0033 disclose the target image 125 may include at least a target face 140 [a single representation of a head] […] The target face 140 may belong to the user 130 or a person other than the user 130; Fig. 9 and ¶0085-0087 disclose a target face 140 to be animated and an input text 160. The mobile application may generate, using the face reenactment system 220, audio data (sound waveform) for the input text 160 and a video animating the target face 140 in the target image 125 […] A vocoder may generate an audio data to match the input text 160. The face reenactment system 220 may generate an animation of the target face 140 to match the audio data. In further embodiments, the mobile device may allow to enter an input text and/or audio via an audio or video recording) and audio encodings of successive windows of an audio signal of speech (Savchenkov- ¶0058 discloses Each of the sets of acoustic (numerical) features 625 may be assigned a timestamp. Each of the sets of acoustic features may correspond to one of frames 345 in output video 340 […] the number of sets in the sequence of the sets of acoustic features can be determined based on a desired frame rate of the output video. Additionally, the number of sets can be also determined based on the desired duration of an audio representing the input text; ¶0060 discloses the vocoder 660 […] to produce an audio data [audio encodings] (for example, a speech waveform as output). In other embodiments, the vocoder 660 may include a neural vocoder based on a neural network; ¶0062 discloses the neural network can also use, as an input, sets of acoustic features corresponding to a fixed-length time window [window]. The time window may cover a pre-determined time before a current timestamp and a pre-determined time after the current timestamp [successive windows of an audio signal of speech]);
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moulton to incorporate the teachings of Savchenkov, and apply a single representation of a head and audio encodings of successive windows into 
The same motivation that was utilized in the rejection of claim 1 applies equally to this claim.
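For illustration only, the fixed-length time window described in Savchenkov ¶0062 (a pre-determined time before and after a current timestamp) can be sketched as follows. This is a minimal sketch and not Savchenkov's implementation: the function name `context_windows`, the parameters `before` and `after`, and the choice to repeat the first or last feature set at the sequence edges are all assumptions made here so that every window keeps the same length.

```python
def context_windows(feature_sets, before, after):
    """Return one fixed-length window of feature sets per timestamp.

    feature_sets: list with one entry per timestamp (e.g. per video frame).
    before/after: hypothetical counts of entries taken before and after
    the current timestamp. Edge handling (repeating the first or last
    entry) is an assumption, not something the cited reference specifies.
    """
    last = len(feature_sets) - 1
    windows = []
    for t in range(len(feature_sets)):
        # Clamp out-of-range indices so each window has before + 1 + after entries.
        indices = [min(max(i, 0), last) for i in range(t - before, t + after + 1)]
        windows.append([feature_sets[i] for i in indices])
    return windows


# Hypothetical example: five timestamped feature sets, labeled f0..f4.
features = ["f0", "f1", "f2", "f3", "f4"]
windows = context_windows(features, before=2, after=1)
```

Successive windows produced this way share most of their entries, which is the sense in which the windows advance one timestamp at a time over the speech signal.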

The computer system of claim 16 is similar in scope to the functions performed by the computerized method of claim 9 and therefore claim 16 is rejected under the same rationale.

Regarding claim 19, Moulton in view of Savchenkov, discloses the computer system of claim 15, and further discloses wherein the landmark predictor is configured to generate the plurality of sets of predicted 3D facial landmarks based on a representation of a subset of body parts identified from the single representation of the head, and wherein the animation of the head selectively animates and synchronizes motion of the subset of body parts based on the speech (Moulton- ¶0018 discloses the computer vision tracks and estimates the position of the control points mapped to the mouth as they move in the target production footage; ¶0035 discloses tracks the (a) head position, (b) the facial motions for the jaw and (c) the lip motion during speech. (FIG. 1, Block 250). This develops a database of computer estimated positions of control reference points for the head as a whole and for the facial muzzle area including the jaw and lips; ¶0060 discloses the control points associated with the phoneme transition motion path of the lips, mouth and jaw; Savchenkov- Fig. 1 and ¶0033 disclose the target image 125 may include at least a target face 140 [single representation of the head] […] The target face 140 may belong to the user 130 or a person other than the user 130; Fig. 9 and ¶0085-0087 disclose a target face 140 to be animated and an input text 160. The mobile application may generate, using the face reenactment system 220, audio data (sound waveform) for the input text 160 and a video animating the target face 140 in the target image 125 […] A vocoder may generate an audio data to match the input text 160. The face .
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Moulton to incorporate the teachings of Savchenkov, and apply a single representation of a head into Moulton’s system in order to generate the plurality of sets of predicted 3D facial landmarks based on a representation of a subset of body parts identified from the single representation of the head.
The same motivation that was utilized in the rejection of claim 15 applies equally to this claim.

The computer system of claim 20 is similar in scope to the functions performed by the computer storage media of claim 7 and therefore claim 20 is rejected under the same rationale.


10.	Claims 2-3, 11-12 and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Moulton et al., (“Moulton”) [US-2002/0097380-A1] in view of Savchenkov et al., (“Savchenkov”) [US-2020/0234690-A1], further in view of “VisemeNet: Audio-Driven Animator-Centric Speech Animation” by Zhou et al., (“Zhou”)
Regarding claim 2, Moulton in view of Savchenkov, discloses the one or more computer storage media of claim 1, and the prior art does not explicitly disclose, but Zhou discloses wherein generating the set of predicted 3D facial landmarks is based on one of a plurality of speaking styles automatically detected from the speech (Zhou- page 161:1, left column, speech styles like mumbling or shouting are strongly co-related to the motion of facial landmarks; page 161:3, section Speech style prediction teaches style attributes of the audio performance can be captured using jaw and lip parameters ... The "landmark stage" in .
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have combined the teachings of Moulton/Savchenkov with Zhou and apply the detected speech styles to the set of reference points, as taught by Moulton/Savchenkov, so that generating the set of predicted 3D facial landmarks is based on one of a plurality of speaking styles automatically detected from the speech. 
Doing so would allow the faces in the video clips to be accurately detected.

Regarding claim 3, Moulton in view of Savchenkov, discloses the one or more computer storage media of claim 1, and the prior art does not explicitly disclose, but Zhou discloses wherein generating the set of predicted 3D facial landmarks is based on an input selecting a particular speaking style from a plurality of speaking styles, and setting a landmark predictor to generate the set of predicted 3D facial landmarks using the particular speaking style (Zhou- page 161:1, left column, speech styles like mumbling or shouting are strongly co-related to the motion of facial landmarks; page 161:3, section Speech style prediction teaches style attributes of the audio performance can be captured using jaw and lip parameters ... The "landmark stage" in Figure 3 (bottom-left box) is designed to predict a set of jaw and lip landmark positions over time given input audio; page 161:4, left column teaches the faces in the video clips can be accurately detected and annotated with landmarks ... The extracted facial landmarks corresponding to the speech audio are then useful to train the landmark stage).
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have combined the teachings of Moulton/Savchenkov 
Doing so would allow the faces in the video clips to be accurately detected.

The methods of claims 11-12 are similar in scope to the functions performed by the computer storage media of claims 2-3 and therefore claims 11-12 are rejected under the same rationale.

The computer systems of claims 17-18 are similar in scope to the functions performed by the computer storage media of claims 2-3 and therefore claims 17-18 are rejected under the same rationale.


11.	Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Moulton et al., (“Moulton”) [US-2002/0097380-A1] in view of Savchenkov et al., (“Savchenkov”) [US-2020/0234690-A1], further in view of Deller et al., (“Deller”) [US-10,127,908-B1]
Regarding claim 5, Moulton in view of Savchenkov, discloses the one or more computer storage media of claim 1, and further discloses wherein the operations further comprise: initially training a landmark predictor to simulate a base set of facial dynamics captured in a first training video (Moulton- ¶0015 discloses an emotional capture elicitation process is effected by having the actor get in the mood of a list of different basic emotional expressions. The actor then performs and visually records examples of those emotional expressions and changes ... Examples are recorded as static image positions of the face, and ;
transforming the landmark predictor to simulate the supplemental set of facial dynamics by training the landmark predictor with the supplemental training video (Moulton- ¶0011 discloses the actual motion path of a set of fixed reference points, while the mouth moves from one phoneme to another, is recorded and captured for the entire transformation between any set of different phonemes. As the mouth naturally speaks one speech phoneme and then alters its shape to speak another phoneme, the entire group of fixed reference points move and follow a particular relative course during any phoneme to phoneme transformation; ¶0019 discloses each facial image viseme represents a specific facial expressive primitive of the voice for different phonemes. The morph target transformation subsystem has learned, from the computer vision analysis on a real actor, to acquire and tune the right optical path transform for each different actor expressive transformation. These transformations can include speech and emotional expression, and idiosyncratic gesture based expressions).
The prior art does not explicitly disclose, but Deller discloses
receiving a supplemental training video capturing a supplemental set of facial dynamics (Deller- col 8 lines 18-50 discloses accessing control information associated with the main content and/or supplemental content associated with the main content. .. The orchestration component 128 can also send a second instruction to the accessory device 106 to begin processing the control information or to begin playback of the supplemental content (e.g., video content) via the accessory device 106 at the time specified in the second instruction that corresponds to a time when the voice-controlled device 104 is instructed to begin playback of the main content (e.g., audio content); col 9 lines 46-51 discloses Supplemental content output on the display 113 of the accessory device 106 may comprise video content, such as a music video; col 28 lines 1-3 discloses viseme information being correlated with the lip synch mode of ;
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have combined the teachings of Moulton/Savchenkov with Deller and apply the lip synch mode in the supplemental video content into training a landmark predictor, as taught by Moulton/Savchenkov for transforming the landmark predictor to simulate the supplemental set of facial dynamics by training the landmark predictor with the supplemental training video. 
Doing so would make it possible to effectively find audio data.


12.	Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Moulton et al., (“Moulton”) [US-2002/0097380-A1] in view of Savchenkov et al., (“Savchenkov”) [US-2020/0234690-A1], further in view of “Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion” by Karras et al., (“Karras”)
Regarding claim 6, Moulton in view of Savchenkov, discloses the one or more computer storage media of claim 1, and further discloses wherein the sliding window is configured to capture audio chunks from the audio signal (see Claim 1 rejection for detailed analysis).
The prior art does not explicitly disclose, but Karras discloses
capture partially overlapping audio chunks from the audio signal (Karras- page 94:4, section 3.2 Audio processing, 4th paragraph discloses The input audio window is divided into 64 audio frames with 2x overlap, so that each frame corresponds to 16ms (256 samples) and consecutive frames are located 8ms (128 samples) apart).

Doing so would drive 3D facial animation by audio input in real time and with low latency.
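For illustration only, the framing arithmetic quoted from Karras above (64 frames of 256 samples, with consecutive frames 128 samples apart) can be checked with a short sketch. The 16 kHz sample rate is not stated in the quoted passage; it is inferred here from "16ms (256 samples)". The helper name `split_into_frames` and the constant names are hypothetical, and the code is not taken from Karras.

```python
def split_into_frames(samples, frame_len, hop):
    """Split a sequence into frames of frame_len samples, hop samples apart."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


SAMPLE_RATE_HZ = 16_000   # assumption: implied by 256 samples == 16 ms
FRAME_LEN = 256           # 16 ms per frame
HOP = 128                 # 8 ms between consecutive frames (2x overlap)
NUM_FRAMES = 64

# Window length needed to hold 64 frames at this hop size.
WINDOW_LEN = (NUM_FRAMES - 1) * HOP + FRAME_LEN
WINDOW_MS = 1000 * WINDOW_LEN / SAMPLE_RATE_HZ

# Use sample indices as stand-in audio so the overlap is easy to inspect.
window = list(range(WINDOW_LEN))
frames = split_into_frames(window, FRAME_LEN, HOP)
```

With these figures, each frame shares its second half with the first half of the next frame, which is what "partially overlapping audio chunks" means in this framing scheme.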


Conclusion
13.	Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
14.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL LE whose telephone number is (571)272-5330. The examiner can normally be reached 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Xiao Wu can be reached on (571) 272-7761. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/MICHAEL LE/Primary Examiner, Art Unit 2619