DETAILED ACTION
Introduction
This office action is in response to Applicant’s submission filed on 09/15/2022. Claims 1-5, 12-16 and 20 are pending in the application and have been examined.
	
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The response filed on 09/15/2022 has been accepted and considered in this Office Action. Claims 1-5, 12-16 and 20 have been examined. Claims 6-11 and 17-19 have been cancelled. Applicant’s amendments to claims 1, 12 and 20, which recite processing of the voice and the shape of the lips corresponding to the voice and are supported by the Specification at [0059], overcome the 35 U.S.C. 101 rejections previously set forth in the Non-Final Office Action mailed 06/23/2022. Dependent claims 2-5 and 13-16 overcome the 35 U.S.C. 101 rejections previously set forth in the Non-Final Office Action mailed 06/23/2022 based on their dependency on amended claims 1 and 12, respectively. Therefore, the above referenced rejections under 35 U.S.C. 101 are withdrawn.
Response to Arguments
Applicant's arguments filed 09/15/2022 have been fully considered as follows:
Applicant’s arguments with respect to claim 1 state that
“Fan… fails to disclose the specific feature for synthesizing the voice and the lip-shape image sequence corresponding to the voice to obtain a virtual character video corresponding to the voice of the claimed invention.”
	
The examiner respectfully disagrees. Fan teaches “Finally, the predicted AAM visual parameter sequence V can be reconstructed to high quality photo realistic face images and rendering the full face talking head with lip-synced animation” (see Fan, pg. 4885, sect. 2 & Fig. 1). During training, the neural network model is trained to predict the lip-shape image corresponding to each text and audio frame. During synthesis, once the phoneme labels are extracted from the text and audio, the visual feature sequence for the corresponding phoneme labels is predicted and used to construct the lip-synced animation of the talking head, i.e., the virtual character video corresponding to the voice. Therefore, Fan teaches synthesizing the voice and the lip-shape image sequence corresponding to the voice to obtain a virtual character video corresponding to the voice, and the rejections of claims 1, 5 and 20 under 35 U.S.C. 102 are sustained and further updated accordingly.
Applicant’s further arguments with respect to claim 1 state that
“Fan fails to disclose the process of clustering the collected lip shape images of the lip shape library based on the distances between the lip-shape key points such that the images with similar distances between the lip-shape key points are clustered into one cluster.”

Applicant’s arguments above with respect to claim 1 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Regarding the art rejections of the remaining dependent claims under 35 U.S.C. 102 or 35 U.S.C. 103, in case said claims are discussed and/or argued for at least the same rationale presented in the Remarks filed 09/15/2022, Examiner respectfully notes as follows. For completeness, should the mentioned claims be likewise traversed for reasons similar to those presented for independent claims 1 and 5, Examiner respectfully directs Applicant to the reasons provided above in the response directed to claims 1 and 5. For at least those same reasons, Examiner likewise respectfully disagrees; Applicant's arguments have been fully considered but are not persuasive.
Claim Rejections - 35 USC § 103
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1-5, 12-16 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over B. Fan, L. Wang, F. K. Soong and L. Xie, "Photo-real talking head with deep bidirectional LSTM," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4884-4888 in view of Aoyama et al., US Patent Application Publication 2010/0332229.
Regarding claim 1, Fan teaches a computer-implemented method for determining (see Fan, pg. 4885, sect. 3.1, for each speech utterance, we convert the phoneme/state sequence and their time offset into a label sequence, denoting as L=(l1,…,lt,…,lT), where T is the number of frames in the sequence); determining lip-shape key point information corresponding to each phoneme in the phoneme sequence (see Fan, pg. 4886, sect. 3.2, Subsequently, the lower face sequence with T frames can be represented by the visual feature sequence V=(v1,…,vT). See Fan, pg. 4886, sect. 4.1, In our BLSTM network, as shown in Fig. 3, label sequence L is the input layer, and visual feature sequence V serves as the output layer and H denotes the hidden layer); searching a pre-established lip shape library according to each piece of determined lip-shape key point information, so as to obtain a lip shape image of each phoneme (see Fan, pg. 4886, sect. 4.1, In the training stage, we have multiple sequence pairs of L and V. As we represent both sequences as continuous numerical vectors, the network is treated as a regression model minimizing the SSE of predicting V from L. In the test (or synthesis) stage, given any arbitrary text along with natural or synthesized speech, we firstly convert them into a sequence of labels, then feed into the trained BLSTM network, and the output of the network is the predicted visual AAM feature sequence; the training sequence V is interpreted as pre-established lip shape library); (see Fan, pg. 4885, sect. 2, In the synthesis stage, for any input text with natural or synthesized speech by text-to-speech (TTS), we first extract the label sequence L from them and then predict the visual AAM parameters V using the well trained deep BLSTM network); synthesizing the voice and the lip-shape image sequence corresponding to the voice to obtain a virtual character video corresponding to the voice (see Fan, pg. 4885, sect. 2 & Fig. 1, In the synthesis stage, for any input text with natural or synthesized speech by text-to-speech (TTS), we first extract the label sequence L from them and then predict the visual AAM parameters V using the well trained deep BLSTM network. Finally, the predicted AAM visual parameter sequence V can be reconstructed to high quality photo realistic face images and rendering the full face talking head with lip-synced animation; lip-synced animation of talking head is interpreted as virtual character video corresponding to voice); and playing the virtual character video on a terminal device (see Fan, pg. 4887, sect. 5.4, For each test sequence, the two talking head videos were played side-by-side randomly with original speech. (interpreted as on a terminal device)), wherein the lip shape library comprises various lip shape images and lip-shape key point information corresponding to the lip shape images (see Fan, pg. 4885, sect. 3.2, AAM Visual Feature V discusses lower face sequence with T frames can be represented by the visual feature sequence V=(v1,…,vT) where V is extracted from the shape and texture of the image of the lower face; V is interpreted as the lip shape library comprising various lip shape images and lip-shape key point information): collecting lip shape images of a real person in a speaking process in advance (see Fan, pg. 4884, sect. 2, Firstly, an audio/visual database of a subject talking to a camera with frontal view of his/her face is recorded as our training data); wherein the lip-shape key point information comprises information of distances between the key points (see Fan, pg. 4885, sect. 3.2 & Fig. 2(a), The shape of the j-th lower face, sj, can be represented by the concatenation of the x and y coordinates of N facial feature points: sj=(xj1, xj2, …, xjN, yj1, yj2, …, yjN) where j=1,2,…,J and J is the total number of the face images; x and y coordinates interpreted as the distances between key points).
However, Fan fails to teach clustering the collected lip shape images based on the lip-shape key point information; and selecting one lip shape image and the lip-shape key point information corresponding to the lip shape image from each cluster to construct the lip shape library, and clustering is based on the distances between the lip-shape key points such that the images with similar distances between the lip-shape key points are clustered into one cluster, and the shapes of the lips in one cluster are similar.
Aoyama, however, teaches clustering the collected lip shape images based on the lip-shape key point information (see Aoyama, [0059], [0066-0067], [0107], The lip area detecting unit 23 detects a lip area, and outputs position information of the lip area of each frame to the lip image generating unit 24 together with the utterance moving image for learning. The learning sample storing unit 29 stores a plurality of lip images with added viseme labels (hereinafter, referred to as lip images with viseme labels) as learning samples. More specifically, as shown in FIG. 4, the M number of learning samples (xi, yk) in a state that a class label yk (k=1, 2, . . . , K) corresponding to a viseme label is assigned to M pieces of lip image xi (i=1, 2, . . . , M); class label is interpreted as clustering the collected lip shape images); and selecting one lip shape image and the lip-shape key point information corresponding to the lip shape image from each cluster to construct the lip shape library (see Aoyama, Fig. 3, [0064-0065], [0107], which teach converting phonemes to visemes; this is interpreted as selecting one lip shape image and lip-shape key point (viseme) from the cluster), and clustering is based on the distances between the lip-shape key points such that the images with similar distances between the lip-shape key points are clustered into one cluster, and the shapes of the lips in one cluster are similar (see Aoyama, [0059], [0066-0067], [0107], The lip area detecting unit 23 detects a lip area, and outputs position information of the lip area of each frame to the lip image generating unit 24 together with the utterance moving image for learning. The learning sample storing unit 29 stores a plurality of lip images with added viseme labels (hereinafter, referred to as lip images with viseme labels) as learning samples. More specifically, as shown in FIG. 4, the M number of learning samples (xi, yk) in a state that a class label yk (k=1, 2, . . . , K) corresponding to a viseme label is assigned to M pieces of lip image xi (i=1, 2, . . . , M); class label is interpreted as clustering the collected lip shape images, position information of lip area is interpreted as lip-shape key points, and class label of same viseme is interpreted as shapes of lips in cluster are similar, as shown in Aoyama Fig. 4).
Fan and Aoyama are considered to be analogous to the claimed invention because they relate to audio visual modeling systems. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Fan on visual speech synthesis using DNNs with the Audio Visual Speech Recognition teachings of Aoyama to improve voice recognition technique under a noisy environment (see Aoyama, [0008]).
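For illustration only, the clustering limitation of claim 1 discussed above (grouping lip shape images whose key-point distances are similar, then keeping one image per cluster as a library entry) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Fan, Aoyama, or Applicant's Specification: the greedy threshold clustering, the medoid selection, the function names, the `threshold` parameter, and the sample key points are all hypothetical.

```python
import math
from itertools import combinations


def distance_features(key_points):
    """Pairwise Euclidean distances between key points, in a fixed order."""
    return [math.dist(p, q) for p, q in combinations(key_points, 2)]


def cluster_by_distance(features, threshold):
    """Greedy clustering (an assumed stand-in for any clustering method):
    an image joins the first cluster whose first member is within threshold."""
    clusters = []  # each cluster is a list of image indices
    for i, feat in enumerate(features):
        for members in clusters:
            if math.dist(feat, features[members[0]]) <= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no close cluster found: start a new one
    return clusters


def pick_representative(members, features):
    """Medoid: the member with the smallest total distance to its cluster."""
    return min(
        members,
        key=lambda i: sum(math.dist(features[i], features[j]) for j in members),
    )


def build_lip_library(images, threshold):
    """Keep one image and its key-point distance information per cluster."""
    features = [distance_features(img["key_points"]) for img in images]
    library = []
    for members in cluster_by_distance(features, threshold):
        rep = pick_representative(members, features)
        library.append(
            {"image_id": images[rep]["image_id"],
             "key_point_distances": features[rep]}
        )
    return library


# Hypothetical data: two mouth corners and one lower-lip point per image.
images = [
    {"image_id": "closed_a", "key_points": [(0, 0), (4, 0), (2, 0.5)]},
    {"image_id": "closed_b", "key_points": [(0, 0), (4, 0), (2, 0.6)]},
    {"image_id": "open_a", "key_points": [(0, 0), (4, 0), (2, 3.0)]},
    {"image_id": "open_b", "key_points": [(0, 0), (4, 0), (2, 3.1)]},
]
library = build_lip_library(images, threshold=0.5)
```

In this sketch the two nearly closed mouths fall into one cluster and the two open mouths into another, so the resulting library holds one entry per lip shape rather than one per collected image.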
Regarding claim 2, Fan in view of Aoyama teaches the method according to claim 1. Fan further teaches wherein the voice is voice data obtained by performing voice synthesis on a text (see Fan, pg. 4885, sect. 2, In the synthesis stage, for any input text with natural or synthesized speech by text-to-speech (TTS)). Aoyama further teaches the alternative wherein the voice is a voice segment obtained by splicing the voice data (see Aoyama, [0095], which teaches the utterance voice for learning being input to the phoneme label assigning unit 25).
Regarding claim 3, Fan in view of Aoyama teaches the method according to claim 1. Fan further teaches wherein the determining a phoneme sequence corresponding to a voice comprises: inputting the voice into a voice-phoneme conversion model to obtain the phoneme sequence output by the voice-phoneme conversion model (see Fan, pg. 4885, sect. 3.1, For natural recordings, the phoneme/state time alignment can be obtained by conducting forced alignment using a trained speech recognition model. For TTS synthesized speech, the phoneme/state sequence and time offset are a by-product of the synthesis process. Therefore, for each speech utterance, we convert the phoneme/state sequence and their time offset into a label sequence, denoting as L=(l1,…,lT), where T is the number of frames in the sequence); the voice-phoneme conversion model is pre-trained based on a recurrent neural network (see Fan, pg. 4884, sect. 1, In this paper, we propose a deep BLSTM-based approach for visual speech synthesis. The audio/visual parallel training data are converted into sequences of contextual labels and visual feature vectors, respectively). Aoyama also teaches inputting the voice into a voice-phoneme conversion model to obtain the phoneme sequence output by the voice-phoneme conversion model (see Aoyama, [0056], The separated utterance moving image for learning is input to the face area detecting unit 22, and the separated utterance voice for learning is input to the phoneme label assigning unit 25).
Regarding claim 4, Fan in view of Aoyama teaches the method according to claim 3. Fan further teaches acquiring training data comprising a voice sample and a phoneme sequence obtained by labeling the voice sample (see Fan, Fig. 1, pg. 4885, sect 2, In the training stage, the audio is converted into a sequence of contextual phoneme labels L; interpreted as labeling voice sample); training the recurrent neural network with the voice sample as input thereof and the phoneme sequence obtained by labeling the voice sample as target output thereof, so as to obtain the voice-phoneme conversion model (see Fan, Fig. 1, pg. 4885, sect 2, Then we train the deep BLSTM(RNN) neural networks to learn the regression model between the two audio and visual parallel sequences by minimizing the SSE of the prediction, in which the input layer is the label sequence L and the output prediction layer is the visual feature sequence V; Fig. 1, NN Model and Prediction is interpreted as the voice-phoneme conversion model). 
Also, Aoyama further teaches acquiring training data comprising a voice sample and a phoneme sequence obtained by labeling the voice sample (see Aoyama, [0065] The viseme label adding unit 28 uses the viseme label assigned to the utterance voice input from the viseme label converting unit 27 to add to the lip image for each frame of the utterance moving image for learning input from the lip image generating unit 24, and outputs the lip image added with the viseme label to the learning sample storing unit 29; lip image with viseme label (from voice input) is interpreted as training data); and training with the voice sample as input thereof and the phoneme sequence obtained by labeling the voice sample as target output thereof, so as to obtain the voice-phoneme conversion model (see Aoyama, [0068] The viseme classifier learning unit 30 is trained based on the learning samples stored in the learning sample storing unit 29).  
Regarding claim 5, Fan in view of Aoyama teaches the method according to claim 1. Fan further teaches smoothing a lip-shape key point corresponding to each phoneme in the phoneme sequence (see Fan, pg. 4885, sect. 3.2, In our system, the visual stream is a sequence of lower face images which are strongly correlated to the underlying speech. As the raw face image is hard to model directly due to the high dimensionality, we use AAM for visual feature extraction. AAM is a joint statistical model compactly representing both the shape and the texture variations and the correlation between them. Fan teaches normalizing the head pose and using the EPCA and PCA to process the N facial feature points of the lower face to construct a visual feature sequence; this is interpreted as smoothing the lip shape key point corresponding to each phoneme).
Claim 12 is directed to a device claim corresponding to the method claim presented in claim 1 and is rejected on the same grounds stated above regarding claim 1.
Claim 13 is directed to a device claim corresponding to the method claim presented in claim 2 and is rejected on the same grounds stated above regarding claim 2.
Claim 14 is directed to a device claim corresponding to the method claim presented in claim 3 and is rejected on the same grounds stated above regarding claim 3.
Claim 15 is directed to a device claim corresponding to the method claim presented in claim 4 and is rejected on the same grounds stated above regarding claim 4.
Claim 16 is directed to a device claim corresponding to the method claim presented in claim 5 and is rejected on the same grounds stated above regarding claim 5.
Claim 20 is directed to a non-transitory computer readable medium claim corresponding to the method claim presented in claim 1 and is rejected on the same grounds stated above regarding claim 1.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Edwards et al., US Patent Application Publication 2018/0253881, teaches a method for animated lip synchronization by mapping phonemes to visemes; synchronizing the visemes into viseme action units, the viseme action units comprising jaw and lip contributions for each of the phonemes; and outputting the viseme action units (see Edwards, [0005]).
Steptoe et al., US Patent 11,270,487, teaches embodiments that analyze muscle characteristics of individual users and may apply those individualized muscle characteristics to computer-generated avatars (see Steptoe, col. 2 lines 40-42).
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NANDINI SUBRAMANI whose telephone number is (571)272-3916. The examiner can normally be reached Monday - Friday 12:00pm - 5:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh M Mehta can be reached on (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/NANDINI SUBRAMANI/ Examiner, Art Unit 2656
/BHAVESH M MEHTA/Supervisory Patent Examiner, Art Unit 2656