DETAILED ACTION
Introduction
This office action is in response to Applicant’s submission filed on 03/18/2021. Claims 1-20 are pending in the application and have been examined.
	
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.  
Claim 1 recites determining lip-shape key point information corresponding to each phoneme in the phoneme sequence.
The limitation of determining lip-shape key point information corresponding to each phoneme in the phoneme sequence, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. For example, “determining” in the context of this claim encompasses the user visually determining lip orientation based on the lip image of the speaker for the particular word sequence. Similarly, the limitation of corresponding the searched lip shape image of each phoneme with each time point to obtain a lip-shape image sequence corresponding to the voice, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. For example, “corresponding” in the context of this claim encompasses the user thinking about which searched lip shape image would match the word sequence. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible.
Claim 2 recites the voice is voice data obtained by performing voice synthesis on a text.
The limitation of the voice is voice data obtained by performing voice synthesis on a text, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. For example, “performing” in the context of this claim encompasses the user reading text. Similarly, the limitation of the voice is a voice segment obtained by splicing the voice data, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. For example, “obtaining” in the context of this claim encompasses the user reading a word at a time. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claim is not patent eligible.
	Claims 3 and 4 recite obtaining and training a voice-phoneme conversion model based on a recurrent neural network to obtain a phoneme from a voice. 
The limitation of obtaining and training, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting “recurrent neural network,” nothing in the claim element precludes the step from practically being performed in the mind. For example, but for the “recurrent neural network” language, “obtaining and training” in the context of this claim encompasses the user being able to speak the words. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application. In particular, the claims recite only one additional element – using a recurrent neural network to perform both the obtaining and training steps. The neural network in both steps is recited at a high level of generality (i.e., as a generic processor performing a generic computer function of obtaining and training) such that it amounts to no more than mere instructions to apply the exception using a generic computer component. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using a recurrent neural network to perform both the obtaining and training steps amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claims are not patent eligible.
Claims 5-11 recite searching and creating a lip shape library. The limitation of processing lip-shape key points, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind. For example, “smoothing” in the context of this claim encompasses the user blending sounds of the word together. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claims are not patent eligible.
	Similarly, claims 12-20 are device and non-transitory computer-readable storage medium versions of method claims 1-8 and are rejected on similar grounds as claims 1-8.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 5-12 and 16-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by B. Fan, L. Wang, F. K. Soong and L. Xie, "Photo-real talking head with deep bidirectional LSTM," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4884-4888.
Regarding claim 1, Fan teaches a method for determining the shape of the lips of a virtual character, comprising: determining a phoneme sequence corresponding to a voice, the phoneme sequence comprising a phoneme corresponding to each time point (see Fan, pg. 4885, sect. 3.1, for each speech utterance, we convert the phoneme/state sequence and their time offset into a label sequence, denoting as L=(l1,…,lt,…,lT), where T is the number of frames in the sequence); determining lip-shape key point information corresponding to each phoneme in the phoneme sequence (see Fan, pg. 4886, sect. 3.2, Subsequently, the lower face sequence with T frames can be represented by the visual feature sequence V=(v1,…,vT). See Fan, pg. 4886, sect. 4.1, In our BLSTM network, as shown in Fig. 3, label sequence L is the input layer, and visual feature sequence V serves as the output layer and H denotes the hidden layer); searching a pre-established lip shape library according to each piece of determined lip-shape key point information, so as to obtain a lip shape image of each phoneme (see Fan, pg. 4886, sect. 4.1, In the training stage, we have multiple sequence pairs of L and V. As we represent both sequences as continuous numerical vectors, the network is treated as a regression model minimizing the SSE of predicting V from L. In the test (or synthesis) stage, given any arbitrary text along with natural or synthesized speech, we firstly convert them into a sequence of labels, then feed into the trained BLSTM network, and the output of the network is the predicted visual AAM feature sequence; the predicted V is interpreted as the pre-established lip shape library); and corresponding the searched lip shape image of each phoneme with each time point to obtain a lip-shape image sequence corresponding to the voice (see Fan, pg. 4885, sect. 2, In the synthesis stage, for any input text with natural or synthesized speech by text-to-speech (TTS), we first extract the label sequence L from them and then predict the visual AAM parameters V using the well trained deep BLSTM network).
 Regarding claim 5, Fan teaches the method according to claim 1.  Fan further teaches smoothing a lip-shape key point corresponding to each phoneme in the phoneme sequence (see Fan, pg. 4885, sect 3.2 In our system, the visual stream is a sequence of lower face images which are strongly correlated to the underlying speech. As the raw face image is hard to model directly due to the high dimensionality, we use AAM for visual feature extraction. AAM is a joint statistical model compactly representing both the shape and the texture variations and the correlation between them. Fan teaches normalizing the head pose and using the EPCA and PCA to process the N facial feature points of the lower face to construct a visual feature sequence; this is interpreted as smoothing the lip shape key point corresponding to each phoneme).
	Regarding claim 6, Fan teaches the method according to claim 1.  Fan further teaches wherein the lip shape library comprises various lip shape images and lip-shape key point information corresponding to the lip shape images (see Fan, pg. 4885, sect 3.2 AAM Visual Feature V discusses lower face sequence with T frames can be represented by the visual feature sequence V=(v1,…,vT) where V is extracted from the shape and texture of the image of the lower face; this is interpreted as the various lip shape images and lip-shape key point information).
Regarding claim 7, Fan teaches the method according to claim 6. Fan further teaches collecting lip shape images of a real person in the speaking process in advance (see Fan, pg. 4885, sect. 3.2, Since the speaker moves his/her head naturally during recording, we perform head pose normalization among all the face images before AAM modeling); clustering the collected lip shape images based on the lip-shape key point information (see Fan, pg. 4886, sect. 3.2, we can reconstruct the shape and texture of the j-th lower face image by only one parameter vector vj, and vj is the j-th appearance parameter vector which we use as AAM visual feature. Subsequently, the lower face sequence with T frames can be represented by the visual feature sequence V=(v1,…,vt,…,vT); this lower face sequence representation is interpreted as clustering the collected lip shape images based on lip-shape key point information); and selecting one lip shape image and the lip-shape key point information corresponding to the lip shape image from each cluster to construct the lip shape library (see Fan, pg. 4886, sect. 4.1, In the test (or synthesis) stage, given any arbitrary text along with natural or synthesized speech, we firstly convert them into a sequence of labels, then feed into the trained BLSTM network, and the output of the network is the predicted visual AAM feature sequence. After reconstructing the AAM feature vectors to RGB images, we can get the photo realistic image sequence of the lower face; the predicted visual AAM feature sequence is interpreted as the selecting one lip shape image and lip-shape key point information to construct the lip shape library).
	Regarding claim 8, Fan teaches the method according to claim 1.  Fan further teaches wherein the lip-shape key point information comprises information of the distances between the key points (see Fan, pg. 4885, sect. 3.2 & Fig. 2(a), The shape of the j-th lower face, Sj, can be represented by the concatenation of the x and y coordinates of N facial feature points: sj=(xj1, xj2, …, xjN, yj1, yj2, …, yjN) where j=1,2,…,J and J is the total number of the face images; x and y coordinates interpreted as the distances between key points).
Regarding claim 9, Fan teaches the method according to claim 6.  Fan further teaches wherein the lip-shape key point information comprises information of the distances between the key points (see Fan, pg. 4885, sect. 3.2 & Fig. 2(a), The shape of the j-th lower face, Sj, can be represented by the concatenation of the x and y coordinates of N facial feature points: sj=(xj1, xj2, …, xjN, yj1, yj2, …, yjN) where j=1,2,…,J and J is the total number of the face images; x and y coordinates interpreted as the distances between key points).
Regarding claim 10, Fan teaches the method according to claim 7.  Fan further teaches wherein the lip-shape key point information comprises information of the distances between the key points (see Fan, pg. 4885, sect. 3.2 & Fig. 2(a), The shape of the j-th lower face, Sj, can be represented by the concatenation of the x and y coordinates of N facial feature points: sj=(xj1, xj2, …, xjN, yj1, yj2, …, yjN) where j=1,2,…,J and J is the total number of the face images; x and y coordinates interpreted as the distances between key points).
Regarding claim 11, Fan teaches the method according to claim 1. Fan further teaches synthesizing the voice and the lip-shape image sequence corresponding to the voice to obtain a virtual character video corresponding to the voice (see Fan, pg. 4885, sect. 2 & Fig. 1, In the synthesis stage, for any input text with natural or synthesized speech by text-to-speech (TTS), we first extract the label sequence L from them and then predict the visual AAM parameters V using the well trained deep BLSTM network. Finally, the predicted AAM visual parameter sequence V can be reconstructed to high quality photo realistic face images and rendering the full face talking head with lip-synced animation).
Claim 12 is a device claim corresponding to the method claim presented in claim 1 and is rejected under the same grounds stated above regarding claim 1.
Claim 16 is a device claim corresponding to the method claim presented in claim 5 and is rejected under the same grounds stated above regarding claim 5.
Claim 17 is a device claim corresponding to the method claim presented in claim 6 and is rejected under the same grounds stated above regarding claim 6.
Claim 18 is a device claim corresponding to the method claim presented in claim 7 and is rejected under the same grounds stated above regarding claim 7.
Claim 19 is a device claim corresponding to the method claim presented in claim 8 and is rejected under the same grounds stated above regarding claim 8.
Claim 20 is a non-transitory computer readable medium claim corresponding to the method claim presented in claim 1 and is rejected under the same grounds stated above regarding claim 1.
	
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2-4 and 13-15 are rejected under 35 U.S.C. 103 as being unpatentable over B. Fan, L. Wang, F. K. Soong and L. Xie, "Photo-real talking head with deep bidirectional LSTM," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4884-4888 in view of Sak et al., US Patent Application Publication 2016/0372119.
Regarding claim 2, Fan teaches the method according to claim 1. Fan further teaches wherein the voice is voice data obtained by performing voice synthesis on a text (see Fan, pg. 4885, sect. 2, In the synthesis stage, for any input text with natural or synthesized speech by text-to-speech (TTS)). However, Fan fails to teach the alternative limitation that the voice is a voice segment obtained by splicing the voice data.
However, Sak teaches that the voice is a voice segment obtained by splicing the voice data (see Sak, [0028] The feature extraction module 102 receives an acoustic sequence and generates a feature representation for frames of acoustic data 110 in the acoustic sequence, e.g., from an audio waveform. For example, the acoustic modeling system 100 may receive a digital representation of an utterance, e.g., as a continuous stream of data).
Fan and Sak are considered to be analogous to the claimed invention because they relate to speech modeling systems using neural networks. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Fan on visual speech synthesis using DNNs with the acoustic modeling neural network teachings of Sak to improve automatic speech recognition (see Sak, [0003], [0004]).
Regarding claim 3, Fan teaches the method according to claim 1. Fan further teaches wherein the determining a phoneme sequence corresponding to a voice comprises: inputting the voice into a voice-phoneme conversion model to obtain the phoneme sequence output by the voice-phoneme conversion model (see Fan, pg. 4885, sect. 3.1, For natural recordings, the phoneme/state time alignment can be obtained by conducting forced alignment using a trained speech recognition model. For TTS synthesized speech, the phoneme/state sequence and time offset are a by-product of the synthesis process. Therefore, for each speech utterance, we convert the phoneme/state sequence and their time offset into a label sequence, denoting as L=(l1,…,lT), where T is the number of frames in the sequence); the voice-phoneme conversion model is pre-trained based on a recurrent neural network (see Fan, pg. 4884, sect. 1, In this paper, we propose a deep BLSTM-based approach for visual speech synthesis. The audio/visual parallel training data are converted into sequences of contextual labels and visual feature vectors, respectively).
Further, Sak also teaches inputting the voice into a voice-phoneme conversion model to obtain the phoneme sequence output by the voice-phoneme conversion model (see Sak, [0027] the system output 130 may be provided to a speech decoder for speech decoding. A speech decoder may receive a system output, e.g., a set of phoneme scores for the system input, generate a phoneme representation of the system input using the set of phoneme scores, and generate a corresponding written transcription of the phoneme representation); the voice-phoneme conversion model is pre-trained based on a recurrent neural network (see Sak, [0030] The neural network system 104 includes a subsampling system 116, a recurrent neural network 120 and a CTC output layer 124. The neural network system is trained to process modified frames of acoustic data 114 and generate respective sets of phoneme scores 126).
	Regarding claim 4, Fan teaches the method according to claim 3. Fan fails to teach acquiring training data comprising a voice sample and a phoneme sequence obtained by labeling the voice sample; and training the recurrent neural network with the voice sample as input thereof and the phoneme sequence obtained by labeling the voice sample as target output thereof, so as to obtain the voice-phoneme conversion model. 
However, Sak teaches acquiring training data comprising a voice sample and a phoneme sequence obtained by labeling the voice sample (see Sak, [0037] The neural network system 104 can be trained on multiple batches of training examples in order to determine trained values of parameters of the neural network layers, i.e., to adjust the values of parameters from initial values to trained values); and training the recurrent neural network with the voice sample as input thereof and the phoneme sequence obtained by labeling the voice sample as target output thereof, so as to obtain the voice-phoneme conversion model (see Sak, [0030] The neural network system 104 includes a subsampling system 116, a recurrent neural network 120 and a CTC output layer 124. The neural network system is trained to process modified frames of acoustic data 114 and generate respective sets of phoneme scores 126. See Sak, [0037] during the training, the neural network system 104 can process a batch of training examples and generate a respective neural network output for each training example in the batch. The neural network outputs can then be used to adjust the values of the parameters of the components of the neural network 104, for example, using state-level minimum Bayes risk (sMBR) sequence discriminative training criterion; the phoneme scores 126 are interpreted as phoneme sequence labels of the voice samples).
Claim 13 is a device claim corresponding to the method claim presented in claim 2 and is rejected under the same grounds stated above regarding claim 2.
Claim 14 is a device claim corresponding to the method claim presented in claim 3 and is rejected under the same grounds stated above regarding claim 3.
Claim 15 is a device claim corresponding to the method claim presented in claim 4 and is rejected under the same grounds stated above regarding claim 4.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Edwards et al., US Patent Application Publication 2018/0253881 teaches a method for animated lip synchronization by mapping phonemes to visemes; synchronizing the visemes into viseme action units, the viseme action units comprising jaw and lip contributions for each of the phonemes; and outputting the viseme action units (see Edwards, [0005]).
Aoyama et al., US Patent Application Publication 2010/0332229 teaches an image acquisition that acquires a temporal sequence of frames of image data, a detecting unit that detects a lip area and a lip image from each of the frames of the image data, a recognition unit that recognizes a word based on the detected lip images of the lip areas, and a controller that controls an operation at the information processing apparatus based on the word recognized by the recognition unit (see Aoyama, [0018]).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NANDINI SUBRAMANI whose telephone number is (571)272-3916. The examiner can normally be reached Monday - Friday, 2:00 pm - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh M Mehta, can be reached on (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/NANDINI SUBRAMANI/
Examiner, Art Unit 2656

/EDGAR X GUERRA-ERAZO/
Primary Examiner, Art Unit 2656