DETAILED ACTION

Introduction
This office action is in response to Applicant’s submission filed on 06/16/2020. Claims 1-20 are pending in the application and have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on [1] is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(d):
(d) REFERENCE IN DEPENDENT FORMS.—Subject to subsection (e), a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

The following is a quotation of pre-AIA  35 U.S.C. 112, fourth paragraph:
Subject to the following paragraph [i.e., the fifth paragraph of pre-AIA  35 U.S.C. 112], a claim in dependent form shall contain a reference to a claim previously set forth and then specify a further limitation of the subject matter claimed. A claim in dependent form shall be construed to incorporate by reference all the limitations of the claim to which it refers.

Claim 16 is rejected under 35 U.S.C. 112(d) or pre-AIA 35 U.S.C. 112, 4th paragraph, as being of improper dependent form for failing to further limit the subject matter of the claim upon which it depends, or for failing to include all the limitations of the claim upon which it depends. Claim 16 refers to identifying phonemes corresponding to a viseme, while claim 1 refers to identifying a viseme based on a phoneme. Applicant may cancel the claim, amend the claim to 
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of pre-AIA  35 U.S.C. 103(a) which forms the basis for all obviousness rejections set forth in this Office action:
(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in section 102, if the differences between the subject matter sought to be patented and the prior art are such that the subject matter as a whole would have been obvious at the time the invention was made to a person having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the manner in which the invention was made.

Claims 1, 3, 9, 11, 12, 13, 15, 16, 17 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US Patent 10,699,705) in view of Heller et al. (US Patent Application Publication 2020/0160581), where Li has been cited in the IDS submitted on 12/20/2021 as US Patent Application Publication 2019/0392853.
Regarding claim 1, Li teaches a computer-implemented method comprising: training a machine-learning algorithm to use look-ahead to improve effectiveness of identifying visemes corresponding to audio signals by, for at least one audio segment in a set of training audio signals, evaluating: the audio segment, where the audio segment includes at least a portion of a phoneme ( see Li col 2, lines 38-41,  viseme-generation application 102 receives an audio sequence from audio input device 105, generates feature vector 115, and uses viseme prediction model 120,  col 4 lines 1-2, viseme-generation application 102 generates training data 130a-n for training viseme prediction model 120; viseme prediction model : interpreted as machine algorithm ); and a subsequent segment that includes contextual audio that comes after the audio segment and potentially contains context about a viseme that maps to the phoneme (see Li col 6, lines 14-15, LSTM model 500 predicts visemes based on feature vectors for past, present, or future windows in time); using the trained machine-learning algorithm to identify at least one probable viseme corresponding to speech in a target audio signal (see Li, col 5, lines 47-50 at block 303, process 300 involves determining a sequence of predicted visemes representing speech for the present subset by applying the feature vector to the viseme prediction model). Li fails to teach recording, as metadata of the target audio signal, where the probable viseme occurs within the target audio signal.  
However Heller teaches using the trained machine-learning algorithm to identify at least one probable viseme corresponding to speech in a target audio signal (see Heller [0066] But using the input audio dataset alone may only allow the viseme detection engine 102 to identify a set of candidate frames that includes both a video frame that depicts a viseme (e.g., frames in which a person's mouth was moving while speaking a word) and frames that do not depict the viseme (e.g., frames in which a person's mouth was not moving due to slurring or under-enunciation while speaking a word; viseme detection engine interpreted as trained machine learning algorithm); and recording, as metadata of the target audio signal, where the probable viseme occurs within the target audio signal (see Heller [0042] the viseme detection engine 102 detects, extracts, and tags a set of viseme frames 124 based on an analysis of the input recording 104 with respect to the reference audio dataset 118; viseme frame interpreted as target audio signal).
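Li and Heller are both considered to be analogous to the claimed invention because both relate to relating lip movement with speech synthesis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li on automation solution to predicting mouth movements based on speech with the automatic detection of visemes in input recordings teachings of Heller to improve the quality of the viseme frames (see Heller [0004]).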
	Regarding claim 3, Li and Heller teach the method of claim 1, Li further teaches wherein: training the machine-learning algorithm comprises extracting a set of features from the set of training audio signals, wherein each feature in the set of features comprises a spectrogram indicating energy levels of a training audio signal (see Li col 5, lines 23-27 feature vector 115 can include energy component 403. Energy component 403 represents the energy of the sequence of the audio samples in the window, for example, using a function such as the log mean energy of the samples); and training the machine-learning algorithm on the set of training audio signals is performed using the extracted set of features (see Li col 8, lines 9-14, viseme prediction model 120 is trained using training data 130a-n. Training data can include a set of feature vector and corresponding predicted visemes. Viseme-generation application 102 can be used to generate training data 130a-n). 
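For illustration only, the "log mean energy of the samples" in a window, as quoted from Li above, can be sketched as follows. This is a minimal sketch and not a characterization of Li's actual implementation; the function name log_mean_energy and the floor parameter are assumptions introduced here.

```python
import math


def log_mean_energy(window, floor=1e-10):
    """Return the log of the mean squared amplitude of one window of samples."""
    if not window:
        raise ValueError("window must contain at least one sample")
    mean_energy = sum(s * s for s in window) / len(window)
    # The floor is an assumption of this sketch; it avoids log(0) on silence.
    return math.log(max(mean_energy, floor))
```

One such value per window would form the energy component of a feature vector alongside any other components.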
	Regarding claim 9, Li and Heller teach the method of claim 1, Li further teaches wherein training the machine-learning algorithm comprises, for each audio segment in the set of training audio signals: calculating, for one or more visemes, the probability of the viseme mapping to the phoneme of the audio segment (see Li, col 9, lines 1-10 At block 601, process 600 involves determining a feature vector for each sample of the respective audio sequence of each set of training data. For example, training data 130a includes audio samples. In that case, the viseme-generation application 102 determines, for a window of audio samples, feature vector 115 in a substantially similar manner as described with respect to block 302 in process 300; interpreted as the viseme prediction model 120, an RNN or LSTM model, being trained with training data); selecting the viseme with a high probability of mapping to the phoneme based on the context from the subsequent segment (see Li col 9, lines 1-23 At block 603, process 600 involves receiving, from the viseme prediction model, a predicted viseme. The viseme-generation application 102 receives a predicted viseme from viseme prediction model 120. The predicted viseme corresponds to the feature vector 115, and to the corresponding input audio sequence from which the feature vector was generated); and modifying the machine-learning algorithm based on a comparison of the selected viseme to a known mapping of visemes to phonemes (see Li, col 9 lines 33-47 at block 605, process 600 involves adjusting internal parameters, or weights, of the viseme prediction model to minimize the loss function. With each iteration, the viseme-generation application 102 seeks to minimize the loss function until viseme prediction model 120 is sufficiently trained. Viseme-generation application 102 can use a backpropagation training method to optimize internal parameters of the LSTM model 500. Backpropagation updates internal parameters of the network to cause a predicted value to be closer to an expected output).
	Regarding claim 11, Li and Heller teach the method of claim 9. Li further teaches wherein selecting the viseme with the high probability of mapping to the phoneme further comprises adjusting the selection based on additional context from a prior segment that includes additional contextual audio that comes before the audio segment (see Li col 6 lines 14-23 LSTM model 500 predicts visemes based on feature vectors for past, present, or future windows in time. LSTM model 500 can consider feature vectors for future windows by delaying the output of the predicted viseme until subsequent feature vectors are received and analyzed. Delay 501, denoted by d, represents the number of time windows of look-ahead. For a current audio feature vector a.sub.t, LSTM model 500 predicts a viseme that appears d windows in the past at v.sub.t-d).
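For illustration only, the look-ahead relationship quoted from Li above (for a current feature vector a_t, the model predicts the viseme at v_{t-d}) can be sketched as a pairing of inputs and training targets. This is a minimal sketch under stated assumptions; the function name lookahead_pairs and the list-based representation are introduced here and do not appear in Li.

```python
def lookahead_pairs(features, visemes, d):
    """Pair the feature vector at step t with the viseme label at step t - d.

    By the time the label for step t - d is emitted, the model has seen d
    further windows of audio, which is the look-ahead context.
    """
    if d < 0:
        raise ValueError("d must be non-negative")
    if len(features) != len(visemes):
        raise ValueError("features and visemes must have equal length")
    pairs = []
    for t in range(d, len(features)):
        pairs.append((features[t], visemes[t - d]))
    return pairs
```

With d = 0 the pairing reduces to ordinary frame-aligned labels; larger d trades output latency for more future context.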
	Regarding claim 12, Li and Heller teach the method of claim 1. Li further teaches wherein training the machine-learning algorithm further comprises: validating the machine-learning algorithm using a set of validation audio signals (see Li col 10, lines 36-47  At block 701, process 700 involves accessing a first set of training data including a first audio sequence representing a sentence spoken by a first speaker and having a first length. For example, viseme-generation application 102 accesses the first set of training data 801. The first set of training data 801 includes viseme sequence 811 and first audio sequence 812. The audio samples in first audio sequence 812 represent a sequence of phonemes. The visemes in viseme sequence 811 are a sequence of visemes, each of which correspond to one or more audio samples in first audio sequence 812; teaches validating ); and testing the machine-learning algorithm using a set of test audio signals (see Li col 11 lines 36-40  at block 705, process 700 involves training a viseme prediction model to predict a sequence of visemes from the first training set and the second training set; blocks 702-704 teaches the testing with the second audio sequence).
	Regarding claim 13, Li and Heller teach the method of claim 12. Li further teaches wherein validating the machine-learning algorithm comprises: standardizing the set of validation audio signals (see Li, col 9 lines 1-10 At block 601, process 600 involves determining a feature vector for each sample of the respective audio sequence of each set of training data. For example, training data 130a includes audio samples. In that case, the viseme-generation application 102 determines, for a window of audio samples, feature vector 115 in a substantially similar manner as described with respect to block 302 in process 300. As discussed with respect to FIGS. 3 and 4, feature vector 115 can include one or more of MFCC component 402, energy component 403, MFCC derivatives 404, and energy level derivative 405); applying the machine-learning algorithm to the standardized set of validation audio signals (see Li col 9 lines 17-18 At block 603, process 600 involves receiving, from the viseme prediction model, a predicted viseme); and evaluating an accuracy of mapping visemes to phonemes of the set of validation audio signals by the machine-learning algorithm (see Li col 9 lines 24-27 At block 604, process 600 involves calculating a loss function by calculating a difference between predicted viseme and the expected viseme. The expected viseme for the feature vector is included in the training data).
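For illustration only, evaluating how closely predicted visemes match the expected visemes can be sketched as a simple match rate. This is a minimal sketch; Li describes a loss function computed from the difference between predicted and expected visemes, whereas the accuracy measure and the name viseme_accuracy below are assumptions introduced here.

```python
def viseme_accuracy(predicted, expected):
    """Return the fraction of positions where predicted and expected visemes match."""
    if len(predicted) != len(expected):
        raise ValueError("sequences must have equal length")
    if not expected:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for p, e in zip(predicted, expected) if p == e)
    return matches / len(expected)
```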
	Regarding claim 15, Li and Heller teach the method of claim 1. Heller further teaches wherein recording where the probable viseme occurs within the target audio signal comprises identifying and recording a probable start time and a probable end time for each identified probable viseme in the target audio signal (see Heller [0045] the viseme detection engine 102 can therefore determine that the frame 108a located at the timestamp 114a corresponding to the input audio data 112a should be tagged as depicting the “D” viseme, the frame 108b located at the timestamp 114b corresponding to the input audio data 112b should be tagged as depicting the “Oh” viseme, and the frame 108c located at the timestamp 114b corresponding to the input audio data 112c should be tagged as depicting the “Ee” viseme. The viseme detection engine 102 can perform this tagging operation and thereby generate a set of viseme frames 124 that include frames 126a-c with the tags 128a-c).
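For illustration only, per-frame viseme tags of the kind Heller describes can be collapsed into records carrying a start time and an end time. This is a minimal sketch under stated assumptions: a fixed frame duration, the dictionary layout, and the name viseme_segments are introduced here and are not taken from Heller.

```python
def viseme_segments(frame_labels, frame_seconds):
    """Merge runs of identical per-frame viseme labels into timed segments."""
    segments = []
    for i, label in enumerate(frame_labels):
        start = i * frame_seconds
        end = (i + 1) * frame_seconds
        if segments and segments[-1]["viseme"] == label:
            # Same viseme as the previous frame: extend the open segment.
            segments[-1]["end"] = end
        else:
            segments.append({"viseme": label, "start": start, "end": end})
    return segments
```

The resulting list could be stored alongside the audio as metadata, one record per identified viseme.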
	Regarding claim 16, Li and Heller teach the method of claim 1. Heller further teaches identifying a set of phonemes that map to each identified probable viseme in the target audio signal (see Heller [0045] the viseme detection engine 102 can determine that the reference audio portions 120a-c match or are sufficiently similar to the input sets of audio data 112a-c. The viseme detection engine 102 can also determine that the input sets of audio data 112a-c respectively include the sounds “D,” “Oh,” and “Ee.” ); and recording, as metadata of the target audio signal, where the set of phonemes occur within the target audio signal (see Heller [0045] The viseme detection engine 102 can therefore determine that the frame 108a located at the timestamp 114a corresponding to the input audio data 112a should be tagged as depicting the “D” viseme, the frame 108b located at the timestamp 114b corresponding to the input audio data 112b should be tagged as depicting the “Oh” viseme, and the frame 108c located at the timestamp 114b corresponding to the input audio data 112c should be tagged as depicting the “Ee” viseme. The viseme detection engine 102 can perform this tagging operation and thereby generate a set of viseme frames 124 that include frames 126a-c with the tags 128a-c).
Regarding claim 17, Li teaches a system comprising: at least one physical processor; physical memory comprising computer-executable instructions that, when executed by the physical processor, cause the physical processor to (see Li system shown in col 12 10-12 The depicted example of a computing system 900 includes a processor 902 communicatively coupled to one or more memory devices 904 ): train a machine-learning algorithm to use look-ahead to improve effectiveness of identifying visemes corresponding to audio signals by, for at least one audio segment in a set of training audio signals, evaluating: the audio segment, where the audio segment includes at least a portion of a phoneme ( see Li col 2, lines 38-41,  viseme-generation application 102 receives an audio sequence from audio input device 105, generates feature vector 115, and uses viseme prediction model 120,  col 4 lines 1-2, viseme-generation application 102 generates training data 130a-n for training viseme prediction model 120; viseme prediction model : interpreted as machine algorithm ); and a subsequent segment that includes contextual audio that comes after the audio segment and potentially contains context about a viseme that maps to the phoneme (see Li col 6, lines 14-15, LSTM model 500 predicts visemes based on feature vectors for past, present, or future windows in time); uses the trained machine-learning algorithm to identify at least one probable viseme corresponding to speech in a target audio signal (see Li, col 5, lines 47-50 at block 303, process 300 involves determining a sequence of predicted visemes representing speech for the present subset by applying the feature vector to the viseme prediction model). Li fails to teach record, as metadata of the target audio signal, where the probable viseme occurs within the target audio signal.  
However Heller teaches uses the trained machine-learning algorithm to identify at least one probable viseme corresponding to speech in a target audio signal (see Heller [0066] But using the input audio dataset alone may only allow the viseme detection engine 102 to identify a set of candidate frames that includes both a video frame that depicts a viseme (e.g., frames in which a person's mouth was moving while speaking a word) and frames that do not depict the viseme (e.g., frames in which a person's mouth was not moving due to slurring or under-enunciation while speaking a word; viseme detection engine interpreted as trained machine learning algorithm); and record, as metadata of the target audio signal, where the probable viseme occurs within the target audio signal (see Heller [0042] the viseme detection engine 102 detects, extracts, and tags a set of viseme frames 124 based on an analysis of the input recording 104 with respect to the reference audio dataset 118; viseme frame interpreted as target audio signal).
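Li and Heller are both considered to be analogous to the claimed invention because both relate to relating lip movement with speech synthesis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li on automation solution to predicting mouth movements based on speech with the automatic detection of visemes in input recordings teachings of Heller to improve the quality of the viseme frames (see Heller [0004]).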
Regarding claim 20, Li teaches a non-transitory computer-readable medium comprising one or more computer- executable instructions that, when executed by at least one processor of a computing device, cause the computing device to: (see Li col 12 lines 21-26 :  A memory device 904 includes any suitable non-transitory computer-readable medium for storing program code 905, program data 907, or both. Program code 905 and program data 907 can be from viseme-generation application 102, viseme prediction model 120, or any other applications or data described herein): train a machine-learning algorithm to use look-ahead to improve effectiveness of identifying visemes corresponding to audio signals by, for at least one audio segment in a set of training audio signals, evaluating: the audio segment, where the audio segment includes at least a portion of a phoneme ( see Li col 2, lines 38-41,  viseme-generation application 102 receives an audio sequence from audio input device 105, generates feature vector 115, and uses viseme prediction model 120,  col 4 lines 1-2, viseme-generation application 102 generates training data 130a-n for training viseme prediction model 120; viseme prediction model : interpreted as machine algorithm ); and a subsequent segment that includes contextual audio that comes after the audio segment and potentially contains context about a viseme that maps to the phoneme (see Li col 6, lines 14-15, LSTM model 500 predicts visemes based on feature vectors for past, present, or future windows in time); use the trained machine-learning algorithm to identify at least one probable viseme corresponding to speech in a target audio signal (see Li, col 5, lines 47-50 at block 303, process 300 involves determining a sequence of predicted visemes representing speech for the present subset by applying the feature vector to the viseme prediction model). 
Li fails to teach record, as metadata of the target audio signal, where the probable viseme occurs within the target audio signal.  However Heller teaches use the trained machine-learning algorithm to identify at least one probable viseme corresponding to speech in a target audio signal (see Heller [0066] But using the input audio dataset alone may only allow the viseme detection engine 102 to identify a set of candidate frames that includes both a video frame that depicts a viseme (e.g., frames in which a person's mouth was moving while speaking a word) and frames that do not depict the viseme (e.g., frames in which a person's mouth was not moving due to slurring or under-enunciation while speaking a word; viseme detection engine interpreted as trained machine learning algorithm); and record, as metadata of the target audio signal, where the probable viseme occurs within the target audio signal (see Heller [0042] the viseme detection engine 102 detects, extracts, and tags a set of viseme frames 124 based on an analysis of the input recording 104 with respect to the reference audio dataset 118; viseme frame interpreted as target audio signal).
Li and Heller are both considered to be analogous to the claimed invention because both relate to relating lip movement with speech synthesis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li on automation solution to predicting mouth movements based on speech with the automatic detection of visemes in input recordings teachings of Heller to improve the quality of the viseme frames (see Heller [0004]).
Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US Patent 10,699,705) in view of Heller et al. (US Patent Application Publication 2020/0160581), further in view of Edwards et al., “JALI: an animator-centric viseme model for expressive lip synchronization,” ACM Trans. Graph. 35, 4, Article 127 (July 2016), where Li has been cited in the IDS submitted on 12/20/2021 as US Patent Application Publication 2019/0392853.
Regarding claim 2, Li and Heller teach the method of claim 1, Heller teaches wherein training the machine-learning algorithm comprises identifying a start time and an end time for each phoneme in the set of training audio signals by at least one of: detecting prelabeled phonemes (see Heller Fig. 1 122A-122C, 124, and [0044-0045] teaches this analysis includes comparing the input audio dataset 110 to the reference audio dataset 118 having reference audio portions 120a-c. The reference audio dataset 118 includes annotations 122a-c that respectively identify phonemes or other sounds within the reference audio portions 120a-c. The viseme detection engine 102 can determine that the reference audio portions 120a-c match or are sufficiently similar to the input sets of audio data 112a-c. The viseme detection engine 102 can also determine that the input sets of audio data 112a-c respectively include the sounds “D,” “Oh,” and “Ee.” The viseme detection engine 102 can therefore determine that the frame 108a located at the timestamp 114a corresponding to the input audio data 112a should be tagged as depicting the “D” viseme).
Li and Heller are both considered to be analogous to the claimed invention because both relate to relating lip movement with speech synthesis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li on automation solution to predicting mouth movements based on speech with the automatic detection of visemes in input recordings teachings of Heller to improve the quality of the viseme frames (see Heller [0004]).
Edwards teaches the claimed alternative of aligning estimated phonemes to a script of each training audio signal in the set of training audio signals (see Edwards, pg. 127:2, col 1, lines 40-47: 2. Forced alignment is employed to align utterances in the soundtrack to the text, giving an output time series containing a sequence of phonemes [Brugnara et al. 1993]. 3. Audio, text and alignment information are combined to give text/phoneme and phoneme/audio correspondences).
Li, Heller and Edwards are considered to be analogous to the claimed invention because they relate to relating lip movement with speech synthesis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li and Heller on automation solution to detect visemes based on speech with the speech synchronization teachings of Edwards to produce realistic outputs for the animation (see Edwards, pg. 127:2, col 1 lines 4-15).
Claims 4, 5, 6, 7 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US Patent 10,699,705) in view of Heller et al. (US Patent Application Publication 2020/0160581), further in view of Howard (US Patent 11,004,461), where Li has been cited in the IDS submitted on 12/20/2021 as US Patent Application Publication 2019/0392853.
Regarding claim 4, Li and Heller teaches the method of claim 3, however Li and Heller fail to teach wherein extracting the set of features comprises, for each training audio signal: dividing the training audio signal into overlapping windows of time; performing a transformation on each windowed audio signal to convert a frequency spectrum for the window of time to a power spectrum indicating a spectral density of the windowed audio signal; computing filter banks for the training audio signal by applying filters that at least partially reflect a scale of human hearing to each power spectrum; and calculating the spectrogram of the training audio signal by combining coefficients of the filter banks.  However Howard teaches wherein extracting the set of features comprises, for each training audio signal: dividing the training audio signal into overlapping windows of time; performing a transformation on each windowed audio signal to convert a frequency spectrum for the window of time to a power spectrum indicating a spectral density of the windowed audio signal; computing filter banks for the training audio signal by applying filters that at least partially reflect a scale of human hearing to each power spectrum; and calculating the spectrogram of the training audio signal by combining coefficients of the filter banks ( see Howard, [0099] An exemplary block diagram of MFCC computation 900 is shown in FIG. 9. As shown in this example, an audio signal 902 is input to pre-emphasis stage 904, the output of which is input to windowing stage 906, the output of which is input to discrete Fourier transform (DFT) stage 908, the output of which is input to power spectrum stage 910. The output from power spectrum stage 910 is input to both filter bank stage 912 and energy spectrum stage 914. 
The output from filter bank stage 912 is input to log stage 916, the output of which is input to discrete cosine transform (DCT) stage 918, the output of which is input to sinusoidal liftering stage 920, the output of which is 12 MFCC data samples 924. Howard [0100] the algorithm is trying to mimic and be as close as possible to the process of frequency perception by the human auditory system).  
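For illustration only, the windowing, discrete Fourier transform, and power spectrum stages recited above can be sketched in plain Python. This is a minimal sketch and not Howard's implementation: the Hamming window, the naive DFT, the 1/N power normalization, the commonly used 2595 * log10(1 + f/700) mel formula, and all function names are assumptions introduced here.

```python
import cmath
import math


def split_into_frames(signal, size, hop):
    """Split a signal into frames of `size` samples; hop < size gives overlap."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def hamming_window(size):
    """Return Hamming window coefficients that taper each frame toward its edges."""
    if size < 2:
        raise ValueError("size must be at least 2")
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (size - 1))
            for n in range(size)]


def power_spectrum(frame):
    """Return |X[k]|^2 / N for k = 0..N/2, computed with a naive DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(x_k) ** 2 / n)
    return spectrum


def windowed_power_spectra(signal, size, hop):
    """Apply the window to each overlapping frame, then take its power spectrum."""
    window = hamming_window(size)
    return [power_spectrum([s * w for s, w in zip(frame, window)])
            for frame in split_into_frames(signal, size, hop)]


def hz_to_mel(hz):
    """Map a frequency in Hz onto the mel scale, which approximates human hearing."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)
```

Filter bank energies would then be obtained by summing each power spectrum under triangular filters spaced evenly on the mel scale, followed by the log and DCT stages.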
Li, Heller and Howard are considered to be analogous to the claimed invention because they relate to vocal feature processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li and Heller on automation solution to detect visemes based on speech with the vocal feature extraction teachings of Howard to produce realistic outputs for the animation (see Edwards, pg. 127:2, col 1 lines 4-15).
Regarding claim 5, Li, Heller and Howard teach the method of claim 4, Li further teaches wherein extracting the set of features further comprises applying a pre-emphasis filter to the set of training audio signals to balance frequencies and reduce noise in the set of training audio signals (see Li col 5, lines 9-16 before computing MFCCs, viseme-generation application 102 can filter the input audio to boost signal quality).
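For illustration only, a pre-emphasis filter of the kind recited in claim 5 is commonly written as y[t] = x[t] - alpha * x[t-1]. This is a minimal sketch; the coefficient 0.97 and the name pre_emphasis are assumptions introduced here, and the cited passage of Li states only that the input audio can be filtered to boost signal quality.

```python
def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies by subtracting a scaled copy of the previous sample."""
    if not signal:
        return []
    # The first sample has no predecessor and is passed through unchanged.
    return [signal[0]] + [signal[t] - alpha * signal[t - 1]
                          for t in range(1, len(signal))]
```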
Regarding claim 6, Li, Heller and Howard teach the method of claim 4, Li further teaches wherein dividing the training audio signal comprises applying a window function to taper the windowed audio signal within each overlapping window of time of the training audio signal (see Li col 5, lines 23-26 feature vector 115 can include energy component 403. Energy component 403 represents the energy of the sequence of the audio samples in the window, for example, using a function such as the log mean energy of the samples).
Regarding claim 7, Li, Heller and Howard teach the method of claim 4, Li further teaches wherein calculating the spectrogram comprises at least one of: performing a logarithmic function to convert the frequency spectrum to a mel scale; extracting frequency bands by applying the filter banks to each power spectrum; performing an additional transformation to the filter banks to decorrelate the coefficients of the filter banks; or computing a new set of coefficients from the transformed filter banks (see Li col 5 lines 9-13, MFCCs are a frequency-based representation with non-linearly spaced frequency bands that roughly match the response of the human auditory system. Feature vector 115 can include any number of MFCCs derived from the audio sequence).
Regarding claim 8, Li, Heller and Howard teach the method of claim 4, Howard further teaches wherein extracting the set of features further comprises standardizing the set of features (see Howard [0086] a window of finite length with abrupt boundaries, such as a rectangular window, is the simplest in the time domain, but creates artifacts in the frequency domain. A function such as the Dirac function, with a thin central peak and maxima tending towards zero elsewhere may be better in the frequency domain. But this type of function has infinite duration once transferred to the time domain, which does not correlate to an ideal time domain window function. Regardless of the selected window, completely avoiding spectral deformation is not possible and the window will not be of infinite length).
Claims 10 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US Patent 10,699,705) in view of Heller et al. (US Patent Application Publication 2020/0160581), further in view of Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh, “Visemenet: audio-driven animator-centric speech animation,” ACM Trans. Graph. 37, 4, Article 161 (August 2018), where Li has been cited in the IDS submitted on 12/20/2021 as US Patent Application Publication 2019/0392853.
Regarding claim 10, Li and Heller teach the method of claim 9, but fail to teach wherein calculating the probability of mapping at least one viseme to the phoneme comprises weighting visually distinctive visemes more heavily than other visemes. However Zhou teaches wherein calculating the probability of mapping at least one viseme to the phoneme comprises weighting visually distinctive visemes more heavily than other visemes (see Zhou phoneme group probability and Viseme prediction in Fig. 3, pg. 161:3 section 3, Viseme prediction. The last part of our network, the “viseme stage” in Figure 3 (right-box), combines the intermediate predictions of phoneme groups, jaw and lip parameters, as well as the audio signal itself to produce visemes. By training our architecture on a combination of data sources containing audio, 2D video, and 3D animation of human speech, we are able to predict visemes accurately. We represent visemes based on the JALI model [Edwards et al. 2016], comprising a set of intensity values for 20 visemes and 9 co-articulation rules, and JAW and LIP parameters that capture; intensity values interpreted as the weighting visually distinctive visemes more heavily than other visemes).
Li, Heller and Zhou are considered to be analogous to the claimed invention because they relate to correlation of speech expressiveness with lip movement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li and Heller on automation solution to detect visemes based on speech with the deep learning based teachings of Zhou to improve the efficiency of producing realistic outputs for the animation (see Zhou, pg. 161:2, col 1 lines 36-46).

Regarding claim 18, Li and Heller teach the system of claim 17, but fail to teach wherein the machine-learning algorithm is trained to identify at least one of: a probable phoneme corresponding to the speech in the target audio signal; and a set of alternate phonemes that map to the probable viseme corresponding to the probable phoneme in the target audio signal. However, Zhou teaches wherein the machine-learning algorithm is trained to identify at least one of: a probable phoneme corresponding to the speech in the target audio signal (see Zhou, pg. 161:3, col. 1, lines 48-51 and Fig. 3, Phoneme group prediction: A large part of our network, the “phoneme group stage” in Figure 3 (top left box), is dedicated to map audio to phonemes groups corresponding to visemes.); and a set of alternate phonemes that map to the probable viseme corresponding to the probable phoneme in the target audio signal (see Zhou, Fig. 2 and pg. 161:3, section 3, Phoneme group prediction: A large part of our network, the “phoneme group stage” in Figure 3 (top left box), is dedicated to map audio to phonemes groups corresponding to visemes. For example, the two labio-dental phonemes /f and v/ form a group that maps to a single, near-identical viseme [Edwards et al. 2016], where the lower lip is pressed against the upper teeth in Figure 2 (last row, right). We identified 20 such visual groups of phonemes expressed in the International Phonetic Alphabet (IPA) in Figure 2).
Li, Heller and Zhou are considered to be analogous to the claimed invention because they relate to the correlation of speech expressiveness with lip movement. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li and Heller on an automated solution for detecting visemes based on speech with the deep-learning-based teachings of Zhou to improve the efficiency of producing realistic outputs for the animation (see Zhou, pg. 161:2, col. 1, lines 36-46).
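For illustration only, the many-to-one grouping of phonemes into visemes relied upon above may be sketched as follows. This sketch is not taken from Zhou, Li, or Heller. Only the /f, v/ labio-dental group comes from the passage of Zhou quoted above; the group names, the second group, and the function names probable_viseme and alternate_phonemes are hypothetical.

```python
# Hypothetical phoneme groups keyed by viseme label. Only the /f, v/
# labio-dental group is taken from the quoted passage of Zhou; the
# bilabial group and all names are assumptions made for illustration.
PHONEME_GROUPS = {
    "labiodental": ("f", "v"),
    "bilabial": ("p", "b", "m"),
}

# Invert the grouping so each phoneme points at its single viseme.
PHONEME_TO_VISEME = {
    phoneme: viseme
    for viseme, phonemes in PHONEME_GROUPS.items()
    for phoneme in phonemes
}


def probable_viseme(phoneme):
    """Return the viseme label for a phoneme (KeyError if unknown)."""
    return PHONEME_TO_VISEME[phoneme]


def alternate_phonemes(phoneme):
    """Return the other phonemes that map to the same viseme."""
    viseme = probable_viseme(phoneme)
    return [p for p in PHONEME_GROUPS[viseme] if p != phoneme]
```

Under these assumptions, a phoneme identified as /f/ yields the labio-dental viseme, and /v/ is the alternate phoneme sharing that viseme.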
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US Patent 10,699,705) in view of Heller et al. (US Patent Application Publication 2020/0160581), further in view of Thangthai, K., Bear, H. L., & Harvey, R. (2018), “Comparing phonemes and visemes with DNN-based lipreading,” arXiv preprint arXiv:1805.02924.
Regarding claim 14, Li and Heller teach the method of claim 12. Li further teaches wherein testing the machine-learning algorithm comprises: standardizing the set of test audio signals (see Li, col. 9, lines 1-10: At block 601, process 600 involves determining a feature vector for each sample of the respective audio sequence of each set of training data. For example, training data 130a includes audio samples. In that case, the viseme-generation application 102 determines, for a window of audio samples, feature vector 115 in a substantially similar manner as described with respect to block 302 in process 300. As discussed with respect to FIGS. 3 and 4, feature vector 115 can include one or more of MFCC component 402, energy component 403, MFCC derivatives 404, and energy level derivative 405); applying the machine-learning algorithm to the standardized set of test audio signals (see Li, col. 9, lines 17-18: At block 603, process 600 involves receiving, from the viseme prediction model, a predicted viseme).
Li and Heller are both considered to be analogous to the claimed invention because both relate lip movement to speech synthesis. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li on an automated solution for predicting mouth movements based on speech with the teachings of Heller on automatic detection of visemes in input recordings to improve the quality of the viseme frames (see Heller [0004]).
However, Li and Heller fail to teach comparing an accuracy of mapping visemes to phonemes of the set of test audio signals by the machine-learning algorithm with an accuracy of at least one alternate machine-learning algorithm; and selecting an accurate machine-learning algorithm based on the comparison. Thangthai, however, teaches comparing an accuracy of mapping visemes to phonemes of the set of test audio signals by the machine-learning algorithm with an accuracy of at least one alternate machine-learning algorithm (see Thangthai, pg. 4, section 2.3: Our DNN-HMM visual speech model training involves all five successive stages. Here, we detail the development of the visual speech model that we employ in this work including all steps and parameters); and selecting an accurate machine-learning algorithm based on the comparison (see Thangthai, pg. 7, sections 5.2 and 5.3: Table 4 shows the word and phoneme accuracies achieved with our phoneme-based lipreading system. This system achieved the most accurate lipreading with a word accuracy of 48.74%. It is interesting that with the phoneme recogniser, word accuracy is greater than phoneme accuracy, because in the viseme recogniser, this is vice versa. Again, highest accuracy is achieved with Eigenlip features rather than DCT. One interesting observation apparent in Tables 3 and 4 is that the introduction of the DNN makes little difference to the unit accuracy but a bigger difference to a word accuracy for both DCT and eignlips features).
Li, Heller and Thangthai are considered to be analogous to the claimed invention because they relate to text-to-speech animation and synchronization techniques. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li and Heller on a deep learning solution for detecting visemes based on speech with the phoneme classifier teachings of Thangthai to improve lip-reading accuracy (see Thangthai, pg. 9, section 6).
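For illustration only, the comparison and selection recited in claim 14 may be sketched as scoring each candidate algorithm on the same labeled test set and keeping the highest scorer. This sketch does not reproduce the DNN-HMM systems of Thangthai; the function names accuracy and select_most_accurate, and the representation of an algorithm as a callable from features to a viseme label, are hypothetical.

```python
def accuracy(predict, samples):
    """Fraction of (features, expected_label) pairs predicted correctly.

    `predict` is a hypothetical stand-in for a trained algorithm: any
    callable that maps features to a predicted label.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    correct = sum(1 for features, expected in samples
                  if predict(features) == expected)
    return correct / len(samples)


def select_most_accurate(candidates, samples):
    """Score every candidate on the same test set and pick the best.

    `candidates` maps a name to a predict callable. Returns the name of
    the highest-scoring candidate together with all scores.
    """
    scores = {name: accuracy(predict, samples)
              for name, predict in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Because every candidate is evaluated on the identical test set, the resulting accuracies are directly comparable, which is the premise of selecting an algorithm based on the comparison.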
Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US Patent 10,699,705) in view of Heller et al. (US Patent Application Publication 2020/0160581), further in view of Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh, “Visemenet: audio-driven animator-centric speech animation,” ACM Trans. Graph. 37, 4, Article 161 (August 2018), further in view of Suriyah et al., "Idhazhi: A Min-Max Algorithm for Viseme to Phoneme Mapping", International Journal of Innovative Technology and Exploring Engineering, February 2020, Vol. 9, No. 4, 588-594, where Li has been cited in the IDS submitted on 12/20/2021 as US Patent Application Publication 2019/0392853 and Suriyah has been cited in the IDS submitted on 12/20/2021.
Regarding claim 19, Li, Heller and Zhou teach the system of claim 18. Heller further teaches wherein the computer-executable instructions, when executed by the physical processor, further cause the physical processor to: provide the metadata indicating where the probable viseme occurs within the target audio signal to a user (see Heller [0042]: The viseme detection engine 102 detects, extracts, and tags a set of viseme frames 124 based on an analysis of the input recording 104 with respect to the reference audio dataset 118). However, Li, Heller and Zhou fail to teach provide, to the user, the set of alternate phonemes that map to the probable viseme to improve selection of translations for the speech in the target audio signal. Suriyah, however, teaches provide, to the user, the set of alternate phonemes that map to the probable viseme to improve selection of translations for the speech in the target audio signal (see Suriyah, pg. 588, col. 2, lines 28-33: Idhazhi, a system to suggest words which match a particular viseme sequence is proposed in this paper. This system finds words matching the viseme sequence of a particular word, in four levels - perfect, optimized, semi-perfect and compacted.).
Li, Heller, Zhou and Suriyah are considered to be analogous to the claimed invention because they relate to text-to-speech animation and synchronization techniques. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li, Heller and Zhou on a deep learning solution for detecting visemes based on speech with the language tool teachings of Suriyah to improve the subtitling and dubbing of media content (see Suriyah, pg. 588, section 1).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Theobald et al. (US Patent Application Publication 2017/0154457) teaches speech animation performed using visemes with phonetic boundary context (see Theobald, [0005]).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NANDINI SUBRAMANI whose telephone number is (571)272-3916. The examiner can normally be reached Monday - Friday, 2:00 pm - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh M Mehta, can be reached at (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.







/NANDINI SUBRAMANI/Examiner, Art Unit 2656    

/EDGAR X GUERRA-ERAZO/Primary Examiner, Art Unit 2656