DETAILED ACTION
Introduction
This office action is in response to Applicant’s submission filed on 04/05/2022. Claims 1-20 are pending in the application and have been examined.
	
Notice of Pre-AIA or AIA Status
The present application is being examined under the pre-AIA first to invent provisions.

Response to Amendment
The response filed on 04/05/2022 has been accepted and considered in this Office Action. Claims 1-20 have been examined. Applicant’s amendments to claim 16, indicating an additional set of phonemes, overcome the 35 U.S.C. 112(d) rejection previously set forth in the Non-Final Office Action mailed 01/05/2022. Therefore, the above-referenced rejection under 35 U.S.C. 112(d) is withdrawn.

Response to Arguments
Applicant's arguments filed 04/05/2022 have been fully considered as follows:
Applicant’s arguments with respect to claims 1-20 on pg. 12 state that
“…because the art of record fails to disclose, teach, or suggest each and every feature of the independent claims, this art fails to represent art sufficient to establish a prima facie obviousness rejection...”
	
Applicant’s arguments above with respect to independent claims 1, 17 and 20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
With respect to the remaining dependent claims rejected under 35 U.S.C. 103, to the extent those claims are discussed and/or argued for at least the same rationale presented in the Remarks filed 04/05/2022, Examiner respectfully notes as follows. For completeness, should those claims be traversed for reasons similar to those given for independent claims 1, 17 and 20, Examiner respectfully directs Applicant to the reasons provided above in the response directed to the independent claims. For at least those same reasons, Examiner respectfully disagrees; Applicant's arguments have been fully considered but are not persuasive.

Claim Rejections - 35 USC § 103
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
Claims 1, 2, 3, 9, 11, 12, 13, 16, 17, 18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US Patent 10,699,705) in view of Theobald et al. (US Patent Application Publication 2017/0154457), where Li has been cited in the IDS submitted on 12/20/2021 as US Patent Application Publication 2019/0392853.
Regarding claim 1, Li teaches training a machine-learning algorithm to identify visemes corresponding to phonemes in a set of training audio signals (see Li col 2, lines 38-41, viseme-generation application 102 receives an audio sequence from audio input device 105, generates feature vector 115, and uses viseme prediction model 120; col 4, lines 1-2, viseme-generation application 102 generates training data 130a-n for training viseme prediction model 120; the viseme prediction model is interpreted as the machine-learning algorithm); using the trained machine-learning algorithm to identify two or more probable visemes corresponding to speech in a target audio signal (see Li, col 5, lines 47-50, at block 303, process 300 involves determining a sequence of predicted visemes representing speech for the present subset by applying the feature vector to the viseme prediction model). Li fails to teach receiving, from a user, a request to prioritize identification of visemes having a specified attribute; selecting, from the two or more probable visemes, a probable viseme having the specified attribute; and recording, as metadata of the target audio signal, where the probable viseme having the specified attribute occurs within the target audio signal.
However, Theobald teaches receiving, from a user, a request to prioritize identification of visemes having a specified attribute (see Theobald, [0053] Returning to FIG. 1, the potential set component 112 may be configured to determine potential sets of viseme units that correspond to individual ones of the phoneme string portions. The potential sets may be determined based on viseme units and/or sets of viseme units that may be grouped together by the viseme manager component 108); selecting, from the two or more probable visemes, a probable viseme having the specified attribute (see Theobald, [0057] In some implementations, the selection component 114 may be configured to determine a match between individual potential sets and a corresponding phoneme string portion based on one or more fit metrics. In some implementations, a fit metric may convey matches between potential sets and phoneme string portions based on one or more of an animation cost for using a given potential set, a smoothness or “natural” look of an animation using a given potential set, and/or other metrics); and recording, as metadata of the target audio signal, where the probable viseme having the specified attribute occurs within the target audio signal (see Theobald, [0075] In some implementations, the presentation component 116 may be configured to synchronize jaw, lips, teeth, tongue, and/or other facial feature movement of an animation entity with audio corresponding to a phoneme string. This may be accomplished via one or more lip-synching techniques and/or other techniques. In some implementations, the audio may comprise an audio recording of user speech. In some implementations, the audio may comprise a machine-generated speech based on an input phoneme string (e.g., using text to speech techniques, and/or other techniques)).
Li and Theobald are considered to be analogous to the claimed invention because both relate to relating visemes with speech. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li on automation solution to predicting mouth movements based on speech with the speech animation teachings of Theobald to improve the quality of facial movement which is smoother and more closely resembles human facial movement during speech (see Theobald [0003]).
Application No.: 16/903,373; Attorney's Docket No.: 010704.0054U1
Regarding claim 2, Li in view of Theobald teaches the method of claim 1. Theobald further teaches wherein training the machine-learning algorithm comprises causing the machine-learning algorithm to weight visually distinctive visemes more heavily than other visemes (see Theobald, [0069] By way of non-limiting illustration in FIG. 6, an exemplary visual representation of fit metrics associated with the first set 400 and second set 500 is shown. In some implementations, a fit metric may be based on one or more of an animation cost, smoothness, and/or other metrics. The values of the fit metrics for the first set 400 and second set 500 are shown as numerical values. However, fit metrics may be expressed in other ways. The selection component 114 may be configured to select one of the potential sets based on one or more of the fit metrics; the fit metrics are interpreted as weighting visually distinctive visemes more heavily).
Regarding claim 3, Li in view of Theobald teaches the method of claim 1. Li further teaches wherein: training the machine-learning algorithm comprises extracting a set of features from the set of training audio signals, wherein each feature in the set of features comprises a spectrogram indicating energy levels of a training audio signal (see Li col 5, lines 23-27, feature vector 115 can include energy component 403. Energy component 403 represents the energy of the sequence of the audio samples in the window, for example, using a function such as the log mean energy of the samples); and training the machine-learning algorithm on the set of training audio signals is performed using the extracted set of features (see Li col 8, lines 9-14, viseme prediction model 120 is trained using training data 130a-n. Training data can include a set of feature vector and corresponding predicted visemes. Viseme-generation application 102 can be used to generate training data 130a-n).
Regarding claim 9, Li in view of Theobald teaches the method of claim 1. Li further teaches wherein training the machine-learning algorithm comprises, for each audio segment in the set of training audio signals: calculating, for one or more visemes, the probability of the viseme mapping to the phoneme of the audio segment (see Li, col 9, lines 1-10, At block 601, process 600 involves determining a feature vector for each sample of the respective audio sequence of each set of training data. For example, training data 130a includes audio samples. In that case, the viseme-generation application 102 determines, for a window of audio samples, feature vector 115 in a substantially similar manner as described with respect to block 302 in process 300; interpreted as the viseme prediction model 120, an RNN or LSTM model, being trained with training data); selecting the viseme with a high probability of mapping to the phoneme based on the context from the subsequent segment (see Li, col 9, lines 1-23, At block 603, process 600 involves receiving, from the viseme prediction model, a predicted viseme. The viseme-generation application 102 receives a predicted viseme from viseme prediction model 120. The predicted viseme corresponds to the feature vector 115, and to the corresponding input audio sequence from which the feature vector was generated); and modifying the machine-learning algorithm based on a comparison of the selected viseme to a known mapping of visemes to phonemes (see Li, col 9, lines 33-47, at block 605, process 600 involves adjusting internal parameters, or weights, of the viseme prediction model to minimize the loss function. With each iteration, the viseme-generation application 102 seeks to minimize the loss function until viseme prediction model 120 is sufficiently trained. Viseme-generation application 102 can use a backpropagation training method to optimize internal parameters of the LSTM model 500. Backpropagation updates internal parameters of the network to cause a predicted value to be closer to an expected output).
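For illustration only, the predict, compare and adjust cycle cited from Li (blocks 603-605) may be sketched as follows. This is not Li's implementation: a small softmax linear classifier stands in for the LSTM, cross-entropy is assumed as the loss, and the names `softmax`, `cross_entropy` and `train_step` are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities, one per candidate viseme."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def cross_entropy(probs, expected):
    """Loss is small when the expected viseme receives high probability."""
    return -math.log(probs[expected])

def train_step(weights, features, expected, lr=0.5):
    """One gradient-descent update; weights has one row per viseme class.

    Assumption: a linear model replaces the LSTM so the update fits in a
    few lines. Returns the loss measured before the update.
    """
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(logits)
    loss = cross_entropy(probs, expected)
    for k, row in enumerate(weights):
        # Gradient of cross-entropy with respect to logit k.
        grad = probs[k] - (1.0 if k == expected else 0.0)
        for j in range(len(row)):
            row[j] -= lr * grad * features[j]
    return loss

# Three hypothetical viseme classes, two-dimensional feature vector.
weights = [[0.0, 0.0] for _ in range(3)]
losses = [train_step(weights, [1.0, 0.5], expected=2) for _ in range(5)]
```

Under these assumptions, each call moves the predicted distribution toward the expected viseme, so the recorded losses shrink from one iteration to the next.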
Regarding claim 11, Li in view of Theobald teaches the method of claim 9. Li further teaches wherein selecting the viseme with the high probability of mapping to the phoneme further comprises adjusting the selection based on additional context from a prior segment that includes additional contextual audio that comes before the audio segment (see Li col 6, lines 14-23, LSTM model 500 predicts visemes based on feature vectors for past, present, or future windows in time. LSTM model 500 can consider feature vectors for future windows by delaying the output of the predicted viseme until subsequent feature vectors are received and analyzed. Delay 501, denoted by d, represents the number of time windows of look-ahead. For a current audio feature vector a_t, LSTM model 500 predicts a viseme that appears d windows in the past at v_{t-d}).
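For illustration only, the delayed output described in the cited passage of Li may be sketched as follows. The sketch shows only the timing relationship (the prediction for window t-d is emitted once window t has arrived); `delayed_predictions` and `toy_predict` are hypothetical names, and the threshold rule is a stand-in for the LSTM, not Li's model.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

def delayed_predictions(
    features: Iterable[float],
    predict: Callable[[List[float], int], str],
    d: int,
) -> Iterator[Tuple[int, str]]:
    """Yield (t - d, viseme) after feature a_t is received.

    The predictor sees every window up to t, so the label for window
    t - d can draw on d windows of look-ahead.
    """
    seen: List[float] = []
    for t, a_t in enumerate(features):
        seen.append(a_t)
        if t >= d:
            yield t - d, predict(seen, t - d)

def toy_predict(context: List[float], i: int) -> str:
    # Hypothetical rule: label window i "open" if it or any look-ahead
    # window received so far has energy above 0.5.
    return "open" if max(context[i:]) > 0.5 else "closed"

energies = [0.1, 0.2, 0.9, 0.1]
result = list(delayed_predictions(energies, toy_predict, d=1))
# result == [(0, "closed"), (1, "open"), (2, "open")]
```

With d = 1, window 1 is labeled "open" only because the louder window 2 has already been received, which is the effect of the look-ahead.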
Regarding claim 12, Li in view of Theobald teaches the method of claim 1. Li further teaches wherein training the machine-learning algorithm further comprises: validating the machine-learning algorithm using a set of validation audio signals (see Li col 10, lines 36-47, At block 701, process 700 involves accessing a first set of training data including a first audio sequence representing a sentence spoken by a first speaker and having a first length. For example, viseme-generation application 102 accesses the first set of training data 801. The first set of training data 801 includes viseme sequence 811 and first audio sequence 812. The audio samples in first audio sequence 812 represent a sequence of phonemes. The visemes in viseme sequence 811 are a sequence of visemes, each of which correspond to one or more audio samples in first audio sequence 812; interpreted as teaching validating); and testing the machine-learning algorithm using a set of test audio signals (see Li col 11, lines 36-40, at block 705, process 700 involves training a viseme prediction model to predict a sequence of visemes from the first training set and the second training set; blocks 702-704 teach the testing with the second audio sequence).
	Regarding claim 13, Li in view of Theobald teach the method of claim 12. Li further teaches wherein validating the machine-learning algorithm comprises: standardizing the set of validation audio signals (see Li, col 9 lines 1-10 At block 601, process 600 involves determining a feature vector for each sample of the respective audio sequence of each set of training data. For example, training data 130a includes audio samples. In that case, the viseme-generation application 102 determines, for a window of audio samples, feature vector 115 in a substantially similar manner as described with respect to block 302 in process 300. As discussed with respect to FIGS. 3 and 4, feature vector 115 can include one or more of MFCC component 402, energy component 403, MFCC derivatives 404, and energy level derivative 405); applying the machine-learning algorithm to the standardized set of validation audio signals (see Li col 9 lines 17-18 At block 603, process 600 involves receiving, from the viseme prediction model, a predicted viseme); and evaluating an accuracy of mapping visemes to phonemes of the set of validation audio signals by the machine-learning algorithm (see Li col 9 lines 24-27 At block 604, process 600 involves calculating a loss function by calculating a difference between predicted viseme and the expected viseme. The expected viseme for the feature vector is included in the training data).
Regarding claim 16, Li in view of Theobald teaches the method of claim 1. Theobald further teaches identifying an additional set of phonemes that map to each identified probable viseme in the target audio signal (see Theobald, [0054] In some implementations, the potential set component 112 may be configured to employ a hash table and/or other information to determine potential sets. A hash table may associate keys with buckets. The keys may include phonemes and/or phoneme sequences. The buckets may include a list of viseme units and/or sets of viseme units that may match and/or substantially match a phoneme and/or phoneme sequence (e.g., based on context labels of the viseme units and/or other information)); and recording, as metadata of the target audio signal, where the set of phonemes occur within the target audio signal (see Theobald, [0045] FIG. 5 illustrates an exemplary representation of a second set 500 of viseme units. The second set 500 may include a third viseme unit 502, a fourth viseme unit 504, and/or other viseme units. Also illustrated are exemplary labels 508 that may be associated with the viseme units. The labels 508 may describe one or more complete phonemes and/or phoneme context for the complete phonemes associated with the viseme units. The phoneme context may include a partial phoneme adjacent to a complete phoneme).
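For illustration only, the hash-table lookup described in Theobald [0054] (keys that are phoneme sequences, buckets that list matching viseme units) may be sketched as follows. The function names, the unit identifiers and the fixed two-phoneme portion length are assumptions made for the sketch and do not appear in the reference.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Key = Tuple[str, ...]

def build_table(units: Sequence[Tuple[str, Sequence[str]]]) -> Dict[Key, List[str]]:
    """Map each phoneme sequence (key) to a bucket of viseme-unit ids."""
    table: Dict[Key, List[str]] = defaultdict(list)
    for unit_id, phonemes in units:
        table[tuple(phonemes)].append(unit_id)
    return table

def potential_sets(
    table: Dict[Key, List[str]], phoneme_string: Sequence[str], n: int = 2
) -> List[Tuple[Key, List[str]]]:
    """Look up candidate viseme units for each n-phoneme portion.

    Assumption: portions are fixed-length sliding windows; a portion
    with no matching key simply yields an empty bucket.
    """
    out = []
    for i in range(len(phoneme_string) - n + 1):
        key = tuple(phoneme_string[i:i + n])
        out.append((key, list(table.get(key, []))))
    return out

# Hypothetical viseme units labeled with the phonemes they cover.
units = [("u1", ["HH", "EH"]), ("u2", ["EH", "L"]), ("u3", ["HH", "EH"])]
table = build_table(units)
candidates = potential_sets(table, ["HH", "EH", "L", "OW"])
# candidates == [(("HH", "EH"), ["u1", "u3"]), (("EH", "L"), ["u2"]), (("L", "OW"), [])]
```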
Regarding claim 17, the claim is directed to a system corresponding to the method presented in claim 1 and is rejected under the same grounds stated above regarding claim 1.
Regarding claim 18, the claim is directed to a system corresponding to the method presented in claim 2 and is rejected under the same grounds stated above regarding claim 2.
Regarding claim 20, the claim is directed to a non-transitory computer readable medium corresponding to the method presented in claim 1 and is rejected under the same grounds stated above regarding claim 1.
Claims 4, 5, 6 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US Patent 10,699,705) in view of Theobald et al. (US Patent Application Publication 2017/0154457) further in view of Howard (US Patent 11,004,461), where Li has been cited in the IDS submitted on 12/20/2021 as US Patent Application Publication 2019/0392853.
Regarding claim 4, Li in view of Theobald teaches the method of claim 3. However, Li and Theobald fail to teach wherein extracting the set of features comprises, for each training audio signal: dividing the training audio signal into overlapping windows of time; performing a transformation on each windowed audio signal to convert a frequency spectrum for the window of time to a power spectrum indicating a spectral density of the windowed audio signal; computing filter banks for the training audio signal by applying filters that at least partially reflect a scale of human hearing to each power spectrum; and calculating the spectrogram of the training audio signal by combining coefficients of the filter banks. However, Howard teaches wherein extracting the set of features comprises, for each training audio signal: dividing the training audio signal into overlapping windows of time; performing a transformation on each windowed audio signal to convert a frequency spectrum for the window of time to a power spectrum indicating a spectral density of the windowed audio signal; computing filter banks for the training audio signal by applying filters that at least partially reflect a scale of human hearing to each power spectrum; and calculating the spectrogram of the training audio signal by combining coefficients of the filter banks (see Howard, [0099] An exemplary block diagram of MFCC computation 900 is shown in FIG. 9. As shown in this example, an audio signal 902 is input to pre-emphasis stage 904, the output of which is input to windowing stage 906, the output of which is input to discrete Fourier transform (DFT) stage 908, the output of which is input to power spectrum stage 910. The output from power spectrum stage 910 is input to both filter bank stage 912 and energy spectrum stage 914. The output from filter bank stage 912 is input to log stage 916, the output of which is input to discrete cosine transform (DCT) stage 918, the output of which is input to sinusoidal liftering stage 920, the output of which is 12 MFCC data samples 924. Howard [0100] the algorithm is trying to mimic and be as close as possible to the process of frequency perception by the human auditory system).
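For illustration only, the stage sequence cited from Howard [0099] (pre-emphasis, windowing, DFT, power spectrum, filter bank, log, DCT) may be sketched as follows. This is a simplified sketch and not Howard's implementation: the energy spectrum branch and the sinusoidal liftering stage are omitted, and the Hamming window, the mel formula, the frame size, the hop and the filter and coefficient counts are all assumptions.

```python
import cmath
import math

def pre_emphasis(signal, alpha=0.97):
    # Boost high frequencies: y[n] = x[n] - alpha * x[n-1].
    return [signal[0]] + [signal[i] - alpha * signal[i - 1] for i in range(1, len(signal))]

def frames(signal, size, hop):
    # Overlapping windows of time (hop < size gives the overlap).
    return [signal[s:s + size] for s in range(0, len(signal) - size + 1, hop)]

def hamming(frame):
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))) for i, x in enumerate(frame)]

def power_spectrum(frame):
    # Naive DFT over the non-negative frequency bins, then |X[k]|^2 / N.
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))) ** 2 / n
            for k in range(n // 2 + 1)]

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, rate):
    # Triangular filters spaced evenly on the mel scale.
    top = hz_to_mel(rate / 2)
    mels = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / rate) for m in mels]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank

def dct2(values, n_out):
    # Unnormalized DCT-II, which decorrelates the log filter-bank energies.
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_out)]

def mfcc(signal, rate, size=64, hop=32, n_filters=10, n_coeffs=5):
    bank = mel_filterbank(n_filters, size, rate)
    result = []
    for frame in frames(pre_emphasis(signal), size, hop):
        power = power_spectrum(hamming(frame))
        energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
        # Floor the energies so empty filters do not produce log(0).
        result.append(dct2([math.log(max(e, 1e-10)) for e in energies], n_coeffs))
    return result

rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(256)]
coeffs = mfcc(tone, rate)  # 7 overlapping frames, 5 coefficients each
```

The naive DFT keeps the sketch free of external dependencies; a practical implementation would use an FFT routine instead.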
Li, Theobald and Howard are considered to be analogous to the claimed invention because they relate to vocal feature processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li and Theobald on an automated solution to detect visemes based on speech with the vocal feature extraction teachings of Howard to interpret the paralinguistic content of speech rather than just the linguistic content (see Howard, [0003]).
Regarding claim 5, Li in view of Theobald further in view of Howard teaches the method of claim 4. Li further teaches wherein extracting the set of features further comprises applying a pre-emphasis filter to the set of training audio signals to balance frequencies and reduce noise in the set of training audio signals (see Li col 5, lines 9-16, before computing MFCCs, viseme-generation application 102 can filter the input audio to boost signal quality).
Regarding claim 6, Li in view of Theobald further in view of Howard teaches the method of claim 4. Li further teaches wherein dividing the training audio signal comprises applying a window function to taper the windowed audio signal within each overlapping window of time of the training audio signal (see Li col 5, lines 23-26, feature vector 115 can include energy component 403. Energy component 403 represents the energy of the sequence of the audio samples in the window, for example, using a function such as the log mean energy of the samples).
Regarding claim 7, Li in view of Theobald further in view of Howard teaches the method of claim 4. Li further teaches wherein calculating the spectrogram comprises at least one of: performing a logarithmic function to convert the frequency spectrum to a mel scale; extracting frequency bands by applying the filter banks to each power spectrum; performing an additional transformation to the filter banks to decorrelate the coefficients of the filter banks; or computing a new set of coefficients from the transformed filter banks (see Li col 5, lines 9-13, MFCCs are a frequency-based representation with non-linearly spaced frequency bands that roughly match the response of the human auditory system. Feature vector 115 can include any number of MFCCs derived from the audio sequence).
Claims 8, 15 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US Patent 10,699,705) in view of Theobald et al. (US Patent Application Publication 2017/0154457) further in view of M. A. Berger, G. Hofer and H. Shimodaira, "Carnival—Combining Speech Technology and Computer Animation," in IEEE Computer Graphics and Applications, vol. 31, no. 5, pp. 80-89, September-October 2011, where Li has been cited in the IDS submitted on 12/20/2021 as US Patent Application Publication 2019/0392853.
Regarding claim 8, Li in view of Theobald teaches the method of claim 1. Theobald teaches the timeline comprising a graphical representation of the probable viseme placed in the timeline where the probable viseme occurs within the target audio signal (see Theobald, [0024] By way of non-limiting illustration in FIG. 3, an exemplary phoneme string portion 302 is shown. The phoneme string portion 302 may sequentially include a first phoneme 304, a second phoneme 306, a third phoneme 308, and/or other phonemes; Theobald [0041] By way of non-limiting illustration in FIG. 4, an exemplary representation of a first set 400 of viseme units is shown. The first set 400 may include a first viseme unit 402, a second viseme unit 404, and/or other viseme units. Fig. 3 depicts the timeline of the phonemes and, together with Fig. 4, is interpreted as a graphical representation of the probable viseme in a timeline within the target audio signal). However, Li in view of Theobald fails to teach presenting, to the user via a graphical user interface, a timeline synchronized with a playhead marker.
However, Berger teaches presenting, to the user via a graphical user interface, a timeline synchronized with a playhead marker (see Berger, pg. 87, Fig. 6, An Event, containing an ordered list of sequences representing the same temporal event, such as an utterance. The timeseries members share a common time domain with current elapsed time t. This event includes a string, an audio, two numericaltimeseries (to one of which a visualizer is bound), a categoricaltimeseries, and a video. An event has playback functions such as play, pause, and seek, which control synchronous output of the set's real-time members (signals and bound numericaltimeseries); the marker in the Figure depicts the playhead marker).
Li, Theobald and Berger are considered to be analogous to the claimed invention because they relate to vocal feature processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li and Theobald on an automated solution to detect visemes based on speech with the vocal feature extraction teachings of Howard to interpret the paralinguistic content of speech rather than just the linguistic content (see Howard, [0003]).
Regarding claim 15, Li in view of Theobald teaches the method of claim 1. Theobald teaches a graphical representation of the probable viseme, the graphical representation of the probable viseme being placed in the timeline where the probable viseme occurs within the target audio signal (see Theobald, [0024] By way of non-limiting illustration in FIG. 3, an exemplary phoneme string portion 302 is shown. The phoneme string portion 302 may sequentially include a first phoneme 304, a second phoneme 306, a third phoneme 308, and/or other phonemes; Theobald [0041] By way of non-limiting illustration in FIG. 4, an exemplary representation of a first set 400 of viseme units is shown. The first set 400 may include a first viseme unit 402, a second viseme unit 404, and/or other viseme units. Fig. 3 depicts the timeline of the phonemes and, together with Fig. 4, is interpreted as a graphical representation of the probable viseme in a timeline within the target audio signal). However, Li in view of Theobald fails to teach presenting, to the user via a graphical user interface, a synchronized timeline. Berger teaches presenting, to the user via a graphical user interface, a synchronized timeline (see Berger, Fig. 6, pg. 88, the developer determines the user interface of the Carnival. In designing our in-house application, we've studied GUIs from a variety of exemplars: audio analysis tools, video-editing systems, speech synthesis programs, and 3D modeling and animation packages. We also obtained feedback from animation professionals about the functionality they would like. Our application includes graphical editors for each type of Component) comprising at least one of: a graphical representation of the target audio signal (see Berger, Fig. 2 & Fig. 6, pg. 81, On the other hand, a categorical analysis (see Figure 2) provides a semantic description of speech events; Fig. 2 depicts the graphical representation along with Fig. 6); or a graphical representation of dialog in the target audio signal (see Berger, Fig. 2 & Fig. 6, pg. 81, On the other hand, a categorical analysis (see Figure 2) provides a semantic description of speech events; Fig. 2 depicts the graphical representation of the utterance or dialog along with Fig. 6).
Regarding claim 19, the claim is directed to a system corresponding to the method presented in claim 15 and is rejected under the same grounds stated above regarding claim 15.
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US Patent 10,699,705) in view of Theobald et al. (US Patent Application Publication 2017/0154457) further in view of Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh, “Visemenet: audio-driven animator-centric speech animation” ACM Trans. Graph. 37, 4, Article 161 (August 2018), where Li has been cited in the IDS submitted on 12/20/2021 as US Patent Application Publication 2019/0392853.

Regarding claim 10, Li in view of Theobald teaches the method of claim 9, but fails to teach wherein calculating the probability of mapping at least one viseme to the phoneme comprises weighting visually distinctive visemes more heavily than other visemes. However, Zhou teaches wherein calculating the probability of mapping at least one viseme to the phoneme comprises weighting visually distinctive visemes more heavily than other visemes (see Zhou, phoneme group probability and Viseme prediction in Fig. 3, pg. 161:3, section 3, Viseme prediction. The last part of our network, the “viseme stage” in Figure 3 (right-box), combines the intermediate predictions of phoneme groups, jaw and lip parameters, as well as the audio signal itself to produce visemes. By training our architecture on a combination of data sources containing audio, 2D video, and 3D animation of human speech, we are able to predict visemes accurately. We represent visemes based on the JALI model [Edwards et al. 2016], comprising a set of intensity values for 20 visemes and 9 co-articulation rules, and JAW and LIP parameters that capture; the intensity values are interpreted as weighting visually distinctive visemes more heavily than other visemes).
Li, Theobald and Zhou are considered to be analogous to the claimed invention because they relate to correlation of speech expressiveness with lip movement. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li and Theobald on automation solution to detect visemes based on speech with the deep learning based teachings of Zhou to improve the efficiency of producing realistic outputs for the animation (see Zhou, pg. 161:2, col 1 lines 36-46).
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (US Patent 10,699,705) in view of Theobald et al. (US Patent Application Publication 2017/0154457) further in view of Thangthai, K., Bear, H. L., & Harvey, R. (2018) “Comparing phonemes and visemes with DNN-based lipreading.” arXiv preprint arXiv:1805.02924.
	Regarding claim 14, Li in view of Theobald teach the method of claim 12. Li further teaches wherein testing the machine-learning algorithm comprises: standardizing the set of test audio signals (see Li, Col 9 lines 1-10 at block 601, process 600 involves determining a feature vector for each sample of the respective audio sequence of each set of training data. For example, training data 130a includes audio samples. In that case, the viseme-generation application 102 determines, for a window of audio samples, feature vector 115 in a substantially similar manner as described with respect to block 302 in process 300. As discussed with respect to FIGS. 3 and 4, feature vector 115 can include one or more of MFCC component 402, energy component 403, MFCC derivatives 404, and energy level derivative 405); applying the machine-learning algorithm to the standardized set of test audio signals (see Li, col 9 lines 17-18 At block 603, process 600 involves receiving, from the viseme prediction model, a predicted viseme).
However, Li in view of Theobald fails to teach comparing an accuracy of mapping visemes to phonemes of the set of test audio signals by the machine-learning algorithm with an accuracy of at least one alternate machine-learning algorithm; and selecting an accurate machine-learning algorithm based on the comparison. However, Thangthai teaches comparing an accuracy of mapping visemes to phonemes of the set of test audio signals by the machine-learning algorithm with an accuracy of at least one alternate machine-learning algorithm (see Thangthai, pg. 4, section 2.3, Our DNN-HMM visual speech model training involves all five successive stages. Here, we detail the development of the visual speech model that we employ in this work including all steps and parameters); and selecting an accurate machine-learning algorithm based on the comparison (see Thangthai, pg. 7, section 5.2 & section 5.3, Table 4 shows the word and phoneme accuracies achieved with our phoneme-based lipreading system. This system achieved the most accurate lipreading with a word accuracy of 48.74%. It is interesting that with the phoneme recogniser, word accuracy is greater than phoneme accuracy, because in the viseme recogniser, this is vice versa. Again, highest accuracy is achieved with Eigenlip features rather than DCT. One interesting observation apparent in Tables 3 and 4 is that the introduction of the DNN makes little difference to the unit accuracy but a bigger difference to a word accuracy for both DCT and Eigenlip features).
Li, Theobald and Thangthai are considered to be analogous to the claimed invention because they relate to text-to-speech animation and synchronization techniques. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the teachings of Li and Theobald on a deep learning solution to detect visemes based on speech with the phoneme classifier teachings of Thangthai to improve lip-reading accuracy (see Thangthai, pg. 9, section 6).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Agnoli et al. (US Patent Application Publication 2013/0073961) provide a media editing application for assigning roles to media content (see Agnoli, Fig. 6).
Goldenberg et al. (US Patent Application Publication 2013/0132835) teach an interactive tool between a 3D animation and a corresponding script which includes: displaying a user interface that includes at least a 3D animation area and a script area, the 3D animation area including (i) a 3D view area for creating and playing a 3D animation and (ii) a timeline area for visualizing actions by one or more 3D animation characters, the script area comprising one or more objects representing lines from a script having one or more script characters (see Goldenberg, abstract, Figs. 1A-D).
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NANDINI SUBRAMANI whose telephone number is (571)272-3916. The examiner can normally be reached Monday - Friday, 2:00 pm - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh M. Mehta, can be reached at (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/NANDINI SUBRAMANI/
Examiner, Art Unit 2656
/EDGAR X GUERRA-ERAZO/
Primary Examiner, Art Unit 2656